jasonppy / VoiceCraft

Zero-Shot Speech Editing and Text-to-Speech in the Wild

Add multi-edit capabilities to Speech Editing #94

Open pgosar opened 7 months ago

pgosar commented 7 months ago

This pull request implements a heavily modified edit distance algorithm to handle multiple edits at the same time. It also removes the need for the user to specify the edit type(s); everything is handled automatically.
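As a rough illustration of deriving the edits automatically from the two transcripts, here is a minimal sketch that uses Python's standard `difflib` rather than the PR's own modified edit distance. The helper name `word_edits` is hypothetical and not part of VoiceCraft.

```python
from difflib import SequenceMatcher


def word_edits(original: str, target: str):
    """Return the non-matching opcodes between two transcripts.

    Each entry is (op, orig_start, orig_end, new_start, new_end), where op is
    "replace", "delete", or "insert" and the indices are word positions with
    exclusive ends. This is a stand-in for the PR's algorithm, not a copy of it.
    """
    orig_words = original.split()
    new_words = target.split()
    matcher = SequenceMatcher(a=orig_words, b=new_words, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


edits = word_edits(
    "But when I had approached so near",
    "But had I approached so near",
)
for op, i1, i2, j1, j2 in edits:
    print(op, (i1, i2), (j1, j2))
# replace (1, 2) (1, 2)
# delete (3, 4) (3, 3)
```

The caller never names an edit type; the substitution and the deletion both fall out of the word-level diff.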

Known issues:

  1. Like the previous implementation, edits to the last index of the input sentence do not work. This looks like an issue of the model's inference, as in both my and the original implementation these changes are simply not recognized.

  2. Furthermore, multiple edit types cannot happen at the same time. For example, mixing and matching substitutions with insertions crashes during inference. This is again something I still need to look into. Is this a limitation of the model itself?

I'd appreciate some help testing any other edge cases in the speech editing jupyter notebook if anyone is interested - I believe I have them all covered but more testing can't hurt :)

I will update the Google Colab for speech editing once this is merged.

jasonppy commented 7 months ago

Thanks! Really helpful contribution!

  1. Can't edit the last index of the input utterance: yes, in edit mode the model doesn't support that. However, editing a span that contains the last index is basically zero-shot TTS, so TTS mode supports that natively. We can simply flag an error when a user tries to edit the last index and encourage them to use TTS mode.
  2. Multiple edits cannot happen at the same time: do you mean that calling inference_one_sample crashes if mask_interval contains multiple entries, e.g. [[15,94], [142,309]]? That shouldn't happen as long as the spans are not overlapping.

pgosar commented 7 months ago

I meant as in multiple types of edits. If I try to do a deletion and a substitution at the same time, for example:

original: "But when I had approached so near"
new: "But had I approached so near"
(substitute when->had, delete the had in the original)

The inference fails with the following error (I'll edit once it finishes running again)

However, if I want multiple different insertions, or deletions, or substitutions, everything just works as long as I don't mix and match. For example, new: "insertion But had I approached insertion so near" works fine, with two separate insertions.

jasonppy commented 7 months ago

I see. For this example:

original: "But when I had approached so near"
new: "But had I approached so near"
(substitute when->had, delete the had in the original)

The reason it fails is probably that I use a margin to extend the masked span. Since there is only one word, "I", between the two edited spans, the two spans end up overlapping once the margin is applied.
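To make the overlap concrete, here is a small sketch with made-up word timings in seconds. The timestamps and helper names are assumptions for illustration only; the real pipeline takes its spans from forced alignment, and its internal units may differ.

```python
def extend_spans(spans, margin):
    """Widen each (start, end) span, in seconds, by `margin` on both sides."""
    return [(max(0.0, start - margin), end + margin) for start, end in spans]


def has_overlap(spans):
    """True if any two spans overlap once sorted by start time."""
    ordered = sorted(spans)
    return any(
        prev_end > next_start
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    )


# Hypothetical timings: "when" = 0.30-0.50, "I" = 0.50-0.60, "had" = 0.60-0.80.
spans = [(0.30, 0.50), (0.60, 0.80)]

print(has_overlap(extend_spans(spans, 0.08)))  # True: 0.58 runs past 0.52
print(has_overlap(extend_spans(spans, 0.02)))  # False: 0.52 stays before 0.58
```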

pgosar commented 7 months ago

I see, I am doing more testing right now and I think you're right, supplying multiple different types of edits seems to work as long as there is a sizeable gap between them.

So doing something like this on words right next to each other can only work if the margin size is small enough? Not sure if this is something I can fix - do you have any suggestions? I can probably just throw an error instead and suggest they lower the margin, along with when editing the very last word like you mentioned.

jasonppy commented 7 months ago

Regarding the issue of spans being too close:

  1. Set a threshold, say 2 words, and if the gap between two spans is less than or equal to 2 words, merge them into one span.
  2. Margin is a hyperparameter that can be specified by the user (it defaults to 0.08 seconds). If the two spans would overlap with the specified margin, we automatically change it to a smaller value to make sure they don't overlap.
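A minimal sketch of the first approach (merging spans separated by only a few words), assuming spans are given as inclusive word-index pairs. The function name and its `max_gap_words` parameter are hypothetical.

```python
def merge_close_spans(spans, max_gap_words=2):
    """Merge inclusive word-index spans separated by at most `max_gap_words`
    untouched words. Returns a new sorted list of [start, end] pairs."""
    merged = []
    for start, end in sorted(spans):
        # Number of untouched words between the previous span and this one.
        if merged and start - merged[-1][1] - 1 <= max_gap_words:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


print(merge_close_spans([[1, 1], [3, 3]]))  # [[1, 3]]: only one word in between
print(merge_close_spans([[1, 1], [6, 7]]))  # [[1, 1], [6, 7]]: gap of 4 words
```

A merged span also covers the untouched words in the gap, so the model regenerates those words as well.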

Both approaches are sensible to me.

jasonppy commented 7 months ago

If you want to do large-scale testing, https://github.com/jasonppy/VoiceCraft/blob/master/RealEdit.txt contains 310 speech editing examples, and there are 40 2-span edit examples.

To interpret an example such as:

ah, but we'll talk about it because i kind of believe in a unity of knowledge.|ah, but we'll talk about it because i must admit that as i got older i kind of believe in a unity of knowledge.  ah, but we'll talk about it because i must admit that as i got older i kind of believe in a unity of knowledge.|ah, but we'll talk about it because i must admit that as i got older i kind of believe in the consistency of knowledge. 7,8|12,13   8,15|20,21  insertion|substitution

| is used as the separator. The example above should be interpreted as:

a|b b|c orig_start1,orig_end1|orig_start2,orig_end2 new_start1,new_end1|new_start2,new_end2 type1|type2

where a is the original transcript, b is the transcript after the first edit, and c is the target transcript. [orig_start1,orig_end1] is the word index range of the first span to mask, and so on.
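A sketch of reading one such line, assuming the fields are tab-separated and appear in the order shown above. Both are assumptions, and the actual RealEdit.txt may use a different separator or carry additional columns. The function names are hypothetical.

```python
def parse_spans(field):
    """'7,8|12,13' -> [(7, 8), (12, 13)]"""
    return [tuple(int(n) for n in span.split(",")) for span in field.split("|")]


def parse_realedit_line(line, sep="\t"):
    """Parse the five fields described in the thread (assumed layout)."""
    first, second, orig_spans, new_spans, types = line.rstrip("\n").split(sep)[:5]
    return {
        "original": first.split("|")[0],   # a
        "target": second.split("|")[-1],   # c
        "orig_spans": parse_spans(orig_spans),
        "new_spans": parse_spans(new_spans),
        "types": types.split("|"),
    }


# Toy line in the same shape as the example, not taken from the dataset.
line = "\t".join([
    "a b|a x b",
    "a x b|a y b",
    "1,1|2,2",
    "1,1|2,2",
    "insertion|substitution",
])
print(parse_realedit_line(line))
```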

pgosar commented 7 months ago

Are there any drawbacks to lowering the margin that I should be aware of? The cases where my algorithm breaks don't break if I lower it to 0.02 seconds, so this should be an easy solution. I can keep lowering the margin until the spans no longer overlap, to make sure it works in all cases. An example of such a case:

orig: But when I had
new: But I did

jasonppy commented 7 months ago

The only drawback is that the forced alignment might not be perfect, and a larger margin leaves room for such mistakes. A larger margin also ensures that the neighboring (but unchanged) words are modified, giving a smooth transition next to the changed words.

Therefore, defaulting it to 0.02 seconds wouldn't be great.

pgosar commented 7 months ago

I used the margin fix. It regenerates mask_interval as necessary with decreasing margins until no overlaps occur. The amount to decrease by is a hyperparameter, 0.01 by default.
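The retry loop described here could look roughly like the sketch below. It assumes spans in seconds and a hypothetical `fit_margin` helper; it is not the code from the PR, which regenerates `mask_interval` inside the existing inference pipeline.

```python
def fit_margin(spans, margin=0.08, step=0.01):
    """Shrink `margin` by `step` until the widened spans no longer overlap.

    Returns (margin_used, widened_spans). Raises ValueError if the spans
    overlap even with no margin at all.
    """
    ordered = sorted(spans)
    while True:
        widened = [(max(0.0, s - margin), e + margin) for s, e in ordered]
        if all(a[1] <= b[0] for a, b in zip(widened, widened[1:])):
            return margin, widened
        if margin <= 0.0:
            raise ValueError("spans overlap even without a margin")
        # Round to avoid floating-point drift across repeated subtraction.
        margin = max(0.0, round(margin - step, 6))


# Hypothetical timings: the two spans are 0.09 s apart.
margin, widened = fit_margin([(0.30, 0.50), (0.59, 0.80)])
print(margin)  # 0.04
```

Starting from the default and shrinking only when needed keeps the larger margin wherever the spans are far enough apart.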

allisonth commented 7 months ago

Hi, I'm interested in testing the multi-span editing algorithm.

pgosar commented 6 months ago

@jasonppy should be ready to merge.

The example original and target transcripts use a pretty complex set of changes, just to show what is now possible.

allisonth commented 6 months ago

The algorithm seems to work from my testing. @jasonppy For more extensive testing could I get the wav files from the RealEdit dataset? I can only find the txt file mentioned above.