Open pgosar opened 7 months ago
Thanks! Really helpful contribution!
Do you mean that in inference_one_sample it will crash if mask_interval contains multiple entries, e.g. [[15,94], [142,309]]? That shouldn't happen as long as the spans are not overlapping.
I meant multiple types of edits. If I try to do a deletion and a substitution at the same time, for example:
original: "But when I had approached so near" new: "But had I approached so near" (substitute when->had, delete the had in the original)
The inference fails with the following error (I'll edit once it finishes running again)
However, if I want multiple insertions, multiple deletions, or multiple substitutions, everything just works as long as I don't mix and match. For example, new: "insertion But had I approached insertion so near" works fine, with two separate insertions.
I see. For this example, original: "But when I had approached so near" new: "But had I approached so near" (substitute when->had, delete the had in the original), the reason it fails is probably that I used margin to extend the masked span. Since there is only one word, "I", between the two edited spans, the two spans end up overlapping once the margin is added.
I see, I am doing more testing right now and I think you're right, supplying multiple different types of edits seems to work as long as there is a sizeable gap between them.
So doing something like this on words right next to each other can only work if the margin size is small enough? Not sure if this is something I can fix - do you have any suggestions? I can probably just throw an error instead and suggest they lower the margin, along with when editing the very last word like you mentioned.
Regarding the issue of spans being too close:
Approach 1: set a threshold, say 2 words; if the gap between two spans is less than or equal to 2 words, merge them into one span.
Approach 2: margin is a hyperparameter that the user can specify (the default is 0.08 seconds); if the two spans would overlap with the specified margin, we automatically reduce it to a smaller value so that they don't overlap.
Both approaches are sensible to me.
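A minimal sketch of approach 1, in case it helps. It assumes spans are inclusive [start, end] word indices; merge_close_spans and max_gap_words are hypothetical names for illustration, not anything in VoiceCraft.

```python
def merge_close_spans(spans, max_gap_words=2):
    """Merge word-index spans separated by at most max_gap_words words.

    spans: iterable of [start, end] word indices, both inclusive (assumed format).
    Returns a new, sorted list of [start, end] lists.
    """
    merged = []
    for start, end in sorted(spans):
        # Number of untouched words between the previous span and this one.
        if merged and start - merged[-1][1] - 1 <= max_gap_words:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged
```

For "But when I had approached so near", the edited words "when" (index 1) and "had" (index 3) are separated by a single word, so with the 2-word threshold they would collapse into one span covering indices 1 to 3.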
If you want to do large-scale testing, https://github.com/jasonppy/VoiceCraft/blob/master/RealEdit.txt contains 310 speech editing examples, and there are 40 2-span edit examples.
To interpret the example:
ah, but we'll talk about it because i kind of believe in a unity of knowledge.|ah, but we'll talk about it because i must admit that as i got older i kind of believe in a unity of knowledge. ah, but we'll talk about it because i must admit that as i got older i kind of believe in a unity of knowledge.|ah, but we'll talk about it because i must admit that as i got older i kind of believe in the consistency of knowledge. 7,8|12,13 8,15|20,21 insertion|substitution
| is used as the separator symbol. The above example should be interpreted as:
a|b b|c orig_start1,orig_end1|orig_start2,orig_end2 new_start1,new_end1|new_start2,new_end2
where a is the original transcript and c is the target transcript. [orig_start1,orig_end1] gives the word indices of the first span to mask, etc.
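A rough sketch of parsing one such 2-span entry. It assumes the entry has exactly the five fields shown in the example and that they are tab-separated (the separator is not visible in this thread, so sep is a guess). It also assumes the b in a|b equals the b in b|c, as in the example. parse_spans and parse_two_span_line are hypothetical helpers, not part of the repository.

```python
def parse_spans(field):
    """Turn a field such as '7,8|12,13' into [(7, 8), (12, 13)]."""
    return [tuple(int(n) for n in span.split(",")) for span in field.split("|")]


def parse_two_span_line(line, sep="\t"):
    """Split one 2-span entry into transcripts, spans and edit types.

    sep is an assumption: adjust it if the file uses a different delimiter
    or carries extra columns.
    """
    ab, bc, orig_spans, new_spans, edit_types = line.rstrip("\n").split(sep)
    a, b = ab.split("|")
    b_again, c = bc.split("|")
    if b != b_again:
        raise ValueError("intermediate transcripts in 'a|b' and 'b|c' differ")
    return {
        "original": a,          # transcript before any edit
        "intermediate": b,      # transcript after the first edit
        "target": c,            # transcript after both edits
        "orig_spans": parse_spans(orig_spans),
        "new_spans": parse_spans(new_spans),
        "edit_types": edit_types.split("|"),
    }
```

Whether the span indices are inclusive or zero-based is not stated above, so the sketch returns them untouched.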
Are there any drawbacks to lowering the margin I should be aware of? The cases where my algorithm breaks don't break if I lower it to 0.02 seconds, so this should be an easy solution. I can keep lowering the margin until the spans align properly to make sure it works in all cases.
orig: But when I had new: But I did
The only drawback is that the forced alignment might not be perfect, and a larger margin leaves room for such mistakes. A larger margin also ensures that the neighboring (unchanged) words are modified, which gives a smooth transition next to the changed words.
Therefore, defaulting it to 0.02 seconds wouldn't be great.
I used the margin fix. It regenerates mask_interval as necessary with decreasing margins until no overlaps remain. The amount to decrease by is a hyperparameter, 0.01 by default.
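Roughly, the idea can be sketched as below. This is not the code from the PR: pad_spans, has_overlap and fit_margin are hypothetical helpers, spans are assumed to be (start, end) times in seconds, touching spans are treated as overlapping, and the 0.08 and 0.01 defaults are simply the values mentioned in this thread.

```python
def pad_spans(spans_sec, margin, audio_dur):
    """Extend each (start, end) span by margin on both sides, clipped to the audio."""
    return [(max(s - margin, 0.0), min(e + margin, audio_dur)) for s, e in spans_sec]


def has_overlap(spans):
    """True if any two neighbouring spans overlap or touch."""
    ordered = sorted(spans)
    return any(prev[1] >= nxt[0] for prev, nxt in zip(ordered, ordered[1:]))


def fit_margin(spans_sec, audio_dur, margin=0.08, step=0.01):
    """Shrink margin by step until the padded spans no longer overlap.

    Returns (margin_used, padded_spans). Raises if the spans overlap even
    with no margin at all.
    """
    while margin > 0:
        padded = pad_spans(spans_sec, margin, audio_dur)
        if not has_overlap(padded):
            return margin, padded
        # round() keeps repeated subtraction from accumulating float error.
        margin = round(margin - step, 6)
    padded = pad_spans(spans_sec, 0.0, audio_dur)
    if has_overlap(padded):
        raise ValueError("spans overlap even with zero margin")
    return 0.0, padded
```

The largest margin that still works is kept, which fits the point above that a larger margin gives the forced alignment more slack.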
Hi, I'm interested in testing the multi-span editing algorithm.
@jasonppy should be ready to merge.
The example original and target transcripts use a pretty complex set of changes just to show what is now possible.
The algorithm seems to work from my testing. @jasonppy For more extensive testing could I get the wav files from the RealEdit dataset? I can only find the txt file mentioned above.
This pull request implements a heavily modified edit distance algorithm to handle doing multiple edits at the same time. It also gets rid of the need for the user to specify the edit type(s), everything is handled automatically.
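To illustrate what detecting the edit types automatically can look like, here is a small sketch built on the standard library's difflib. It is not the modified edit distance algorithm from this pull request, only a stand-in that produces the same kind of output; word_edits and TAG_TO_EDIT are hypothetical names.

```python
from difflib import SequenceMatcher

# Map difflib opcode tags to the edit names used in this thread.
TAG_TO_EDIT = {"replace": "substitution", "delete": "deletion", "insert": "insertion"}


def word_edits(orig, new):
    """List word-level edits needed to turn orig into new.

    Each item is (edit_type, (orig_start, orig_end), (new_start, new_end)),
    where the index pairs are half-open word ranges, as in Python slices.
    """
    orig_words, new_words = orig.split(), new.split()
    matcher = SequenceMatcher(a=orig_words, b=new_words, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.append((TAG_TO_EDIT[tag], (i1, i2), (j1, j2)))
    return edits
```

For the pair "But when I had approached so near" / "But had I approached so near" it reports one substitution (when -> had) followed by one deletion (the original "had"), which is the mixed case discussed in this thread.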
Known issues:
Like the previous implementation, edits to the last index of the input sentence do not work. This looks like an issue of the model's inference, as in both my and the original implementation these changes are simply not recognized.
Furthermore, multiple edit types cannot happen at the same time. For example, mixing and matching substitutions with insertions crashes in inference. This is something I still need to look into. Is this a limitation of the model itself?
I'd appreciate some help testing any other edge cases in the speech editing jupyter notebook if anyone is interested - I believe I have them all covered but more testing can't hurt :)
I will update the Google Colab for speech editing once this is merged.