thuhcsi / NeuFA

Neural network-based forced alignment with bidirectional attention mechanism

phone boundary between continuous vowels #5

Open Liujingxiu23 opened 2 years ago

Liujingxiu23 commented 2 years ago

@petronny Hi, I have successfully trained the model on a Chinese dataset. But I have met a problem: the boundaries between continuous vowels are not as accurate as those of other phones. For example, in "我安心的点点头", the phone boundary between "我" and "安", i.e. between "o3" and "an1", is wrong. This kind of problem happens frequently.

For syllables like "yun1" (云), I can split it into "y vn1", where "y" has a certain duration; for "wu2" (无), I can split it into "w u2", where "w" has a duration. But for some vowels, for example "安/an" and "阿/a", there is really no initial consonant at all.

Have you run into problems like this? If so, how did you solve them?

petronny commented 2 years ago

Well, it wouldn't surprise me if NeuFA (or any other FA model) predicts some insane boundaries.

As the paper says, the 50 ms tolerance accuracy of NeuFA is 95% at the word level. That seems high, but consider a sentence with 20 phonemes: the probability that at least one phoneme has a predicted boundary more than 50 ms away from the ground truth is 1 - 0.95^20 = 64.15%. Similarly, the probability that at least one boundary is more than 100 ms off is 1 - 0.98^20 = 33.24%.
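For reference, a minimal sketch of this arithmetic, assuming each boundary error is independent with per-boundary tolerance accuracy p:

```python
# Probability that at least one of n phoneme boundaries exceeds the
# tolerance, assuming independent errors with per-boundary accuracy p.
def at_least_one_error(p: float, n: int = 20) -> float:
    return 1 - p ** n

print(at_least_one_error(0.95))  # ~0.6415 (50 ms tolerance)
print(at_least_one_error(0.98))  # ~0.3324 (100 ms tolerance)
```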

Also, NeuFA currently doesn't restrict the predicted boundaries to be non-overlapping (we are working on this in NeuFA 2), which makes the situation even worse.

So in my opinion, NeuFA is not ready for production environments yet. But it can be used as a "soft" FA model that extracts attention weights between the text and speech to map information between them. This is exactly why we proposed NeuFA and how we use it in our other research.

Hope this will answer your question.

Liujingxiu23 commented 2 years ago

@petronny Thank you for your reply!

  1. "nonoverlapping" and "fixed thred=0.5" make boundaries not very clear, and the results are hard to use even though most of the results are really good.
  2. Can you share the code "extracts the attention weights between the text and speech to map the information between them"?
petronny commented 2 years ago

the results are hard to use even though most of the results are really good.

I agree with that. We are working on the non-overlapping issue.

Can you share the code to "extract the attention weights between the text and speech to map the information between them"?

See https://github.com/thuhcsi/NeuFA/blob/master/inference.py#L112; I mainly use the attention weights from the ASR direction.
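For illustration, here is a hedged sketch of reading such attention weights (hypothetical names and shapes, not NeuFA's actual inference.py code): with an ASR-direction attention matrix of shape (num_phones, num_frames), the frames where each phone dominates give a soft text-to-speech mapping.

```python
# Hypothetical sketch: w_asr stands in for the ASR-direction attention
# weights; NeuFA's real tensors may be shaped or named differently.
import numpy as np

num_phones, num_frames = 5, 100
w_asr = np.random.rand(num_phones, num_frames)  # placeholder weights
w_asr /= w_asr.sum(axis=0, keepdims=True)       # normalize over phones per frame

phone_per_frame = w_asr.argmax(axis=0)          # dominant phone at each frame
# Frame spans per phone; these may overlap or leave gaps, since NeuFA 1
# does not enforce monotonic, non-overlapping boundaries.
spans = {p: np.flatnonzero(phone_per_frame == p) for p in range(num_phones)}
```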

Liujingxiu23 commented 2 years ago

Get it! Thank you again! @petronny

Liujingxiu23 commented 2 years ago

I tried w_tts and w_asr at the phone level, but the results were both bad, since the result for the first phone ("silence") of each sentence differs a lot from the ground truth; I do not know why. Then I tried weight = boundary_left - boundary_right for each phone (the weight values are about 1 in the middle of the phone and about 0 at its borders) and used the functions in https://github.com/as-ideas/DeepForcedAligner/blob/main/dfa/duration_extraction.py to extract durations. With that, I can get a continuous, non-overlapping alignment.
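A hedged sketch of this recipe (extract_durations below is an illustrative stand-in written from scratch, not the actual DeepForcedAligner code, which uses a Dijkstra-style search): build a (num_phones, num_frames) weight matrix from the boundary predictions, then extract one monotonic, non-overlapping duration per phone with dynamic programming.

```python
import numpy as np

def extract_durations(weights: np.ndarray) -> np.ndarray:
    """weights: (P, T) matrix, ~1 inside each phone, ~0 at its borders.
    Returns one duration per phone, summing to T, from a monotonic
    dynamic-programming alignment (no overlaps, no gaps)."""
    P, T = weights.shape
    log_w = np.log(weights + 1e-8)
    dp = np.full((P, T), -np.inf)
    back = np.zeros((P, T), dtype=np.int64)
    dp[0, 0] = log_w[0, 0]
    for t in range(1, T):
        for p in range(min(P, t + 1)):
            stay = dp[p, t - 1]                            # same phone continues
            move = dp[p - 1, t - 1] if p > 0 else -np.inf  # next phone starts
            if move > stay:
                dp[p, t] = move + log_w[p, t]
                back[p, t] = 1
            else:
                dp[p, t] = stay + log_w[p, t]
                back[p, t] = 0
    # Backtrack: count the frames assigned to each phone.
    durations = np.zeros(P, dtype=np.int64)
    p = P - 1
    for t in range(T - 1, -1, -1):
        durations[p] += 1
        if back[p, t] == 1:
            p -= 1
    return durations

# Example: 3 phones over 12 frames with plateau-shaped weights.
w = np.full((3, 12), 1e-3)
w[0, 0:4] = w[1, 4:8] = w[2, 8:12] = 1.0
print(extract_durations(w))  # -> [4 4 4]
```

The cumulative sum of the durations then gives continuous, non-overlapping phone boundaries in frames.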

panxin801 commented 2 years ago

Well, in fact I met a similar problem. In my experiments the alignment is not even monotonic, which means the end time of a word can be earlier than its start time. I think this keeps this great work from fitting real-world scenarios for now. Your idea may work, I think. Thank you.

panxin801 commented 2 years ago

The bad case looks like this:

intervals [18]:
    xmin = 7.36
    xmax = 7.48
    text = "the"
intervals [19]:
    xmin = 7.48
    xmax = 7.26 # watch here
    text = "assassination"
intervals [20]:
    xmin = 7.71
    xmax = 7.93
    text = "of"
intervals [21]:
    xmin = 7.95
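As a small hypothetical check for this failure mode (interval values copied from the excerpt above), one can flag any interval whose xmax precedes its xmin or that starts before the previous interval ends:

```python
# Flag non-monotonic intervals in a word-level alignment.
intervals = [(7.36, 7.48, "the"),
             (7.48, 7.26, "assassination"),  # the bad case: xmax < xmin
             (7.71, 7.93, "of")]

prev_end = 0.0
for xmin, xmax, text in intervals:
    if xmax < xmin or xmin < prev_end:
        print(f'non-monotonic interval: "{text}" ({xmin}, {xmax})')
    prev_end = max(prev_end, xmax)
```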
Liujingxiu23 commented 2 years ago

@panxin801 Maybe you can try weight/attn = boundary_left - boundary_right as a test. @petronny I also found that, in overall statistics like those in the paper, NeuFA is much better than MFA in my experiments on a Chinese dataset. But in some cases the phone boundaries deviate very far from the ground truth. Such very large errors, for example more than 5 frames, happen more often than with MFA.

panxin801 commented 2 years ago

@Liujingxiu23 Yeah, I reached the same conclusion: the Chinese results are better than the English ones on average. And thank you for your advice.