met4citizen / TalkingHead

Talking Head (3D): A JavaScript class for real-time lip-sync using Ready Player Me full-body 3D avatars.
MIT License
349 stars 107 forks

Italian Lipsync #54

Open lupettohf opened 3 months ago

lupettohf commented 3 months ago

Italian Lipsync preprocessor :D

met4citizen commented 3 months ago

Great. Thank you for sharing!

I tried the class in my test environment, but I couldn't get it to work. It always ended up in an infinite loop. I also tried calling the methods directly, for example, wordsToVisemes("Cappello"), and it resulted in an infinite loop as well.

I think the reason for this is that for each letter, the last rule should be the letter itself and the most common viseme. For example, for "A", the last rule should be "[A]=aa" and so on. Without this default rule, the process can start to repeat itself, resulting in an infinite loop. Since the rules are applied in order, exceptions to the most common visemes should be defined first.
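A quick way to spot letters that lack such a default is a small check like the one below. This is only a sketch: it assumes the rules are kept in an object keyed by letter, each holding an ordered array of rule strings in the "[A]=aa" notation used here, and `findMissingDefaults` is a hypothetical helper name, not part of the library.

```javascript
// Hypothetical helper (not part of TalkingHead): lists the letters whose
// final rule is not a context-free default such as "[A]=aa".
// Assumption: rules are stored as { LETTER: [ruleString, ...] } in match order.
function findMissingDefaults(rulesByLetter) {
  const missing = [];
  for (const [letter, rules] of Object.entries(rulesByLetter)) {
    const last = rules.length > 0 ? rules[rules.length - 1] : '';
    // A default rule has nothing outside the brackets on its left-hand side.
    const isDefault = last.replace(/\s+/g, '').startsWith(`[${letter}]=`);
    if (!isDefault) missing.push(letter);
  }
  return missing;
}

// Illustrative data only: "C" has exceptions but no catch-all rule at the end.
const sampleRules = {
  A: ['[A]=aa'],
  C: ['[C]E=CH', '[C]I=CH'],
};
console.log(findMissingDefaults(sampleRules)); // [ 'C' ]
```

Any letter reported by this check can reach a point where no rule matches, which is where the conversion can stop making progress.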

As this problem occurs almost every time, I encourage you to ensure that in your setup, you are actually using this lip-sync module and not one of the existing modules.

Additionally, in the rules, there are now a lot of viseme names such as bb, v, j etc. that are not valid Oculus OVR viseme codes. Please check the valid codes in README's Appendix C.

I don't speak Italian, and I know very little about Italian phonology, but if you have time to work on these issues, I can help you test the module. Italian has a largely phonemic orthography, and in that regard, it resembles Finnish.

lupettohf commented 3 months ago

Thanks for the clarification. Indeed, I was using the default en processor. Here is a test with the new one:

Italian (it) preprocessor:

[Image: cappello-it]

met4citizen commented 3 months ago

That looks promising, and I also got the class working in my test environment. Here is a short (unlisted) video clip I recorded: https://youtu.be/fw17X7cmvx8

The lip-sync accuracy is not too bad, but the rules still need some adjustment. This is often the trickiest part. The rules don't have to be perfect, of course, just enough to maintain the illusion.

First, you should check that all the right-hand-side visemes in your rules are indeed valid Oculus visemes, which are: 'aa', 'E', 'I', 'O', 'U', 'PP', 'SS', 'TH', 'CH', 'FF', 'kk', 'nn', 'RR', 'DD', 'sil'.
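This check is easy to automate. The sketch below assumes each rule is a string whose visemes follow the "=" sign, separated by spaces, as in the examples in this thread; `findInvalidVisemes` is a hypothetical helper, not a library function.

```javascript
// Valid Oculus viseme codes, copied from the list above (case-sensitive).
const VALID_VISEMES = new Set([
  'aa', 'E', 'I', 'O', 'U', 'PP', 'SS', 'TH', 'CH', 'FF', 'kk', 'nn', 'RR', 'DD', 'sil',
]);

// Hypothetical helper: returns every rule whose right-hand side contains
// a code that is not in VALID_VISEMES.
function findInvalidVisemes(ruleStrings) {
  const problems = [];
  for (const rule of ruleStrings) {
    const eq = rule.indexOf('=');
    if (eq === -1) {
      problems.push({ rule, invalid: ['<missing "=">'] });
      continue;
    }
    // An empty right-hand side (a silent letter) yields no codes and passes.
    const codes = rule.slice(eq + 1).trim().split(/\s+/).filter(Boolean);
    const invalid = codes.filter((code) => !VALID_VISEMES.has(code));
    if (invalid.length > 0) problems.push({ rule, invalid });
  }
  return problems;
}

// Illustrative rules only; "bb" is one of the invalid names mentioned earlier.
console.log(findInvalidVisemes(['[B]=bb', '[A]=aa', '[C]A=kk aa']));
// [ { rule: '[B]=bb', invalid: [ 'bb' ] } ]
```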

One issue that I noticed in your screenshot is the duration your rule set gives to the viseme aa. It seems too long. The most likely reason is the rule "[C]A=kk aa". This is what probably happens: At first, the pointer is at the first letter C (Cappello, with the capital letter indicating where the pointer is). As the word matches the rule's left-hand side (C followed by A), only the part inside the square brackets (C) is consumed and replaced with the right-hand side visemes (kk aa), and the pointer moves to the remaining part (cAppello). During the next iteration, the A is converted to viseme aa. This means that CA actually becomes kk aa aa. Double visemes get combined in the code, and the result is one long viseme aa, which is, I believe, not right. The same issue applies to some other rules, too. If you review those cases, you can remove a lot of the rules and improve the lip-sync accuracy at the same time.
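The effect can be reproduced with a toy interpreter. The sketch below is not the library's implementation: it handles only literal letters as context, takes the first matching rule, and models duration as a simple count of merged units, all of which are simplifying assumptions. `parseRule` and `applyRules` are hypothetical names.

```javascript
// Parse "left[match]right=visemes" into its parts (literal contexts only).
function parseRule(text) {
  const m = /^([^\[\]=]*)\[([^\]]+)\]([^=]*)=(.*)$/.exec(text);
  if (!m) throw new Error(`Cannot parse rule: ${text}`);
  const [, left, match, right, rhs] = m;
  return { left, match, right, visemes: rhs.trim().split(/\s+/).filter(Boolean) };
}

// Apply the first matching rule at each pointer position.
function applyRules(word, ruleTexts) {
  const rules = ruleTexts.map(parseRule);
  const s = word.toUpperCase();
  const raw = [];
  let i = 0;
  while (i < s.length) {
    const rule = rules.find((r) =>
      s.startsWith(r.match, i) &&
      s.slice(0, i).endsWith(r.left) &&
      s.startsWith(r.right, i + r.match.length));
    if (!rule) throw new Error(`No rule matches at "${s.slice(i)}"`);
    raw.push(...rule.visemes);
    i += rule.match.length; // only the bracketed part is consumed
  }
  // Combine consecutive duplicates, counting how many units were merged.
  const merged = [];
  for (const v of raw) {
    const last = merged[merged.length - 1];
    if (last && last.viseme === v) last.units += 1;
    else merged.push({ viseme: v, units: 1 });
  }
  return { raw, merged };
}

// With the context rule, "A" is emitted twice and merges into one long aa.
console.log(applyRules('CA', ['[C]A=kk aa', '[C]=kk', '[A]=aa']).raw); // [ 'kk', 'aa', 'aa' ]
// Without it, each letter contributes exactly one viseme.
console.log(applyRules('CA', ['[C]=kk', '[A]=aa']).raw); // [ 'kk', 'aa' ]
```

In this model, a rule's right-hand side should cover only the letters inside the brackets, since the context letters get their own turn.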

The rule logic might seem a bit confusing at first, but you can always refer to the original 1976 US Navy paper.

A good way to verify your rule set is to download some open-source Italian phoneme dictionary, convert its phonemes to visemes, and then run each word through your class and compare the results. This would give you a percentage indicating how accurate your rule set is.
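A rough sketch of that comparison is below. Everything in it is an assumption for illustration: the dictionary entry shape (`word` plus a `phonemes` array), the small phoneme-to-viseme table, and the `wordToVisemes` callback, which stands in for whatever function returns your class's viseme codes for a word.

```javascript
// Illustrative phoneme-to-viseme table (an assumption; extend it to match
// the phoneme symbols of the dictionary you actually use).
const PHONEME_TO_VISEME = {
  a: 'aa', e: 'E', i: 'I', o: 'O', u: 'U',
  p: 'PP', b: 'PP', m: 'PP',
  k: 'kk', g: 'kk',
  t: 'DD', d: 'DD',
  n: 'nn', l: 'nn',
  f: 'FF', v: 'FF',
  s: 'SS', z: 'SS',
  r: 'RR',
};

// Collapse consecutive duplicates, e.g. PP PP -> PP.
function mergeRepeats(visemes) {
  return visemes.filter((v, i) => i === 0 || v !== visemes[i - 1]);
}

// Convert a dictionary phoneme sequence into the reference viseme sequence.
function phonemesToVisemes(phonemes) {
  return mergeRepeats(phonemes.map((p) => {
    const v = PHONEME_TO_VISEME[p];
    if (v === undefined) throw new Error(`No viseme mapped for phoneme "${p}"`);
    return v;
  }));
}

// Compare the rule set's output with the reference for every entry and
// return the percentage of exact matches plus the mismatching words.
function ruleSetAccuracy(entries, wordToVisemes) {
  const mismatches = [];
  for (const { word, phonemes } of entries) {
    const expected = phonemesToVisemes(phonemes);
    const actual = mergeRepeats(wordToVisemes(word));
    if (expected.join(' ') !== actual.join(' ')) {
      mismatches.push({ word, expected, actual });
    }
  }
  const total = entries.length;
  const percent = total === 0 ? 0 : (100 * (total - mismatches.length)) / total;
  return { percent, mismatches };
}
```

The list of mismatches is usually more useful than the percentage itself, because it points to the rules that need adjusting. Exact-match scoring is strict; a per-viseme edit distance would be a gentler alternative.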

lupettohf commented 3 months ago

Yesterday I did some live tests and they were acceptable, but yes, some words work better than others. I took a look at the demo you made, and the lip-sync in that case was almost spot on. I'll keep working on it.