morevnaproject-org / papagayo-ng

Papagayo is a lip-syncing program designed to help you line up phonemes (mouth shapes) with the actual recorded sound of actors speaking. Papagayo makes it easy to lip sync animated characters by making the process very simple - just type in the words being spoken (or copy/paste them from the animation's script), then drag the words on top of the sound's waveform until they line up with the proper sounds.

Feature Request - Speech / Phonetics automatic generation / alignment #49

Open merlin2v opened 6 years ago

merlin2v commented 6 years ago

I've been wondering why the text file has to be used. Couldn't you segment the sound by phonetics instead? That would transcribe things more accurately than the text alone. Take the following example:

I do like

This could be said as:

adɪ̈ lik _(IPA)_

vs. someone using a different pronunciation:

ai dɵ lik _(IPA)_

Both of these end up using different mouth movements, so relying on the text alone can leave some of the mouth shapes off.

morevnaproject commented 6 years ago

I was thinking about that too, and started to investigate. This is what I found - https://cmusphinx.github.io/wiki/phonemerecognition/

Frequently, people want to use Sphinx to do phoneme recognition. In other words, they would like to convert speech to a stream of phonemes rather than words. This is possible, although the results can be disappointing. The reason is that automatic speech recognition relies heavily on contextual constraints (i.e. language modeling) to guide the search algorithm.

For now, I think integrating with RhubarbLipSync (#44) is the way to go.

morevnaproject commented 5 years ago

We've just merged the Rhubarb feature - #50 ^__^

Hunanbean commented 3 years ago

Montreal Forced Aligner may be something to look into, but that would be more for automatic alignment from the text, rather than the full shebang.

steveway commented 3 years ago

As mentioned, we currently have Rhubarb integrated. But I just found an interesting project for this called Allosaurus. It seems pretty easy to use; here on Windows 10 it was very easy to install with pip. The only problems are that it outputs IPA phonemes and that it does not provide any timestamps (yet). We should be able to create a mapping from those phonemes to the ones we already support. And for the timestamps there is already an open issue: https://github.com/xinjli/allosaurus/issues/24 But even without timestamps it might already be usable with some conversion of the phonemes.
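For reference, this is roughly how Allosaurus is driven from Python (following its README; the filename is just a placeholder):

```python
# Minimal sketch of running Allosaurus on a WAV file (per the Allosaurus README).
# pip install allosaurus
from allosaurus.app import read_recognizer

model = read_recognizer()            # loads the default universal phone model
ipa = model.recognize("sample.wav")  # returns a space-separated string of IPA phones
print(ipa)                           # e.g. "a ɪ d ə l aɪ k" (no timestamps in this mode)
```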

steveway commented 3 years ago

Here is a very simple conversion dict from IPA to CMU: { "b": "B", "ʧ": "CH", "d": "D", "ð": "DH", "f": "F", "g": "G", "h": "HH", "ʤ": "JH", "k": "K", "l": "L", "m": "M", "n": "N", "ŋ": "NG", "p": "P", "r": "R", "s": "S", "ʃ": "SH", "t": "T", "θ": "TH", "v": "V", "w": "W", "j": "Y", "z": "Z", "ʒ": "ZH", "ɑ": "AA2", "æ": "AE2", "ə": "AH0", "ʌ": "AH2", "ɔ": "AO2", "ɛ": "EH2", "ɚ": "ER0", "ɝ": "ER2", "ɪ": "IH2", "i": "IY2", "ʊ": "UH2", "u": "UW2", "aʊ": "AW2", "aɪ": "AY2", "eɪ": "EY2", "oʊ": "OW2", "ɔɪ": "OY2" } It's based on this mapping from CMU to IPA. https://github.com/margonaut/CMU-to-IPA-Converter/blob/master/cmu_ipa_mapping.rb
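A rough sketch of how such a mapping could be applied to Allosaurus' space-separated IPA output (the dict is trimmed here for brevity; unknown symbols are passed through for manual review):

```python
# Sketch: map a space-separated IPA phone string (as Allosaurus emits it)
# to CMU-style phonemes using the dict above (trimmed here for brevity).
IPA_TO_CMU = {
    "aɪ": "AY2", "aʊ": "AW2", "eɪ": "EY2", "oʊ": "OW2", "ɔɪ": "OY2",  # diphthongs
    "b": "B", "d": "D", "l": "L", "k": "K", "i": "IY2", "ɪ": "IH2", "ə": "AH0",
}

def ipa_to_cmu(ipa_string):
    cmu = []
    for phone in ipa_string.split():
        # Allosaurus already separates phones with spaces, so a direct lookup
        # usually suffices; unknown symbols are kept as-is for manual review.
        cmu.append(IPA_TO_CMU.get(phone, phone))
    return cmu

print(ipa_to_cmu("a ɪ d ə l aɪ k"))  # ['a', 'IH2', 'D', 'AH0', 'L', 'AY2', 'K']
```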

Hunanbean commented 3 years ago

I must have underestimated what Rhubarb actually does. I will take a look at it now. In the CMU phoneme set I did, I purposely simplified it to remove the stress variants, so AO1 and AO2 both become just AO. I am pretty sure I still have the full setup from before I truncated it, if you want me to post that on my git. But given the imperceptible differences between AO, AO1 and AO2 in action, perhaps it makes more sense to just have the conversion dictionary truncate to the existing set of 39.

steveway commented 3 years ago

Yes, Rhubarb is quite nice; it would be awesome if it could also output text besides phonemes. With our language dictionaries we could try to convert phonemes back into words for that. The results from Rhubarb are not as exact as our manual methods, but I guess for most animations it's enough. I don't think we need the untruncated CMU list. I just quickly generated the list above based on that little converter from @margonaut. If we really want to integrate Allosaurus, we should make a fitting conversion table for our phoneme list. We can use that information to create a new phoneme_set and phoneme_conversion dictionary for IPA, and we should add some code to use these to convert between different phoneme sets - that should already kind of be possible in a limited way.

steveway commented 3 years ago

I now have an allosaurus branch: https://github.com/steveway/papagayo-ng/tree/allosaurus This currently uses pydub to prepare the sound files for Allosaurus. It works very well with our tutorial files, even the Spanish ones, and the results seem to be better than what Rhubarb provides. Here is a quick test showing the result of running it on the lame.wav file: https://youtu.be/4hqHaEXo9xU The phonemes partially overlap, so some pruning needs to be done for animation purposes, but as you can see the results are quite good.
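A minimal sketch of the kind of preparation pydub handles here, assuming Allosaurus prefers 16 kHz mono 16-bit WAV input (the exact parameters in the branch may differ):

```python
# Sketch: normalize arbitrary audio to a 16 kHz mono 16-bit WAV before recognition.
# Requires pydub and an FFmpeg binary on the PATH for non-WAV inputs.
from pydub import AudioSegment

def prepare_for_allosaurus(in_path, out_path="prepared.wav"):
    audio = AudioSegment.from_file(in_path)
    audio = audio.set_channels(1).set_frame_rate(16000).set_sample_width(2)
    audio.export(out_path, format="wav")
    return out_path
```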

Hunanbean commented 3 years ago

That is very cool, thank you! I just ditched Windows and went back to Linux last night, so it is going to take me a little while before I can test it. Seems like now would be a good time to start making some noise on the forums. That looks like a Patreon magnet to me!


steveway commented 3 years ago

Alright, I made some more progress. Based on the time between phonemes from the automatic recognition, I chunk the phonemes together into single "words". This should make editing the results a bit easier. I also added some simple logic to convert between different phoneme sets, but for that we likely need some good hand-made phoneme conversion dicts. At the moment I convert to CMU39 first and from that to the desired set; for the best results we should create conversions between each pair of sets manually.
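A simplified sketch of that pivot idea (the tiny dicts below are illustrative only, not the real conversion tables from the phonemes directory):

```python
# Sketch of the described pivot: source set -> CMU39 -> target set.
# to_cmu39 and from_cmu39 stand in for per-set phoneme_conversion dicts;
# the real ones live in the phonemes/*.json files.
def convert_via_cmu39(phonemes, to_cmu39, from_cmu39, rest="rest"):
    converted = []
    for p in phonemes:
        cmu = to_cmu39.get(p, rest)           # unknown phonemes fall back to "rest"
        converted.append(from_cmu39.get(cmu, rest))
    return converted

# Example with tiny hand-made dicts (illustrative only):
ipa_to_cmu39 = {"b": "B", "aɪ": "AY"}
cmu39_to_preston_blair = {"B": "MBP", "AY": "AI"}
print(convert_via_cmu39(["b", "aɪ"], ipa_to_cmu39, cmu39_to_preston_blair))  # ['MBP', 'AI']
```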

I think the next step would be to add a GUI option to assign the selected phonemes/words/phrases to a different voice. That way, after the automatic recognition has done its work, we just need to separate the parts into the different voices, if there are any, and then it's pretty much done.

And for the future, some automatic speaker diarization would be awesome - that way we could automate almost everything.

steveway commented 3 years ago

Ok, it's now available. I've created a first pull request for this: https://github.com/morevnaproject-org/papagayo-ng/pull/94 The results are pretty good, and I would recommend that everyone test this. You should first download FFmpeg and the Allosaurus AI model using the actions at the top of Papagayo-NG, then restart Papagayo-NG and it should all work.

Hunanbean commented 3 years ago

It is working and very impressive! Thank you very much!

aziagiles commented 3 years ago

@steveway Everything is working except the 'Convert Phonemes' function, which doesn't work well yet. I think it still needs some fixing. I'm on Windows 10.

aziagiles commented 3 years ago

@steveway Also, for the default 'cmu_39' automatic breakdown, the breakdown goes well for the earlier parts of the input audio, but it doesn't break down the end of the audio.

steveway commented 3 years ago

Yes, the conversion needs some work, as I mentioned:

I also added some simple logic to convert between different phoneme sets, but for that we likely need some good hand-made phoneme conversion dicts. At the moment I convert to CMU39 first and from that to the desired set; for the best results we should create conversions between each pair of sets manually.

Can you share the files it does not break down all the way? The test files worked pretty well. The automatic breakdown is done by https://github.com/xinjli/allosaurus so, depending on the cause, there might not be much we can do.

Hunanbean commented 3 years ago

Yes, there appears to be a problem with it truncating roughly the last 0.5 seconds of the audio file. I will see if it works if I add some empty time at the end of the audio file.

aziagiles commented 3 years ago

@steveway Ok. I made a video of the two issues I had. The first is that the conversion from CMU39 to Preston Blair does not work properly - a lot of missing phonemes are reported. The second is that the last phrase or words in the audio are not broken down.

https://user-images.githubusercontent.com/9162114/124361546-89e12100-dc27-11eb-850d-45200b9caa2d.mp4

Hunanbean commented 3 years ago

Ok, after I added 1 second of silence to the end of the audio file, it now picks up the last phrase.

Edit: I was mistaken, it still truncates the end. It was just the audio lining up with the end, not the actual conversion. The last word(s) are still truncated.

aziagiles commented 3 years ago

@Hunanbean I just added a second of silence at the end of my audio, and after the breakdown it ended exactly where my audio ended, but unfortunately it still did not pick up the last phrase.

Hunanbean commented 3 years ago

Hmm. Perhaps it just cannot recognize the last phrase, or more silence needs to be added?

Edit: My mistake, you are correct. It is still truncating the end. It was just the audio now finishing at the correct spot.

Hunanbean commented 3 years ago

Yes, I verified that any file I try, regardless of added silence at the end, truncates the last phrase.

steveway commented 3 years ago

I think I found the cause. There is of course still the possibility that Allosaurus can't recognize all the phonemes, but my code did accidentally skip the last few phonemes in some cases. The reason was the logic I used to chunk phonemes into possible "words": I use peak detection on the time between phonemes to decide where to split between possible words. If that result had an uneven length, the loop over it would skip the last chunk. I changed this a bit now: https://github.com/steveway/papagayo-ng/commit/fb9437798877d5494a4918cada2cede945773efb
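A simplified illustration of the chunking and the boundary handling (a fixed gap threshold stands in for the actual peak detection; this is not the code from the commit):

```python
# Simplified sketch of gap-based word chunking. The real code uses peak
# detection on the inter-phoneme gaps; here a fixed threshold stands in.
def chunk_phonemes(phonemes, starts, gap_threshold=0.3):
    """phonemes: list of symbols, starts: start time of each phoneme in seconds."""
    boundaries = [0]
    for i in range(1, len(starts)):
        if starts[i] - starts[i - 1] > gap_threshold:
            boundaries.append(i)
    boundaries.append(len(phonemes))      # always close the last chunk (the old bug)
    return [phonemes[a:b] for a, b in zip(boundaries, boundaries[1:])]

print(chunk_phonemes(["h", "ə", "l", "oʊ", "w", "ɝ", "l", "d"],
                     [0.0, 0.1, 0.2, 0.3, 0.9, 1.0, 1.1, 1.2]))
# [['h', 'ə', 'l', 'oʊ'], ['w', 'ɝ', 'l', 'd']]
```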

Hunanbean commented 3 years ago

With the files I added one second of silence to, this now picks up that last phrase. However, the problem still remains on the same files without the added silence.

Thank you

steveway commented 3 years ago

Allosaurus is likely not picking up that part at all. Can you send a file which has this problem? Then I can test whether it is really Allosaurus or something we do.

Hunanbean commented 3 years ago

Here is an example. The jenna and salli files say the same words. The jenna file is recognized fully without added silence, while the last portion of the salli file is only recognized with silence added. For the salli voice, both versions (with and without silence) are included. ZippityDoDa.zip

Thank you

steveway commented 3 years ago

I now add half a second of silence at the end for Allosaurus; if you then increase the emission to about 1.4, it recognizes your file to the end. It's a bit strange - without upping the emission it doesn't work, even if I add 10 seconds of silence. You can test this by re-downloading and installing this release: https://github.com/steveway/papagayo-ng/releases/tag/1.6.1
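Conceptually the workaround looks like this; a sketch only, and the `emit=` keyword is assumed to correspond to Allosaurus' emission option:

```python
# Sketch: pad the clip with 500 ms of silence and raise the emission factor.
# The emit= keyword is assumed to match Allosaurus' --emit CLI option.
from pydub import AudioSegment
from allosaurus.app import read_recognizer

def recognize_padded(path, emission=1.4):
    audio = AudioSegment.from_file(path) + AudioSegment.silent(duration=500)
    padded_path = path + ".padded.wav"
    audio.export(padded_path, format="wav")
    model = read_recognizer()
    return model.recognize(padded_path, emit=emission)
```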

Hunanbean commented 3 years ago

That is working well. Thank you

aziagiles commented 3 years ago

@steveway The 'Convert Phonemes' function is not working properly in the current master allosaurus branch. When converting from CMU39 to Preston Blair, Rhubarb or Fleming Dobbs, a lot of missing phonemes are reported.

steveway commented 3 years ago

@aziagiles I see, I fixed this just now in this commit: https://github.com/steveway/papagayo-ng/commit/fc971f19a58258a1cd114bf4150cc553a902e6c0 I also saw that there are some more accesses to the voices via the LipSyncDoc object's voices list. All of those accesses should be replaced with the new Nodes. For now it should still work.

steveway commented 3 years ago

Sorry, I didn't want to close this, I must have misclicked. Anyway, this is now pretty solid and I believe more stable than the old version.

aziagiles commented 3 years ago

@steveway The Convert Phonemes function now works great. Thanks for the fix. But please, how do we use Rhubarb for automatic speech/phonetics generation/alignment in this version of Papagayo-NG, especially since the Allosaurus breakdown already runs automatically when our audio is loaded?

steveway commented 3 years ago

Hello @aziagiles, before, this was determined by some simple logic: first it would try Allosaurus, and if that failed it would try Rhubarb. But I changed this now. With my new changes you have the option to manually select between Rhubarb and Allosaurus. For that there is a combobox in the settings, and the voice recognition button now has a little arrow which pops out a menu where you can change that same selection. 5c88ef94205ad712d61d9d5a13d4325a3308f132 (screenshot: voice_selection)

aziagiles commented 3 years ago

@steveway Hello Steve. I can't resize the text elements on the waveform from their end when using Preston Blair, Fleming Dobbs or Rhubarb breakdowns.

aziagiles commented 3 years ago

@steveway Also, the Convert Phonemes function doesn't work from Rhubarb or Fleming Dobbs breakdowns to Preston Blair.

aziagiles commented 3 years ago

@steveway Please, how do the Add, Remove and Change Voice functions work?

steveway commented 3 years ago

Hi, I see. I believe I found the cause and fixed it. The problem was that for the right-side constraint we fell back to "0" when we couldn't get the information from the parent objects; it now falls back to the last frame in that case.

I also fixed the conversions a bit. The problem is that the old project files don't save the phoneme set used, and the new .pg2 files didn't either. I updated the new format to save it; the old one can't be changed because that could create problems for other projects that read it. There is also a new menu: one button tries to convert as before, and the other sets the currently selected phoneme set as the one in use. So if you have an old project, you can first select and set the phoneme set and then convert it to the destination.

Adding and removing voices can be done with the little + and - buttons. A new tab will appear for the newly created voice; you can switch voices by clicking on the tabs and remove one by clicking on -.

aziagiles commented 3 years ago

@steveway Hello Steve. I really appreciate the work you keep putting into this version of Papagayo-NG. I tested the recent version and noticed a slight error in the Convert Phonemes function: conversion from Rhubarb to other phoneme breakdowns doesn't work.

steveway commented 3 years ago

I see why that is. The phoneme conversion for Rhubarb has never been adjusted: https://github.com/morevnaproject-org/papagayo-ng/blob/master/phonemes/rhubarb.json It seems it was based on the Preston Blair one, but nobody has adjusted the phoneme_conversion dictionary yet.
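For illustration only, a possible CMU-39 to Rhubarb mapping could look something like this (expressed here as a Python dict; the shape names A-H/X are Rhubarb's, but the assignments below are guesses that would still need to be tuned, for example with the helper tool described in the following comments):

```python
# Illustrative only: a possible CMU-39 -> Rhubarb mouth-shape mapping, using the
# A-H/X shape names from Rhubarb's documentation. The real phoneme_conversion
# dict in phonemes/rhubarb.json still has to be filled in by hand.
CMU39_TO_RHUBARB = {
    "M": "A", "B": "A", "P": "A",        # closed lips
    "K": "B", "S": "B", "T": "B",        # slightly open, teeth mostly closed
    "EH": "C", "AE": "C",                # open mouth
    "AA": "D",                           # wide open
    "AO": "E", "ER": "E",                # slightly rounded
    "UW": "F", "OW": "F", "W": "F",      # puckered
    "F": "G", "V": "G",                  # upper teeth on lower lip
    "L": "H",                            # tongue raised
    "rest": "X",                         # idle / closed
}
```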

steveway commented 3 years ago

I started a little tool which can help to create these conversions. For now it only displays the mouths for the CMU set and the Rhubarb one in a grid with some checkboxes; there is no functionality behind the checkboxes yet, but it might help for visualizing. Maybe instead of a bunch of checkboxes we can use some dropdowns: we would show the pictures for the CMU set as it is now, and right beside each one there would be a dropdown containing the Rhubarb pictures. Then you could go through the CMU phonemes one by one and select the best-fitting Rhubarb shape right next to it.

steveway commented 3 years ago

And another update. I changed the helper tool as I mentioned, and it can now also save the changes. Currently some things are hard-coded in that script, like the input and output paths. This is how the tool looks: (screenshot: phoneme_conversion_helper) Just start it, go through the list from top to bottom, select the fitting phonemes, and click Save at the bottom. It will save a "test_set.json" file in the phonemes directory. If everything works out, swapping rhubarb.json for that file should fix the problem with its conversions. I will likely improve the tool so it can show and modify any phoneme set; there are probably improvements we can make for the other sets too.

Hunanbean commented 3 years ago

It is amazing what you have turned this software into. Thank you! I look forward to testing this later today.

P.S. If you note any improvements that should be made to the CMU mouth shapes, please let me know.

aziagiles commented 3 years ago

You're right, @Hunanbean. @steveway has really turned this software into something huge and awesome. But it may surprise you to know I haven't updated to Papagayo-NG 1.6.4.1 yet and still use Papagayo-NG 1.6.3. I believe Allosaurus, and the Allosaurus conversion to other phoneme breakdowns, is better in version 1.6.3 than in 1.6.4.1. In version 1.6.3, after conversion the mouths change at a more normal and natural speed, while in 1.6.4.1 after conversion the mouths change extremely fast and look unnatural.

steveway commented 3 years ago

Thanks everyone, I hope Papagayo-NG will be used more thanks to these improvements.

@aziagiles That is likely because of the Emission Multiplier. Try lowering the value slightly below 1 maybe.

I already gave the Phoneme Conversion Helper Tool a big overhaul. You can select which set you want to modify and it will then allow you to modify the conversions to the other sets. (screenshot: phoneme_helper_new) Here I selected the Rhubarb set, and as you can see it allows me to modify the conversions to CMU 39, Preston Blair and Fleming Dobbs. At the moment it saves to a file with a slightly different name in the phonemes directory, and Papagayo-NG still needs to be adjusted to use these new conversions when available. So, if anyone wants to try improving the conversions, go ahead. :thumbsup:

aziagiles commented 3 years ago

@steveway Ok, I get it. Thanks for the detailed explanation and improvements.

steveway commented 3 years ago

No problem. I've made a few more changes on my current work branch: https://github.com/steveway/papagayo-ng/tree/cli_test Papagayo-NG is now changed to use these new conversions. At the moment only the conversions from CMU_39 do anything useful, because only those have been created. With the help of the new helper tool we should be able to create the other conversions; they have to be created for every combination. But the end result should be more correct than the hack we have now.

steveway commented 3 years ago

Alright, I've improved the helper tool slightly and created some first conversions for all the combinations. You can test this with this branch: https://github.com/steveway/papagayo-ng/tree/cli_test

https://user-images.githubusercontent.com/3737094/130054918-c442ab5d-4d39-4208-88fc-ca24ce928f1f.mp4

I added a new option to the General Settings for playback: you can set it to hold the phonemes. It seems the output from Allosaurus expects us to hold the last phoneme during playback instead of inserting "rest" phonemes like before (sketched below). As you can see in that video, this makes the playback look much more correct. This fixes the problem @aziagiles mentioned:

In version 1.6.3, after conversion the mouths change at a more normal and natural speed, while in 1.6.4.1 after conversion the mouths change extremely fast and look unnatural.

(There are still some problems with the end not always being recognized correctly, but that seems to be a problem with Allosaurus itself.)
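A toy sketch of the difference between the two playback behaviors (frame numbers, the two-frame phoneme length and the "rest" symbol are illustrative):

```python
# Sketch: which mouth shape is shown on a given frame, with and without "hold".
# `phonemes` maps start frame -> phoneme; `rest` is the closed/neutral mouth.
def mouth_for_frame(frame, phonemes, hold=True, phoneme_length=2, rest="rest"):
    starts = sorted(f for f in phonemes if f <= frame)
    if not starts:
        return rest
    last_start = starts[-1]
    if hold or frame < last_start + phoneme_length:
        return phonemes[last_start]      # hold the last phoneme until the next one
    return rest                          # old behavior: drop back to "rest" in gaps

phonemes = {0: "HH", 6: "AH0", 20: "L"}
print([mouth_for_frame(f, phonemes, hold=False) for f in range(12)])
# ['HH', 'HH', 'rest', 'rest', 'rest', 'rest', 'AH0', 'AH0', 'rest', 'rest', 'rest', 'rest']
print([mouth_for_frame(f, phonemes, hold=True) for f in range(12)])
# ['HH', 'HH', 'HH', 'HH', 'HH', 'HH', 'AH0', 'AH0', 'AH0', 'AH0', 'AH0', 'AH0']
```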

steveway commented 3 years ago

Ok, I apparently hadn't pushed the mentioned playback change yet. It is now available in my CLI test branch. I also fixed a bug in my phoneme helper tool which previously saved the phonemes incorrectly, and I fixed the phoneme conversions accordingly. There might still be some improvements to make, but it's quite good already. Note that conversion is a lossy process: not all sets can be mapped perfectly, and you will lose information when converting.

Hunanbean commented 3 years ago

It appears to be working as it should, but I have limited testing ability at the moment. Very cool, thank you!

steveway commented 3 years ago

Hello, I merged a few changes now since they seem to work pretty well. The newest change also fixes a bug some of you experienced with Allosaurus: sometimes it seemed it did not recognize everything to the end. The chunking into words that I added produced a problem in some cases. This should now be fixed, and all the phonemes Allosaurus recognized should appear.

aziagiles commented 3 years ago

@morevnaproject Hello Konstantin. I really believe we have a pretty good lip-sync program here (Papagayo-NG). The problem now is that, for us Blender 2.93.x users, the LipSync Importer & Blinker addon doesn't work in Blender 2.93.x. Even the version upgraded by iCEE for Blender 2.8 (https://github.com/iCEE-HAM/io_import_lipSync-blender2.8?fbclid=IwAR32EEQyncGUDDbvpBzEu1Hr5V6J1SPNz7aVXRIeABZEQKSM32CVqspvy3A) doesn't work in 2.93.x. Please, I really pray the addon gets upgraded, as it would be very helpful for us using it in major projects.