powroupi / blender_mmd_tools

mmd_tools is a Blender addon for importing models and motions of MikuMikuDance.
GNU General Public License v3.0

lip sync #90

Open Hogarth-MMD opened 6 years ago

Hogarth-MMD commented 6 years ago

https://sites.google.com/site/khuuyjblend/home/blender/script/lipsync

Here is a lip synchronization tool by a Japanese author which uses Blender, Python, and an open source speech recognition engine called Julius. It may be a while before I have time to take a look at it, but maybe someone else would like to look at it now, so I am sharing the link. Any thoughts about how this could be adapted to MMD and mmd_tools? The ideal would be a direct conversion of an audio file to a VMD talking animation, to speed up making a lip sync animation.

Hogarth-MMD commented 6 years ago

anim_julipsync.py

I installed this Python add-on in Blender, but no UI for the add-on appears, and I have no idea how to make it appear. A panel header titled Lip Sync shows at the bottom of the Properties panel on the right side of the 3D View, but the panel is completely empty.

Hogarth-MMD commented 6 years ago

Traceback (most recent call last):
  File "C:\Program Files\Blender Foundation\Blender\2.79\scripts\addons_contrib\anim_julipsync.py", line 542, in draw
    ANIM_PT_jlipsync.__handle = context.region.callback_add(draw_callback, (context,), 'POST_PIXEL')
AttributeError: 'Region' object has no attribute 'callback_add'

"blender": (2, 62, 0),

This add-on has a compatibility issue with recent versions of Blender. I tried it with Blender 2.64 and the UI was displayed successfully. In the UI, I did not see any option to use an audio file as input. This add-on also uses the computer's network card, which would make some people nervous about their computer security, so that aspect is not ideal.

Hogarth-MMD commented 6 years ago

https://github.com/julius-speech/julius

Julius is on Github.

Hogarth-MMD commented 6 years ago

I read some of the Julius documentation. According to the documentation, Julius is able to take an audio file as input and write its output to a text file containing the recognized words and a list of the phonemes. Julius is speech-to-text software. But I have not found any information saying that Julius can output the timing of the phonemes. That timing information would be essential for creating a keyframe animation.

Hogarth-MMD commented 6 years ago

Hello @nagadomi ! 3 months ago, I asked this question about Julius lip sync, but I received no reply. Maybe I received no reply because of the Japanese-English language barrier. Can you please translate my question into Japanese and ask it here: https://github.com/julius-speech/julius/issues/78

nagadomi commented 6 years ago

I asked this question about Julius lip sync, but I received no reply. Maybe I received no reply because of the Japanese-English language barrier.

Maybe it does not matter; library developers/researchers are often not interested in applications and user support.

I'll reply instead: you can get the timing of the phonemes (frame ids) with the -palign option.

usage:

First, convert the audio file to a mono/16000 Hz wav, which is the input format required by julius.

ffmpeg -i audio.mp3 -acodec pcm_s16le -ac 1 -ar 16000 audio_mono_16000.wav

Then run julius. (The julius dictation-kit is stored at ../dictation-kit.)

echo audio_mono_16000.wav | julius -C ../dictation-kit/am-gmm.jconf -C ../dictation-kit/main.jconf -palign -1pass -input rawfile

Result (the input file I tested contains "ko re ha ma i ku no te su to de su", i.e. "this is a mic test"):

....
=== begin forced alignment ===
-- phoneme alignment --
 id: from  to    n_score    unit
 ----------------------------------------
[   0   68]  -19.388691  silB
[  69   73]  -27.776392  k+o
[  74   81]  -29.069199  k-o+r
[  82   87]  -28.384277  o-r+e
[  88   92]  -30.289234  r-e+w
[  93  100]  -28.671051  e-w+a
[ 101  107]  -28.113108  w-a+m
[ 108  116]  -27.141819  a-m+a
[ 117  125]  -29.584555  m-a+i
[ 126  137]  -30.973288  a-i+k
[ 138  141]  -27.644470  i-k+u
[ 142  150]  -30.328804  k-u+n
[ 151  155]  -28.697704  u-n+o
[ 156  162]  -27.159250  n-o+t
[ 163  170]  -26.660339  o-t+e:[o-t+e]
[ 171  180]  -25.826904  t-e:+s
[ 181  185]  -24.926464  e:-s+u[e-s+u]
[ 186  188]  -28.368002  s-u+t
[ 189  195]  -27.901438  u-t+o
[ 196  202]  -28.325823  t-o+d
[ 203  207]  -26.419239  o-d+e
[ 208  217]  -23.810205  d-e+s
[ 218  227]  -23.197510  e-s+u
[ 228  232]  -28.345703  s-u
[ 233  327]  -20.266222  silE
re-computed AM score: -7800.596191
=== end forced alignment ===

The first column is in the form [$(begin frame id) $(end frame id)]. You can calculate the timing from the frame ids.

phoneme begin time in milliseconds = $(begin frame id) * 10 + (12.5 if $(begin frame id) != 0 else 0)
phoneme end time in milliseconds = ($(end frame id) + 1) * 10 + 12.5

Edit: fix offset 25ms->12.5ms

ref: https://github.com/julius-speech/segmentation-kit/blob/78ece00b3c0e52f9281667a0d69e91558404816c/segment_julius.pl#L164-L168
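
A minimal sketch of these formulas in Python (the function names here are illustrative, not part of julius):

def begin_ms(begin_frame_id):
    # Each frame is 10 ms; frames other than 0 carry a 12.5 ms offset
    # (see the segment_julius.pl lines linked above).
    return begin_frame_id * 10 + (12.5 if begin_frame_id != 0 else 0)

def end_ms(end_frame_id):
    return (end_frame_id + 1) * 10 + 12.5

# Example: the unit k+o above spans frames [69 73],
# so it runs from begin_ms(69) = 702.5 ms to end_ms(73) = 752.5 ms.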

Hogarth-MMD commented 6 years ago

Excellent, thanks! @nagadomi solved the puzzle! Hopefully I will be able to follow your instructions and get the same result.

nagadomi commented 6 years ago

Maybe you can also use pocketsphinx instead of julius. https://cmusphinx.github.io/wiki/phonemerecognition/ https://stackoverflow.com/a/30715295

Hogarth-MMD commented 6 years ago

C:\dictation-kit>nagadomi.bat

C:\dictation-kit>echo audio_mono_16000.wav | julius -C "C:/dictation-kit/am-gmm.jconf" -C "C:/dictation-kit/main.jconf" -palign -1pass -input rawfile
STAT: include config: C:/dictation-kit/am-gmm.jconf
STAT: include config: C:/dictation-kit/main.jconf
STAT: jconf successfully finalized
STAT: *** loading AM00 _default
Stat: init_phmm: Reading in HMM definition
Error: init_phmm: failed to read C:/dictation-kit/model/phone_m/jnas-tri-3k16-gid.binhmm
ERROR: m_fusion: failed to initialize AM
ERROR: Error in loading model

C:\dictation-kit>

I did this:

  1. Saved the command line code (slightly edited) to a Windows .bat file called nagadomi.bat.
  2. Renamed dictation-kit-master to dictation-kit and copied it to my C:\ directory.
  3. Copied nagadomi.bat and my correctly resampled audio_mono_16000.wav file into C:\dictation-kit.
  4. Copied all of the files from C:\dictation-kit\bin\windows (including julius.exe) into C:\dictation-kit.
  5. Ran nagadomi.bat from the Windows command line with administrator privileges.

The above error message resulted.

nagadomi commented 6 years ago

Now I have tested on Windows. The issue is that the model files are not correctly included in dictation-kit-master.zip, because the dictation-kit repo uses Git LFS (a Git extension for large files).

How to download dictation-kit:

  1. Install Git LFS (and Git, if you have not installed it): https://git-lfs.github.com/
  2. Restart the command prompt if it is open.
  3. Clone dictation-kit repo.
    git clone https://github.com/julius-speech/dictation-kit.git

    then check the file size of model/phone_m/jnas-tri-3k16-gid.binhmm. It is 11 MB if it is correct.

Edit: the test file I used is https://raw.githubusercontent.com/julius-speech/segmentation-kit/master/wav/sample.wav. Result on Windows:

>  echo sample.wav | .\julius.exe -C .\dictation-kit\am-gmm.jconf -C .\dictation-kit\main.jconf -palign -charconv utf-8 oem -fallback1pass -input rawfile
....
ALIGN: === phoneme alignment begin ===
sentence1:  今日 は いい 天気 だ 。
wseq1: <s> 今日+名詞 は+助詞 いい+形容詞 天気+名詞 だ+助動詞 </s>
phseq1: silB | ky o: | w a | i: | t e N k i | d a | silE
cmscore1: 0.579 0.384 0.458 0.110 0.348 0.497 1.000
score1: -4648.166016
=== begin forced alignment ===
-- phoneme alignment --
 id: from  to    n_score    unit
 ----------------------------------------
[   0   20]  -19.319162  silB
[  21   30]  -25.296345  ky+o:[ky+o]
[  31   54]  -20.190138  ky-o:+w[y-o:+w]
[  55   67]  -23.106276  o:-w+a[o-w+a]
[  68   78]  -27.229338  w-a+i:[w-a+i]
[  79   97]  -24.443262  a-i:+t
[  98  106]  -23.263319  i:-t+e[i-t+e]
[ 107  114]  -23.706543  t-e+N
[ 115  125]  -24.538818  e-N+k
[ 126  136]  -23.495716  N-k+i
[ 137  141]  -24.824121  k-i+d
[ 142  151]  -24.408911  i-d+a
[ 152  161]  -24.646778  d-a
[ 162  203]  -18.818771  silE
re-computed AM score: -4540.023438
=== end forced alignment ===

It seems that -fallback1pass is better than -1pass. With some wav files the 2nd pass will fail, but if you specify -fallback1pass, only the 1st pass is executed when the 2nd pass fails.

Hogarth-MMD commented 6 years ago

So why does this large-files issue happen only on Windows? Why doesn't it also happen on Linux?

nagadomi commented 6 years ago

No, I cloned the git repo on Linux with Git LFS installed, so I was not aware of that issue. Downloading from https://osdn.net/projects/julius/releases/66544 seems to include all the files.

Hogarth-MMD commented 6 years ago

OSDN download of Julius dictation kit (407 MB) https://osdn.net/projects/julius/downloads/66544/dictation-kit-v4.4.zip/

Hogarth-MMD commented 6 years ago

I don't understand what the results mean or how to make a phoneme animation from them: t-e+N, a-i:+t, ky-o:+w[y-o:+w]. We need the timing information for each phoneme. Each result looks like 3 phonemes joined with + - : [ ] symbols.

nagadomi commented 6 years ago

First, I am not familiar with the speech recognition field or julius; I am investigating it now. unit is in triphone HMM format, a representation of the continuity of the phonemes: in X-Y+Z, the center Y is the phoneme itself and X and Z are its left and right context. Simply put, we can keep the center phoneme and ignore the rest. So maybe we can change the command parameters of julius:

# use monophone model
echo sample.wav |julius -C ../dictation-kit/main.jconf -h ../dictation-kit/model/phone_m/jnas-mono-16mix-gid.binhmm -palign -fallback1pass -input rawfile

and result is here.

=== begin forced alignment ===
-- phoneme alignment --
 id: from  to    n_score    unit
 ----------------------------------------
[   0   21]  -19.382792  silB
[  22   30]  -25.662249  ky
[  31   53]  -20.818394  o:
[  54   67]  -23.519619  w
[  68   74]  -27.360666  a
[  75   96]  -25.364830  i:
[  97  106]  -23.746216  t
[ 107  116]  -24.600780  e
[ 117  125]  -24.467638  N
[ 126  136]  -24.526390  k
[ 137  142]  -25.521524  i
[ 143  149]  -24.958252  d
[ 150  159]  -24.305761  a
[ 160  203]  -19.381775  silE
re-computed AM score: -4612.191895
=== end forced alignment ===

Extracting the phonemes without silB, silE and sp:

kyowaitenkida

Cleaned up and segmented as romaji:

kyo wa i te n ki da

Simplified to vowels; these will be the MMD morphs:

o a i e n i a

Note: just ignore n (ん) because there is no mouth movement. Some models do not have an e (え) morph, but I think it can be approximated by an a (あ) + i (い) compound.
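
A minimal Python sketch of this vowel simplification (the mapping table and skip set are my assumptions; morph names vary between MMD models):

VOWEL_OF = {'a': 'あ', 'i': 'い', 'u': 'う', 'e': 'え', 'o': 'お'}
SKIP = {'silB', 'silE', 'sp', 'N'}  # silences, pauses and ん: no mouth movement

def to_mouth_morphs(phonemes):
    # Keep only the vowels; strip the ':' length mark so 'o:' maps like 'o'.
    morphs = []
    for ph in phonemes:
        if ph in SKIP:
            continue
        vowel = VOWEL_OF.get(ph.rstrip(':'))
        if vowel is not None:
            morphs.append(vowel)
    return morphs

# ['ky', 'o:', 'w', 'a', 'i:', 't', 'e', 'N', 'k', 'i', 'd', 'a']
# -> ['お', 'あ', 'い', 'え', 'い', 'あ']  (o a i e i a, as above)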

Hogarth-MMD commented 6 years ago

"phonemes begin time in milliseconds = $(begin frame id) 10 + (25 if ($begin frame ID) != 0 else 0); phonemes end time in milliseconds = ($(end frame id) + 1) 10 + 25;"

So the duration of each frame is 1/100 second?

Hogarth-MMD commented 6 years ago

We need something which is user-friendly for the average user of MMD. So if anyone has any ideas about how to make this more user-friendly, please share these ideas.

Hogarth-MMD commented 6 years ago

echo sample.wav |julius -C ../dictation-kit/main.jconf -h ../dictation-kit/model/phone_m/jnas-mono-16mix-gid.binhmm -palign -fallback1pass -input rawfile -outfile

Adding the -outfile switch makes julius write the sample.wav phonemes to a text file called sample.out.

nagadomi commented 6 years ago

So the duration of each frame is 1/100 second?

Yes, the duration of each frame is 0.01 s, but frames other than 0 have a 0.0125 s offset. (In the code above I wrote 25 ms, but 12.5 ms is correct :disappointed: )

This code will be helpful. https://github.com/julius-speech/segmentation-kit/blob/78ece00b3c0e52f9281667a0d69e91558404816c/segment_julius.pl#L151-L174
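
For example, the ky span [21 30] in the sample.wav alignment above begins at 21 * 10 + 12.5 = 222.5 ms and ends at (30 + 1) * 10 + 12.5 = 322.5 ms, so it lasts exactly 100 ms.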

Hogarth-MMD commented 6 years ago

When you say that there is an offset of 1/80 second, I think this means that the first 1/80 seconds of the .wav audio is ignored and omitted from the analysis. Right?

Hogarth-MMD commented 6 years ago

And therefore the duration of the resulting lip animation will be 1/80 seconds shorter than the duration of the .wav audio recording. Right?

Hogarth-MMD commented 6 years ago

But the resulting lip animation and the .wav audio recording should both end at exactly the same time. Right? The last keyframe of the lip animation should happen at the exact time as the end of the .wav audio recording. Right?

Hogarth-MMD commented 6 years ago

Okay, I am already programming a new "Import Julius Data" add-on. I need a complete list of all the phonemes which may appear in the .out file.
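
For what it's worth, here is a minimal sketch of the parsing step in Python (the regex and names are illustrative, not from the actual add-on); it extracts (begin frame, end frame, phoneme) tuples from the forced-alignment block of a .out file:

import re

ALIGN_LINE = re.compile(r'\[\s*(\d+)\s+(\d+)\]\s+(-?[\d.]+)\s+(\S+)')

def parse_alignment(text):
    # Collect (begin_frame, end_frame, unit) from the forced alignment block.
    entries = []
    in_block = False
    for line in text.splitlines():
        if 'begin forced alignment' in line:
            in_block = True
        elif 'end forced alignment' in line:
            in_block = False
        elif in_block:
            m = ALIGN_LINE.match(line.strip())
            if m:
                begin, end, _score, unit = m.groups()
                entries.append((int(begin), int(end), unit))
    return entries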

Hogarth-MMD commented 6 years ago

silB = beginning silence, silE = end silence. I don't know if these have any relationship to this annoying offset (1/80 second).

nagadomi commented 6 years ago

I tried writing code to convert the julius output to a vowels-only sentence, and frame ids to times. I have not confirmed the actual motion yet, so I'm not sure this code is correct.

Code: https://gist.github.com/nagadomi/2b8131ed5f50e375f306b146f8840d11
Test file tts.wav: tts.wav.zip

Hogarth-MMD commented 6 years ago

I have been disconnected from the internet for the past 3 days because of a Windows malfunction. I haven't finished the Julius lip sync add-on that I was working on. Do you want to make this Julius lip sync add-on, @nagadomi? That is completely okay with me. I am trying to do too many things at the same time, and I cannot do everything at once.

Hogarth-MMD commented 6 years ago

lip sync Julius to Blender MMD add-on https://sta.sh/022qfe96dpx2

Hogarth-MMD commented 6 years ago

Blender has an issue with frames per second. It looks to me like audio files are imported into the Blender video editor at a frame rate of 30 fps, and the maximum preview playback fps of a morph animation (in Blender 2.79b) seems to be 30 fps. So I cannot test this add-on at any frame rate other than 30 fps.

Hogarth-MMD commented 6 years ago

That is my contribution to this effort. I have no plan to spend any additional time developing this add-on. If @nagadomi wants to be the developer of this add-on, that is completely okay with me. lip sync Julius to Blender MMD add-on is an importer for Julius .out text files. The mesh object of an MMD character must be the active object when importing the .out file. A menu item for this add-on appears in Blender's File > Import menu.

nagadomi commented 6 years ago

OK, I may develop it when I have spare time, but I can't promise it.

And the maximum preview playback fps of a morphs animation (in Blender 2.79b) seems to be 30 fps.

I often use 60 fps. You can change the fps under Render tab > Dimensions > Frame Rate, and set the sync mode on the timeline window to AV-sync.

By the way, I noticed that julius strips completely silent frames (zero values). This causes an issue where the sound source and the motion cannot stay in sync. A quick and dirty solution is to add noise to the sound source before processing:

sox tts.wav -p synth whitenoise vol 0.02 | sox -m tts.wav - tts_addednoise.wav
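
(In the first sox command, tts.wav only supplies the duration and format: synth whitenoise vol 0.02 generates quiet white noise of the same length, and -p sends it to a pipe; the second command then mixes (-m) the original file with that noise into tts_addednoise.wav.)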

tts_addednoise.wav.zip

lieff commented 6 years ago

I have a crazy idea: use something like https://github.com/TadasBaltrusaitis/OpenFace to make lip sync from video instead of audio. It could also simplify creating some mimic effects without any actual audio.

nagadomi commented 6 years ago

It's not such a crazy idea; libraries that capture face morphs from video, such as Live2D/FaceRig, are already in use in the virtual YouTuber field. And this guy has been developing a toolset to capture full motion from video: https://github.com/miu200521358/3d-pose-baseline-vmd video: http://www.nicovideo.jp/watch/sm33232251

Edit: "read facial expression from photo / movie and generate VMD motion" https://github.com/errno-mmd/readfacevmd

lieff commented 6 years ago

Cool :) It's already been done. I'll look at it later and try to integrate it into saba or something, to achieve real-time preview & recording from a webcam.

Hogarth-MMD commented 6 years ago

lip sync Julius to Blender MMD add-on https://sta.sh/07g7actoiqa

Instructions:

Julius dictation kit osdn.net project releases page: https://osdn.net/projects/julius/releases/66544

Julius dictation kit osdn.net download link: https://osdn.net/projects/julius/downloads/66544/dictation-kit-v4.4.zip/

After unzipping the downloaded file (and possibly copying it to the topmost directory of your hard drive), rename the folder dictation-kit-master to dictation-kit.

Windows (and Mac OS X and Linux) instructions:

Copy all of the files from dictation-kit\bin\windows (including julius.exe) into the dictation-kit folder.

(I guess that you should copy the files from dictation-kit\bin\osx if you are using Mac OS X, and from dictation-kit\bin\linux if you are using Linux.)

Convert a speech audio file to an uncompressed mono, 16000-samples-per-second .wav audio file, which is the input format required by Julius. Here is an example audio file that you can use: https://raw.githubusercontent.com/julius-speech/segmentation-kit/master/wav/sample.wav

Copy the following command line to a text file and save it with a .bat extension. (Save it as a .sh file if you are using Mac OS X or Linux.):

echo sample.wav |julius -C ../dictation-kit/main.jconf -h ../dictation-kit/model/phone_m/jnas-mono-16mix-gid.binhmm -palign -fallback1pass -input rawfile -outfile

(The above command line is for an audio file named sample.wav. If your audio file is not named sample.wav, either rename it to sample.wav or edit the command line to refer to the correct name. The name of your audio file must not contain any spaces.)

Copy this .bat file (or .sh file) and the correctly resampled .wav audio file into the dictation-kit folder.

Run the .bat (or .sh) file from the command line. This should output a file named sample.out (or audiofilename_whatever.out). You can then find this .out file in the dictation-kit folder.

Import this .out file into Blender: select File > Import > Julius import operator. Leave the import frames per second at 30 FPS. The mesh object of an MMD model must be the active object. You should then see a (not perfect) lip sync animation on your MMD model when you play back the animation.

Hogarth-MMD commented 6 years ago

So, if we want Julius lip sync animation for Blender and MMD, there are basically 2 sets of unsolved problems:

  1. Needing an algorithm to make smooth transitions from one phoneme to the next phoneme.

  2. We want user friendliness: someone just needs to navigate to and select an audio file, and the lip sync animation is then automatically created in Blender. That means remote-controlling Julius and sending it command line arguments; compatibility with 3 different operating systems; possible issues with needing administrator privileges to run Julius; issues when the audio file does not have the correct format for Julius; and the possible issue of the user not having correctly copied the files from the bin folder. (A rough sketch of such a pipeline follows below.)

I have no idea how to program any of this. This is outside my Python programming experience.
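
As a starting point for point 2, here is a rough, untested sketch of driving the pipeline from Python (the dictation-kit path is an assumption, and a real add-on would also need per-OS handling and proper error reporting):

import os
import subprocess

DICTATION_KIT = os.path.expanduser('~/dictation-kit')  # assumed install location

def run_julius(audio_path):
    # Resample to the mono/16 kHz PCM wav that julius requires.
    wav = os.path.abspath(os.path.splitext(audio_path)[0] + '_mono16k.wav')
    subprocess.run(['ffmpeg', '-y', '-i', audio_path,
                    '-acodec', 'pcm_s16le', '-ac', '1', '-ar', '16000', wav],
                   check=True)
    # Run julius with the monophone model and capture the alignment output.
    cmd = ['julius',
           '-C', os.path.join(DICTATION_KIT, 'main.jconf'),
           '-h', os.path.join(DICTATION_KIT, 'model', 'phone_m',
                              'jnas-mono-16mix-gid.binhmm'),
           '-palign', '-fallback1pass', '-input', 'rawfile']
    result = subprocess.run(cmd, input=wav + '\n', text=True,
                            capture_output=True, check=True, cwd=DICTATION_KIT)
    return result.stdout  # feed this to a .out/alignment parser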

Hogarth-MMD commented 5 years ago

Download LipSynchloid lip sync plug-in (Windows dll) for MikuMikuMoving: https://bowlroll.net/file/29218

The download password is posted in the topmost user comment on the download page. Someone else posted a comment saying that the password does not work, but it really does work. JavaScript must be enabled in your web browser.

LipSynchloid video tutorials: https://www.youtube.com/watch?v=UIWbeCqvj5c

https://www.nicovideo.jp/watch/sm22506025