Hogarth-MMD opened 6 years ago
anim_julipsync.py
I installed this Python add-on in Blender, but no UI for it appears, and I have no idea how to make the UI appear. A panel header with the title Lip Sync appears at the bottom of the Properties panel at the right side of the 3D view, but the panel is completely empty.
Traceback (most recent call last):
  File "C:\Program Files\Blender Foundation\Blender\2.79\scripts\addons_contrib\anim_julipsync.py", line 542, in draw
    ANIM_PT_jlipsync.__handle = context.region.callback_add(draw_callback, (context,), 'POST_PIXEL')
AttributeError: 'Region' object has no attribute 'callback_add'
"blender": (2, 62, 0),
This add-on has a compatibility issue with recent versions of Blender. I tried it with Blender 2.64 and the UI was displayed successfully. In the UI, I did not see any option to use an audio file as the input. This add-on also uses the computer's network card, which could make some people nervous about their computer security, so that aspect is not ideal.
https://github.com/julius-speech/julius
Julius is on Github.
I read some of the Julius documentation. According to it, Julius can take an audio file as input and write its output to a text file containing the recognized words and a list of phonemes. Julius is speech-to-text software. But I have not found any information saying that Julius can output the timing of the phonemes, and that timing information would be essential for creating a keyframe animation.
Hello @nagadomi ! 3 months ago, I asked this question about Julius lip sync, but I received no reply. Maybe I received no reply because of the Japanese-English language barrier. Can you please translate my question into Japanese and ask it here: https://github.com/julius-speech/julius/issues/78
I asked this question about Julius lip sync , but I received no reply. Maybe I received no reply because of the Japanese-English language barrier.
Maybe it does not matter; library developers/researchers are often not interested in applications and user support.
I'll reply instead. You can get the timing of the phonemes (frame ids) with the -palign option.
usage:
First, convert the audio file to a mono/16000 Hz wav, which is the input format required by julius.
ffmpeg -i audio.mp3 -acodec pcm_s16le -ac 1 -ar 16000 audio_mono_16000.wav
Then run julius. (The dictation-kit of julius is stored at ../dictation-kit)
echo audio_mono_16000.wav | julius -C ../dictation-kit/am-gmm.jconf -C ../dictation-kit/main.jconf -palign -1pass -input rawfile
Result (the input file I tested contains "ko re ha ma i ku no te su to de su"):
....
=== begin forced alignment ===
-- phoneme alignment --
id: from to n_score unit
----------------------------------------
[ 0 68] -19.388691 silB
[ 69 73] -27.776392 k+o
[ 74 81] -29.069199 k-o+r
[ 82 87] -28.384277 o-r+e
[ 88 92] -30.289234 r-e+w
[ 93 100] -28.671051 e-w+a
[ 101 107] -28.113108 w-a+m
[ 108 116] -27.141819 a-m+a
[ 117 125] -29.584555 m-a+i
[ 126 137] -30.973288 a-i+k
[ 138 141] -27.644470 i-k+u
[ 142 150] -30.328804 k-u+n
[ 151 155] -28.697704 u-n+o
[ 156 162] -27.159250 n-o+t
[ 163 170] -26.660339 o-t+e:[o-t+e]
[ 171 180] -25.826904 t-e:+s
[ 181 185] -24.926464 e:-s+u[e-s+u]
[ 186 188] -28.368002 s-u+t
[ 189 195] -27.901438 u-t+o
[ 196 202] -28.325823 t-o+d
[ 203 207] -26.419239 o-d+e
[ 208 217] -23.810205 d-e+s
[ 218 227] -23.197510 e-s+u
[ 228 232] -28.345703 s-u
[ 233 327] -20.266222 silE
re-computed AM score: -7800.596191
=== end forced alignment ===
The first column is in the form [$(begin frame id) $(end frame id)]. You can calculate timing from the frame ids.
phoneme begin time in milliseconds = $(begin frame id) * 10 + (12.5 if $(begin frame id) != 0 else 0);
phoneme end time in milliseconds = ($(end frame id) + 1) * 10 + 12.5;
Edit: fix offset 25ms->12.5ms
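A minimal sketch of that conversion in Python (the constant names are mine; the numbers come straight from the formula above):

FRAME_MS = 10.0    # each Julius frame is 10 ms
OFFSET_MS = 12.5   # analysis-window offset applied to every frame except frame 0

def begin_ms(begin_frame_id):
    # phoneme begin time in milliseconds
    return begin_frame_id * FRAME_MS + (OFFSET_MS if begin_frame_id != 0 else 0.0)

def end_ms(end_frame_id):
    # phoneme end time in milliseconds
    return (end_frame_id + 1) * FRAME_MS + OFFSET_MS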
Excellent, thanks! @nagadomi solved the puzzle! Hopefully I will be able to follow your instructions and get the same result.
Maybe you can also use pocketsphinx instead of julius. https://cmusphinx.github.io/wiki/phonemerecognition/ https://stackoverflow.com/a/30715295
C:\dictation-kit>nagadomi.bat
C:\dictation-kit>echo audio_mono_16000.wav | julius -C "C:/dictation-kit/am-gmm.jconf" -C "C:/dictation-kit/main.jconf" -palign -1pass -input rawfile
STAT: include config: C:/dictation-kit/am-gmm.jconf
STAT: include config: C:/dictation-kit/main.jconf
STAT: jconf successfully finalized
STAT: *** loading AM00 _default
Stat: init_phmm: Reading in HMM definition
Error: init_phmm: failed to read C:/dictation-kit/model/phone_m/jnas-tri-3k16-gid.binhmm
ERROR: m_fusion: failed to initialize AM
ERROR: Error in loading model
C:\dictation-kit>
I did this:
Saved the command line code (slightly edited) to a Windows .bat file called nagadomi.bat
Renamed dictation-kit-master to dictation-kit and copied it to my C:\ directory
Copied nagadomi.bat and my correctly resampled audio_mono_16000.wav file into C:\dictation-kit
Copied all of the files from C:\dictation-kit\bin\windows (including julius.exe) into C:\dictation-kit
Ran nagadomi.bat from the Windows command line with administrator privileges
The above error message resulted.
Now I have tested on Windows. The issue is that the model files are not correctly included in dictation-kit-master.zip, because the dictation-kit repo uses Git LFS (a Git extension for large files).
How to download dictation-kit:
git clone https://github.com/julius-speech/dictation-kit.git
and check the file size of model/phone_m/jnas-tri-3k16-gid.binhmm. It is 11 MB if it is correct.
Edit: test file I used: https://raw.githubusercontent.com/julius-speech/segmentation-kit/master/wav/sample.wav Result on Windows:
> echo sample.wav | .\julius.exe -C .\dictation-kit\am-gmm.jconf -C .\dictation-kit\main.jconf -palign -charconv utf-8 oem -fallback1pass -input rawfile
....
ALIGN: === phoneme alignment begin ===
sentence1: 今日 は いい 天気 だ 。
wseq1: <s> 今日+名詞 は+助詞 いい+形容詞 天気+名詞 だ+助動詞 </s>
phseq1: silB | ky o: | w a | i: | t e N k i | d a | silE
cmscore1: 0.579 0.384 0.458 0.110 0.348 0.497 1.000
score1: -4648.166016
=== begin forced alignment ===
-- phoneme alignment --
id: from to n_score unit
----------------------------------------
[ 0 20] -19.319162 silB
[ 21 30] -25.296345 ky+o:[ky+o]
[ 31 54] -20.190138 ky-o:+w[y-o:+w]
[ 55 67] -23.106276 o:-w+a[o-w+a]
[ 68 78] -27.229338 w-a+i:[w-a+i]
[ 79 97] -24.443262 a-i:+t
[ 98 106] -23.263319 i:-t+e[i-t+e]
[ 107 114] -23.706543 t-e+N
[ 115 125] -24.538818 e-N+k
[ 126 136] -23.495716 N-k+i
[ 137 141] -24.824121 k-i+d
[ 142 151] -24.408911 i-d+a
[ 152 161] -24.646778 d-a
[ 162 203] -18.818771 silE
re-computed AM score: -4540.023438
=== end forced alignment ===
It seems that -fallback1pass is better than -1pass. With some wav files the 2nd pass will fail, but if you specify -fallback1pass, only the 1st pass is executed when the 2nd pass fails.
So why does this large-files issue happen only on Windows? Why doesn't it also happen on Linux?
No, I had cloned the git repo on Linux with Git LFS, so I was not aware of that issue. The download from https://osdn.net/projects/julius/releases/66544 seems to contain all the files.
OSDN download of Julius dictation kit (407 MB) https://osdn.net/projects/julius/downloads/66544/dictation-kit-v4.4.zip/
I don't understand what the results mean or how to make a phoneme animation from them: t-e+N, a-i:+t, ky-o:+w[y-o:+w]. We need timing information for each phoneme, but the results show units built from up to 3 phonemes with + - : [ ] symbols.
First, I am not familiar with the speech recognition field or Julius; I am investigating it now.
unit is in triphone HMM format, which represents each phoneme together with its neighboring phonemes. Put simply, we can keep just the center phoneme of each unit (for example, o in k-o+r) and ignore the surrounding context.
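As a rough sketch (assuming only the unit formats visible in the output above), the context can be stripped from each triphone unit like this:

import re

def center_phoneme(unit):
    # 'k-o+r'         -> 'o'    (left context k, right context r)
    # 'k+o'           -> 'k'    (no left context)
    # 's-u'           -> 'u'    (no right context)
    # 'o-t+e:[o-t+e]' -> 'e:'   (bracketed alternative form is dropped)
    # 'silB'          -> 'silB'
    unit = re.sub(r"\[.*\]$", "", unit)  # drop the bracketed alternative
    unit = unit.split("-")[-1]           # drop the left context
    return unit.split("+")[0]            # drop the right context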
So maybe we can change the command parameters of julius instead:
# use monophone model
echo sample.wav |julius -C ../dictation-kit/main.jconf -h ../dictation-kit/model/phone_m/jnas-mono-16mix-gid.binhmm -palign -fallback1pass -input rawfile
and the result is here:
=== begin forced alignment ===
-- phoneme alignment --
id: from to n_score unit
----------------------------------------
[ 0 21] -19.382792 silB
[ 22 30] -25.662249 ky
[ 31 53] -20.818394 o:
[ 54 67] -23.519619 w
[ 68 74] -27.360666 a
[ 75 96] -25.364830 i:
[ 97 106] -23.746216 t
[ 107 116] -24.600780 e
[ 117 125] -24.467638 N
[ 126 136] -24.526390 k
[ 137 142] -25.521524 i
[ 143 149] -24.958252 d
[ 150 159] -24.305761 a
[ 160 203] -19.381775 silE
re-computed AM score: -4612.191895
=== end forced alignment ===
Extracting the phonemes without silB, silE and sp:
kyowaitenkida
Clean it up and segment it as romaji:
kyo wa i te n ki da
Simplify to vowels; these will be the MMD morphs:
o a i e n i a
Note that we just ignore n (ん) because there is no mouth movement.
Some models do not have e (え), but I think it can be replaced by an a (あ) + i (い) compound.
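A minimal sketch of that simplification, assuming the standard あ/い/う/え/お mouth morph names (adjust the table for models that lack え):

# Map a Julius phoneme to an MMD mouth morph; None means no mouth movement
VOWEL_TO_MORPH = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "a:": "あ", "i:": "い", "u:": "う", "e:": "え", "o:": "お",
}

def morph_for(phoneme):
    # consonants, N (ん), sp and silB/silE all return None
    return VOWEL_TO_MORPH.get(phoneme)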
"phonemes begin time in milliseconds = $(begin frame id) 10 + (25 if ($begin frame ID) != 0 else 0); phonemes end time in milliseconds = ($(end frame id) + 1) 10 + 25;"
So the duration of each frame is 1/100 second?
We need something which is user-friendly for the average user of MMD. So if anyone has any ideas about how to make this more user-friendly, please share these ideas.
echo sample.wav |julius -C ../dictation-kit/main.jconf -h ../dictation-kit/model/phone_m/jnas-mono-16mix-gid.binhmm -palign -fallback1pass -input rawfile -outfile
Adding the -outfile switch, the phonemes for sample.wav are printed to a text file called sample.out.
So the duration of each frame is 1/100 second?
Yes, the duration of each frame is 0.01s, but frames other than 0 have a 0.0125s offset. (In the code above I wrote 25ms, but 12.5ms is correct :disappointed: )
This code will be helpful. https://github.com/julius-speech/segmentation-kit/blob/78ece00b3c0e52f9281667a0d69e91558404816c/segment_julius.pl#L151-L174
When you say that there is an offset of 1/80 second, I think this means that the first 1/80 seconds of the .wav audio is ignored and omitted from the analysis. Right?
And therefore the duration of the resulting lip animation will be 1/80 seconds shorter than the duration of the .wav audio recording. Right?
But the resulting lip animation and the .wav audio recording should both end at exactly the same time. Right? The last keyframe of the lip animation should happen at the exact time as the end of the .wav audio recording. Right?
Okay, I am already programming a new "Import Julius Data" add-on. I need to have a complete list of all phonemes which may appear in the .out file.
silB = beginning silence, silE = end silence. I don't know if these have any relationship to this annoying offset (1/80 second).
I tried writing code to convert the julius output to a vowels-only sentence, and frame ids to times. I have not confirmed the actual motion yet, so I’m not sure this code is correct.
code: https://gist.github.com/nagadomi/2b8131ed5f50e375f306b146f8840d11 test file tts.wav: tts.wav.zip
I have been disconnected from the internet for the past 3 days because of a Windows malfunction. I haven't finished the Julius lip sync add-on that I was working on. Do you want to make this Julius lip sync add-on, @nagadomi? That would be completely okay with me. I am trying to do too many things at once, and I cannot do everything at the same time.
lip sync Julius to Blender MMD add-on https://sta.sh/022qfe96dpx2
Blender has an issue with frames per second. It looks to me like audio files are imported into the Blender video editor at a frame rate of 30 fps. And the maximum preview playback fps of a morph animation (in Blender 2.79b) seems to be 30 fps. So I cannot test this add-on with any FPS other than 30 FPS.
That is my contribution to this effort. I have no plan to spend any additional time developing this add-on. If @nagadomi wants to be the developer of this add-on, that is completely okay with me. The "lip sync Julius to Blender MMD" add-on is an importer of Julius .out text files. The mesh object of an MMD character must be the active object when importing the .out file. A menu item for this add-on appears in Blender's File, Import menu.
OK, I may develop it when I have spare time, but I can't promise it.
And the maximum preview playback fps of a morphs animation (in Blender 2.79b) seems to be 30 fps.
I often use 60 fps. You can change the fps with Render tab / Dimensions / Frame Rate, and set the sync mode on the timeline window to AV-sync.
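For reference, the same settings can also be changed from the Python console (a small sketch using the standard bpy scene properties):

import bpy

bpy.context.scene.render.fps = 60           # Render tab / Dimensions / Frame Rate
bpy.context.scene.sync_mode = 'AUDIO_SYNC'  # timeline sync mode: AV-sync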
By the way, I noticed that Julius strips completely silent frames (zero values). This causes an issue where the sound source and the motion cannot stay in sync. A quick and dirty solution is to add noise to the sound source before processing.
sox tts.wav -p synth whitenoise vol 0.02 | sox -m tts.wav - tts_addednoise.wav
I have a crazy idea to use something like https://github.com/TadasBaltrusaitis/OpenFace to make lip sync from video instead of audio. It could simplify creating some facial-expression effects without actual audio.
It's not such a crazy idea; libraries that capture face morphs from video, such as Live2D/Facerig, are already in use in the virtual YouTuber field. Also, this person has been developing a toolset to capture full motion from video. https://github.com/miu200521358/3d-pose-baseline-vmd video: http://www.nicovideo.jp/watch/sm33232251
Edit: "read facial expression from photo / movie and generate VMD motion" https://github.com/errno-mmd/readfacevmd
Cool :) It's already been done. I'll look at it later and try to integrate it into saba or something, to achieve real-time preview & recording from a webcam.
lip sync Julius to Blender MMD add-on https://sta.sh/07g7actoiqa
Instructions:
Julius dictation kit osdn.net project releases page: https://osdn.net/projects/julius/releases/66544
Julius dictation kit osdn.net download link: https://osdn.net/projects/julius/downloads/66544/dictation-kit-v4.4.zip/
After unzipping the downloaded file (and possibly copying it to the topmost directory of your hard drive), rename the folder dictation-kit-master to dictation-kit .
Windows (and Mac OSX and Linux) instructions:
Copy all of the files from dictation-kit\bin\windows (including julius.exe) into the dictation-kit folder
(I guess that you should copy the files from dictation-kit\bin\osx if you are using Macintosh OSX, and that you should copy the files from dictation-kit\bin\linux if you are using linux.)
Convert a speech audio file to an uncompressed mono, 16000 samples-per-second, .wav audio file which is the input format required by Julius. Here is an example audio file that you can use: https://raw.githubusercontent.com/julius-speech/segmentation-kit/master/wav/sample.wav
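For example, the same ffmpeg invocation shown earlier in this thread can be used, naming the output sample.wav to match the command below (your_speech_file.mp3 is just a placeholder):
ffmpeg -i your_speech_file.mp3 -acodec pcm_s16le -ac 1 -ar 16000 sample.wav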
Copy the following command line to a text file and save it to a file with a .bat extension. (Save it to a .sh file if you are using Mac OSX or Linux.):
echo sample.wav |julius -C ../dictation-kit/main.jconf -h ../dictation-kit/model/phone_m/jnas-mono-16mix-gid.binhmm -palign -fallback1pass -input rawfile -outfile
(The above command line is for an audio file named sample.wav. So, if your audio file is not already named sample.wav, either rename your audio file to sample.wav, or edit the command line to refer to the correct name of your audio file. The name of your audio file must not have any spaces in it.)
Copy this .bat file (or .sh file) and the correctly resampled .wav audio file into the dictation-kit folder.
Run the .bat (or .sh) file from the command line. This should output a file named sample.out (or audiofilename_whatever.out). You can then find this .out file in the dictation-kit folder.
Import this .out file into Blender: select File, Import, Julius import operator. Leave the import frames per second at 30 FPS. The mesh object of an MMD model must be the active object. You should then see a (not-perfect) lip sync animation on your MMD model when you play back the animation.
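This is not the add-on's actual code, but a minimal sketch of the keyframing step it performs, assuming the MMD mouth morphs are plain shape keys named あ/い/う/え/お and that the .out timings have already been converted to milliseconds as described above:

import bpy

FPS = 30.0  # must match the scene frame rate used for the imported audio

def insert_mouth_keys(mesh_obj, segments):
    # segments: list of (morph_name, begin_ms, end_ms) tuples built from the .out file
    key_blocks = mesh_obj.data.shape_keys.key_blocks
    for morph_name, begin_ms, end_ms in segments:
        if morph_name not in key_blocks:
            continue
        kb = key_blocks[morph_name]
        f0 = begin_ms / 1000.0 * FPS
        f1 = end_ms / 1000.0 * FPS
        # mouth closed just before the phoneme, open during it, closed again just after
        kb.value = 0.0
        kb.keyframe_insert("value", frame=f0 - 1)
        kb.value = 1.0
        kb.keyframe_insert("value", frame=f0)
        kb.keyframe_insert("value", frame=f1)
        kb.value = 0.0
        kb.keyframe_insert("value", frame=f1 + 1)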
So, if we want Julius lip sync animation for Blender and MMD, there are basically 2 sets of unsolved problems:
Needing an algorithm to make smooth transitions from one phoneme to the next phoneme.
We want user friendliness: someone just needs to navigate to and select an audio file, and then the lip sync animation is automatically created in Blender. That involves:
Remote controlling Julius and sending command line arguments to it.
Compatibility with 3 different operating systems.
Possible issues with needing administrator privileges to run Julius.
Issues where the audio file does not have the correct format for Julius.
The possible issue of a user not having correctly copied the files from the bin folder.
I have no idea how to program any of this. This is outside my python programming experience.
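For what it's worth, here is a rough, untested sketch of the automation part, calling ffmpeg and Julius from Python with subprocess. The install path, binary name and dictation-kit layout are assumptions based on the instructions above and would need to be adapted per operating system:

import subprocess
from pathlib import Path

DICTATION_KIT = Path("C:/dictation-kit")        # assumed install location
JULIUS_BIN = DICTATION_KIT / "julius"           # julius.exe on Windows

def to_julius_wav(src, dst):
    # resample any audio file to the mono/16 kHz wav Julius expects (needs ffmpeg on PATH)
    subprocess.run(["ffmpeg", "-y", "-i", str(src),
                    "-acodec", "pcm_s16le", "-ac", "1", "-ar", "16000", str(dst)],
                   check=True)

def run_julius(wav_path):
    # run Julius with -palign/-outfile; it should write <wav name>.out next to the wav file
    wav_path = Path(wav_path)
    cmd = [str(JULIUS_BIN),
           "-C", str(DICTATION_KIT / "main.jconf"),
           "-h", str(DICTATION_KIT / "model/phone_m/jnas-mono-16mix-gid.binhmm"),
           "-palign", "-fallback1pass", "-input", "rawfile", "-outfile"]
    # with "-input rawfile", Julius reads the list of input files from stdin
    subprocess.run(cmd, input=wav_path.name + "\n", text=True,
                   cwd=str(wav_path.parent), check=True)
    return wav_path.with_suffix(".out")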
Download LipSynchloid lip sync plug-in (Windows dll) for MikuMikuMoving: https://bowlroll.net/file/29218
The download password is posted in the topmost user comment on the download page. Someone else posted a comment saying that the password does not work, but it really does work. Javascript must be enabled in your web browser.
LipSynchloid video tutorials: https://www.youtube.com/watch?v=UIWbeCqvj5c
https://sites.google.com/site/khuuyjblend/home/blender/script/lipsync
Here is a lip synchronization tool by a Japanese author which uses Blender, Python, and an open source speech recognition engine called Julius. It may be a while before I have time to take a look at it, but maybe someone else would like to look at it now, so I am sharing the link. Any thoughts about how this could be adapted to MMD and mmd_tools? The ideal would be a direct conversion of an audio file to a VMD talking animation, to speed up making a lip sync animation.