kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

Simple instructions for the 99% #4317

Closed WinEunuuchs2Unix closed 3 years ago

WinEunuuchs2Unix commented 3 years ago

I've been reading the documentation for an hour, and I appreciate the great lengths taken to describe the ins and outs of Kaldi. But I wonder if we could have a decaffeinated version for those who aren't rocket scientists or speech recognition PhDs.

Like many people I simply want to take an .m4a, .oga or .wav file on the left hand and lyrics looked up from the internet for a song file on the right hand and merge them together to generate a file of what word appears at what time index.
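A minimal sketch of what such a "word at time index" file could look like, assuming some aligner has already produced `(word, start_seconds)` pairs. The `[mm:ss.xx]` tag follows the common LRC lyrics convention; writing one word per line is only an illustrative choice, and both function names are hypothetical.

```python
def format_lrc_timestamp(seconds):
    """Format a time in seconds as an LRC-style [mm:ss.xx] tag."""
    total_cs = int(round(seconds * 100))        # work in hundredths of a second
    minutes, remainder = divmod(total_cs, 6000)
    secs, hundredths = divmod(remainder, 100)
    return "[{:02d}:{:02d}.{:02d}]".format(minutes, secs, hundredths)


def words_to_lines(timed_words):
    """Turn (word, start_seconds) pairs into one timestamped line per word.

    The input shape is an assumption about what an aligner would hand back.
    """
    return [format_lrc_timestamp(start) + word for word, start in timed_words]


if __name__ == "__main__":
    for line in words_to_lines([("hello", 0.5), ("world", 75.5)]):
        print(line)
```

A music player could then read these lines back and highlight each word when playback reaches its tag.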

I'm using Ubuntu 16.04 and will be forced to upgrade to 20.04 next year. I'm just learning Python to develop my own music player and I'm sure lots of other python programmers have done this before me. I don't mind calling bash / shell commands from within python and indeed I do it often already for wmctrl, xclip, xsel, ps aux, xdotool, cp -a, touch and other commands python doesn't do natively. I'm not looking for a python wrapper to c++ and have no qualms about shelling out (subprocess) to a kaldi compiled c++ executable. Nor is compiling a concern.
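As a rough illustration of the shell-out approach, the sketch below converts an input file to mono WAV before handing it to an aligner. It assumes `ffmpeg` is installed (it is not mentioned in this thread), and the 16 kHz sample rate is an assumption that should be checked against whatever acoustic model is used. The function names are hypothetical.

```python
import subprocess


def build_ffmpeg_cmd(src, dst, sample_rate=16000):
    """Build an ffmpeg command that writes a mono WAV at the given sample rate.

    -y overwrites dst, -ac 1 downmixes to one channel, -ar sets the sample rate.
    """
    return ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", str(sample_rate), dst]


def convert_to_wav(src, dst, sample_rate=16000):
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_cmd(src, dst, sample_rate), check=True)
    return dst
```

Passing the command as a list rather than a single string avoids shell quoting problems with song file names that contain spaces or apostrophes.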

I think there could be an accelerated tutorial for the most common applications, skipping discussion of clusters of Sun Microsystems supercomputers or NVIDIA CUDA (even though I have two laptops with NVIDIA GPUs on my home LAN).

Please don't interpret this as a complaint. The documentation I've read to date is wonderful. I just wish there were a fast track for the 99% who simply want their music player to display lyrics synchronized to the soundtrack. That said, your programs are exciting and I look forward to learning some of them.

If you already have documentation for the 99% who want synchronized song lyrics please point out the link and forget this comment.

galv commented 3 years ago

I have used kaldi for a long time, and I have been using it lately for something called "forced alignment", which is quite similar to getting the timestamps for each word in song lyrics, so I can comment on this.

If you already have the song lyrics text, you can try following this tutorial: https://www.eleanorchodroff.com/tutorial/kaldi/forced-alignment.html (note that I am not associated with it and cannot vouch for how good it is, but I believe the author was a PhD student who used kaldi extensively for forced alignment). It uses "GMM-HMM" models, which don't require a lot of compute (no need to think about CUDA).
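Forced-alignment results in Kaldi workflows are commonly exported as CTM text, where each line holds an utterance id, a channel, a start time, a duration, and a token. The sketch below parses such lines into start and end times. It assumes a word-level CTM is already available; producing one depends on the recipe followed, and the names `WordTime` and `parse_ctm` are hypothetical.

```python
from collections import namedtuple

WordTime = namedtuple("WordTime", ["utterance", "word", "start", "end"])


def parse_ctm(text):
    """Parse CTM lines: <utt-id> <channel> <start> <duration> <word> [confidence]."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";;"):   # skip blank lines and CTM comments
            continue
        utt, _channel, start, duration, word = line.split()[:5]
        start = float(start)
        entries.append(WordTime(utt, word, start, start + float(duration)))
    return entries


if __name__ == "__main__":
    sample = "song1 1 0.50 0.25 hello\nsong1 1 0.75 0.50 world\n"
    for entry in parse_ctm(sample):
        print(entry.word, entry.start, entry.end)
```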

Note that there is other forced alignment software out there, but it can all be rather "academic" (i.e., no easy-to-use interface).

If you don't have the song lyrics text or that tutorial doesn't work for some reason, I honestly recommend just using an ASR service. They have free tiers. Google's ASR system provides timestamps here for example: https://cloud.google.com/speech-to-text/docs/basics#time-offsets
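For the ASR-service route, the sketch below pulls word timings out of a response that has already been decoded from JSON. It assumes the REST response shape described in the linked time-offsets documentation, with a `words` list whose `startTime` and `endTime` values are strings such as `"1.300s"`. It makes no network call, and the function names are hypothetical.

```python
def parse_offset(value):
    """Convert a duration string such as '1.300s' to float seconds."""
    return float(value.rstrip("s"))


def word_offsets(response):
    """Return (word, start, end) tuples from the top alternative of each result.

    `response` is assumed to be the JSON body already decoded into a dict.
    """
    words = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        for info in alternatives[0].get("words", []):
            start = parse_offset(info.get("startTime", "0s"))
            end = parse_offset(info.get("endTime", "0s"))
            words.append((info["word"], start, end))
    return words
```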

Good luck, Daniel

-- Daniel Galvez http://danielgalvez.me https://github.com/galv

ognjentodic commented 3 years ago

You could take a look at https://montreal-forced-aligner.readthedocs.io/en/latest/ -- it uses Kaldi under the hood for the forced alignment task that Daniel mentioned.
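The Montreal Forced Aligner works on a corpus directory in which each audio file sits next to a transcript file with the same base name. A minimal sketch of preparing such a directory for one song follows, assuming a `.lab` transcript and a WAV input; the function name and paths are hypothetical. The alignment itself is then run with MFA's command-line tool, whose exact invocation depends on the installed version, so check its documentation.

```python
import os
import shutil


def prepare_corpus(audio_path, lyrics, corpus_dir):
    """Copy the audio into corpus_dir and write a same-named .lab transcript."""
    os.makedirs(corpus_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(audio_path))[0]
    audio_copy = os.path.join(corpus_dir, os.path.basename(audio_path))
    shutil.copyfile(audio_path, audio_copy)
    transcript = os.path.join(corpus_dir, base + ".lab")
    with open(transcript, "w", encoding="utf-8") as handle:
        # Collapse line breaks and extra spaces into a single line of words.
        handle.write(" ".join(lyrics.split()) + "\n")
    return audio_copy, transcript
```

Lyrics copied from the internet often contain markers such as "[Chorus]" or "x2"; these would need to be removed or expanded first, since the aligner expects the text to match what is actually sung.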

I did quite a bit of music/lyric synchronization a loooong time ago. Heads up that GMM models might not be very robust to a lot of "background noise" (ahem, anything that's not singing)... I am not sure what models the Montreal aligner uses. Another thing you could do is extract only the voice/singing from the audio for the purpose of alignment. I remember seeing a mention of a paper a few days ago (perhaps from Facebook or Spotify, I can't recall) that was doing this.

jtrmal commented 3 years ago

Yeah, I've heard good things about the Montreal aligner. Alignment as such is an easier task than decoding, so the models do not have to be well customized to the channel. y.

kkm000 commented 3 years ago

I'm closing this issue because it is a question, not a feature proposal or a defect report. We have a forum that is the best place to ask questions and to get answers and input from other users as well. We are hanging out there all the time, too. Here, in the issue tracker, we are just too few.

Kaldi help forum: https://groups.google.com/forum/#!forum/kaldi-help

Instructions for joining: http://kaldi-asr.org/forums.html

If you think I misunderstood your intention, please reply, and I'll reopen it.

WinEunuuchs2Unix commented 3 years ago

@kkm000 In a way it is a feature request for documentation targeted at folks who only want to do forced alignment and not full-blown speech recognition. That said, before you closed this issue I got some great advice :) Thank you @jtrmal and all the others.

kkm000 commented 3 years ago

Open it?

WinEunuuchs2Unix commented 3 years ago

@kkm000 are you asking to re-open this closed issue or are you asking me to open a new issue in the forums?

If the former, I would answer "yes" because other great answers may come. If the latter, I'm reluctant to open a new issue somewhere else because great answers have already been posted here, and they would have to be duplicated there to benefit others seeking a streamlined synchronized song lyrics setup.
