Plan to update to use nnet3?

migueljette commented 7 years ago

Hi there,

I was wondering if there is a plan to update the code to use nnet3. Or maybe someone else has done this already? Also, I do plan on sharing the steps for producing new models that can be used with the current code, but I think nnet3 would be better than nnet2. Cheers, Miguel

strob commented 7 years ago

Hello! Great question: I'd love to see if alignment accuracy can improve with nnet3. The Gentle codebase currently uses online decoding, which I understand is not supported in nnet3 (?), but may not be hugely relevant. The other barrier is availability of pre-trained models. We are currently using the pre-trained Fisher English model provided by the Kaldi project. I don't have access to the LDC corpus, personally, so would probably switch to librispeech, but that may add another layer of complexity. Would love your thoughts on and help with this!

-Robert

nshmyrev commented 7 years ago

The Gentle codebase currently uses online decoding, which I understand is not supported in nnet3

Online decoding is supported in nnet3, same interface as in nnet2

The other barrier is availability of pre-trained models.

The ASPIRE chain model was recently released at http://kaldi-asr.org/models.html. It is more accurate than fisher and much faster too.

migueljette commented 7 years ago

Amazing news! Thanks for sharing @nshmyrev ! I didn't know about the release of the aspire model from Dan.

@strob -- So it seems like all barriers are out. :) If you find the time to do this, it would be very interesting indeed! nnet3 is faster and more accurate!

A more accurate model would for sure help with your alignment. I built my own models with data I have at work here and used your tool and the alignment I got was way better (fewer missing words). But I understand that it might not be your priority. :) The nice thing too is that aspire would be good in noisy condition, which would help alignment of "difficult audios".

By the way, do you have a way to "estimate" the timing for the missed words? Would be amazing to have a flag to do that. The simplest form would be to estimate by spreading the timing to the missed words (say you have a bloc of Y seconds with N words... they each get assigned Y/N seconds).

strob commented 7 years ago

I'll try to find some time to look into nnet3/aspire this week: very exciting developments!

On Mon, Jan 23, 2017, 4:48 PM Mig notifications@github.com wrote:

Amazing news! Thanks for sharing @nshmyrev https://github.com/nshmyrev ! I didn't know about the release of the aspire model from Dan.

@strob https://github.com/strob -- So it seems like all barriers are out. :) If you find the time to do this, it would be very interesting indeed! nnet3 is faster and more accurate!

A more accurate model would for sure help with your alignment. I built my own models with data I have at work here and used your tool and the alignment I got was way better (fewer missing words). But I understand that it might not be your priority. :) The nice thing too is that aspire would be good in noisy condition, which would help alignment of "difficult audios".

By the way, do you have a way to "estimate" the timing for the missed words? Would be amazing to have a flag to do that. The simplest form would be to estimate by spreading the timing to the missed words (say you have a bloc of Y seconds with N words... they each get assigned Y/N seconds).

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/lowerquality/gentle/issues/124#issuecomment-274524231, or mute the thread https://github.com/notifications/unsubscribe-auth/AAMup9qQLN0vOvO84q_LZyqqO26jocDiks5rVMuxgaJpZM4LqGIy .

strob commented 7 years ago

ASPIRE/nnet3 seems to be working!

I've pushed a fairly rough stab at it to master, and it's now running on http://gentle-demo.lowerquality.com

Overall the process was fairly smooth. A few hitches:

The filesize of the models for alignment (ie. without support for full transcription) has increased, zipped, from 57mb to 154mb. This seems primarily due to the growth of tdnn_7b_chain_online/final.mdl. This is a big jump in size, and will particularly affect the DMG user-experience (I'm hesitant to push out a 200mb DMG download). Any ideas about why this happened and if there are any workarounds would be much appreciated.
There seems to be some instability when trying to decode very short (<1sec) chunks of audio. I've modified multipass alignment to avoid triggering this failure mode, but, again, I'm not quite sure where it's coming from yet.
[oov] has disappeared from the vocabulary. I've attempted to replace it with <unk> for alignment and disfluency support, but I've not yet measured the implications. Something to look into.

Very curious for others to test! Something like this should update (though there always seem to be kaldi re-compilation complications & perhaps more make cleaning will be necessary...):

git pull
git submodule update
sh install_models.sh
sh install_language_model.sh
cd ext
sh install_kaldi.sh
make

migueljette commented 7 years ago

Hi @strob ! Nice work!! It works perfectly. I installed the new code and ran four tests. The aspire model is much better, in my tests, than the previous model you had (it misses fewer words and is more robust to noises). So I think it's a win! The size issue is most likely just due to the fact that the aspire model is a way bigger/better model... might be the cost to have better alignment. I would think that "" is fine. It's basically a garbage model, so it probably acts the same way as "[oov]" did. But I have no idea about your second question (small segments). Not sure if it's a model thing or a nnet3 thing... In any case, I will try to plug in my own nnet3 model to see if it's as easy as it was before. :)

Thank you for this! Very good stuff!

lowerquality / gentle

Plan to update to use nnet3? #124