migueljette opened this issue 7 years ago
Hello! Great question: I'd love to see if alignment accuracy can improve with nnet3. The Gentle codebase currently uses online decoding, which I understand is not supported in nnet3 (?), but that may not be hugely relevant. The other barrier is the availability of pre-trained models. We are currently using the pre-trained Fisher English model provided by the Kaldi project. I don't personally have access to the LDC corpus, so I would probably switch to LibriSpeech, but that may add another layer of complexity. I would love your thoughts on, and help with, this!
-Robert
> The Gentle codebase currently uses online decoding, which I understand is not supported in nnet3

Online decoding is supported in nnet3, with the same interface as in nnet2.

> The other barrier is availability of pre-trained models.

The ASPIRE chain model was recently released at http://kaldi-asr.org/models.html. It is more accurate than the Fisher model and much faster too.
Amazing news! Thanks for sharing @nshmyrev ! I didn't know about the release of the aspire model from Dan.
@strob -- So it seems like all barriers are out. :) If you find the time to do this, it would be very interesting indeed! nnet3 is faster and more accurate!
A more accurate model would for sure help with your alignment. I built my own models with data I have at work here and used your tool, and the alignment I got was way better (fewer missing words). But I understand that it might not be your priority. :) The nice thing too is that aspire should be good in noisy conditions, which would help alignment of "difficult" audio.
By the way, do you have a way to "estimate" the timing for the missed words? It would be amazing to have a flag to do that. The simplest form would be to spread the timing evenly over the missed words (say you have a block of Y seconds with N words... they each get assigned Y/N seconds).
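For instance, something like this as a post-processing step (just a toy sketch of the idea; the word/start/end field names are my assumption, not necessarily Gentle's exact output format):

```python
def spread_gap_timings(words):
    """Evenly distribute the duration of each unaligned gap over its missed words.

    `words` is a list of dicts like {"word": ..., "start": ..., "end": ...}
    with times in seconds, where missed words have start/end set to None.
    """
    i = 0
    while i < len(words):
        if words[i]["start"] is not None:
            i += 1
            continue
        # Find the run of consecutive missed words.
        j = i
        while j < len(words) and words[j]["start"] is None:
            j += 1
        # The gap runs from the end of the previous aligned word to the
        # start of the next one (fall back to the gap start at the edges).
        gap_start = words[i - 1]["end"] if i > 0 else 0.0
        gap_end = words[j]["start"] if j < len(words) else gap_start
        step = max(gap_end - gap_start, 0.0) / (j - i)  # Y / N seconds per word
        for k in range(i, j):
            words[k]["start"] = gap_start + (k - i) * step
            words[k]["end"] = gap_start + (k - i + 1) * step
        i = j
    return words
```

So a 1.2-second gap containing three missed words would give each of them 0.4 seconds.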
I'll try to find some time to look into nnet3/aspire this week: very exciting developments!
ASPIRE/nnet3 seems to be working!
I've pushed a fairly rough stab at it to master, and it's now running on http://gentle-demo.lowerquality.com
Overall the process was fairly smooth. A few hitches:
- The filesize of the models for alignment (i.e. without support for full transcription) has increased, zipped, from 57mb to 154mb. This seems primarily due to the growth of `tdnn_7b_chain_online/final.mdl`. This is a big jump in size, and will particularly affect the DMG user experience (I'm hesitant to push out a 200mb DMG download). Any ideas about why this happened, and whether there are any workarounds, would be much appreciated.
- There seems to be some instability when trying to decode very short (<1sec) chunks of audio. I've modified multipass alignment to avoid triggering this failure mode (roughly along the lines of the sketch after this list), but, again, I'm not quite sure where it's coming from yet.
- `[oov]` has disappeared from the vocabulary. I've attempted to replace it with `<unk>` for alignment and disfluency support, but I've not yet measured the implications. Something to look into.
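For the short-chunk issue, the guard is roughly along these lines (a simplified sketch of the idea, not the actual code in the commit; the threshold and the (start, end) chunk representation are placeholders):

```python
# Simplified sketch: merge audio chunks that are too short to decode reliably
# into their neighbours before running the realignment pass.
MIN_CHUNK_SECONDS = 1.0  # placeholder threshold for "very short"

def merge_short_chunks(chunks, min_len=MIN_CHUNK_SECONDS):
    """Merge any (start, end) chunk shorter than min_len into the previous chunk."""
    merged = []
    for start, end in chunks:
        if merged and (end - start) < min_len:
            prev_start, _prev_end = merged[-1]
            merged[-1] = (prev_start, end)  # extend the previous chunk instead
        else:
            # A leading short chunk has no predecessor, so it is kept as-is.
            merged.append((start, end))
    return merged
```

With something like that in place, the decoder never sees a sub-second segment on the second pass.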
Very curious for others to test! Something like this should update (though there always seem to be kaldi re-compilation complications & perhaps more `make clean`ing will be necessary...):
```sh
git pull
git submodule update
sh install_models.sh
sh install_language_model.sh
cd ext
sh install_kaldi.sh
make
```
Hi @strob !
Nice work!! It works perfectly. I installed the new code and ran four tests. The aspire model is much better, in my tests, than the previous model you had (it misses fewer words and is more robust to noise). So I think it's a win!
The size issue is most likely just because the aspire model is a much bigger (and better) model... that might be the cost of better alignment.
I would think that "
Thank you for this! Very good stuff!
Hi there,
I was wondering if there is a plan to update the code to use nnet3. Or maybe someone else has done this already? Also, I do plan on sharing the steps for producing new models that can be used with the current code, but I think nnet3 would be better than nnet2. Cheers, Miguel