srvk / eesen

The official repository of the Eesen project
http://arxiv.org/abs/1507.08240
Apache License 2.0
824 stars, 342 forks

Real-time decoding #141

Open ericbolo opened 6 years ago

ericbolo commented 6 years ago

What would be the main steps for building a real-time decoder on top of EESEN?

I read in the EESEN paper that composing the tokens, lexicon and grammar speeds up decoding a great deal, and I'd like to leverage that in a real-time context: capture an audio stream and output the transcripts progressively.

Is that by any chance in the works?

If not, I could try and give it a shot.

Thank you for this great project

fmetze commented 6 years ago

Eric,

thanks for the kind words. The main problem is the use of the bi-directional LSTM as the acoustic model, which in theory requires you to have the whole segment available before evaluating the acoustic model, i.e., no run-on or streaming capability like you want.

There are several papers on how to get around this, because all of the really important information for ASR seems to be contained in a 200 ms window, so not all of the right context is needed. Within our TensorFlow implementation, we want to try a CNN for that rather than the BLSTM, so we could get around that limitation easily, but we have not gotten around to implementing it. The second challenge, of course, is to implement the code for streaming the audio and actually processing it. There is probably some Kaldi code that we could leverage.

If you're interested, this would be a great addition to Eesen. We're using the fast decoding and small memory footprint for processing large amounts of data, but you're right, it would go well with real-time recognition, too. Want to try?

Let me know what you think!

ericbolo commented 6 years ago

Thank you for your response!

With the current BiLSTM setup, what would be required, then, is a feature window that slides over the input and includes, for each frame, all of the left context and just enough of the right context. Is that correct?

Kaldi does have boilerplate and tools for online decoding: http://kaldi-asr.org/doc/online_decoding.html

That page also mentions per-speaker cepstral mean and variance normalization (CMVN), which I see is used in the example EESEN scripts. What I have in mind is a system that takes any user's voice and outputs a transcript, without necessarily knowing who the speaker is or having access to past utterances. Any challenges to foresee?

ericbolo commented 6 years ago

For online decoding with neural nets, Kaldi recommends constructing an i-vector that summarizes speaker properties, and training the neural net with audio features + i-vector. In the absence of past utterances, the i-vector is built from the audio from time 0 to some time t. So there would be an additional delay from building the i-vector, if I understand correctly.

From http://kaldi-asr.org/doc/online_decoding.html: "Our best online-decoding setup, which we recommend should be used, is the neural net based setup. The adaptation philosophy is to give the neural net un-adapted and non-mean-normalized features (MFCCs, in our example recipes), and also to give it an iVector. ... Our idea is that the iVector gives the neural net as much as it needs to know about the speaker properties. This has proved quite useful. The iVector is estimated in a left-to-right way, meaning that at a certain time t, it sees input from time zero to t. It also sees information from previous utterances of the current speaker, if available."

fmetze commented 6 years ago

Yes, that is by and large correct.

The big challenge is speaker diarization, unless you only have one speaker in your audio channel. Imagine you have two speakers, a loud male and a soft female, in the same channel of the recording that you want to recognize. Even if they don't overlap, you need to separate them somehow, either by estimating and updating the corresponding cepstral means, or the corresponding i-vectors.
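
To make the "estimating and updating the corresponding cepstral means" option concrete, here is a minimal sketch of streaming per-speaker CMVN; the class and its update rule are illustrative assumptions, not code from Eesen or Kaldi:

```python
import numpy as np

class OnlineCMVN:
    """Streaming cepstral mean/variance normalization: accumulate
    per-channel statistics frame by frame and normalize each frame
    with the estimate available so far."""

    def __init__(self, dim):
        self.n = 0
        self.sum = np.zeros(dim)
        self.sumsq = np.zeros(dim)

    def normalize(self, frame):
        self.n += 1
        self.sum += frame
        self.sumsq += frame ** 2
        mean = self.sum / self.n
        var = self.sumsq / self.n - mean ** 2
        return (frame - mean) / np.sqrt(np.maximum(var, 1e-8))
```

Tracking two speakers in one channel would require resetting or forking these statistics whenever diarization decides the speaker has changed.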

I could imagine that a neural network learns the normalization properties from the i-vectors, but I wonder whether that doesn't introduce all sorts of other failure modes, like means not being zero, etc.

In sum, the BLSTM problem is just one problem that you need to solve for on-line recognition. Can you see a solution that would work for you for the above issues, so we can think about the BLSTM issue specifically?

ericbolo commented 6 years ago

Ok, I now have a pretty good understanding of the diarization/speaker normalization issues, none of them insurmountable in my application.

For now, I can focus on the decoding of the BLSTM outputs. You mentioned that ASR probably only needs a 200 ms window; are there any papers you could point me to?

fmetze commented 6 years ago

http://www.asru2015.org/Papers/ViewPapers.asp?PaperNum=1103


ericbolo commented 6 years ago

Thank you for the link.

I gave it a quick read; my only worry is that in the paper the data is pre-segmented with a GMM/HMM system, and the training does not use the CTC loss.

Could the 0.5 s spectral window be too small for CTC loss training?

ericbolo commented 6 years ago

Elaborating: from the paper, I understand they sub-sample the data to avoid overfitting, so we don't have access to all the outputs of the utterance, which could hamper CTC training.

fmetze commented 6 years ago

Right, the paper does not use the CTC loss, but I don't think this would matter much, certainly not for the LSTMs, which is where we have the recurrent connections. CTC affects the way the error signal is computed during training, but during inference the computation is straightforward and independent of the window. Except, of course, that we do need a segmentation, which is the other problem you mention. The exact value of the window, 2×0.2 s or 0.5 s, would have to be determined through experiments, of course.
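
As a minimal sketch of this run-on evaluation, assuming a hypothetical `acoustic_model(feats)` callable that maps a `[num_frames, feat_dim]` array to per-frame label posteriors (the chunk and right-context sizes below are placeholders to be tuned experimentally):

```python
import numpy as np

def streaming_posteriors(feats, acoustic_model, chunk=50, right_ctx=20):
    """Emit posteriors chunk by chunk, delaying each chunk until
    right_ctx future frames (20 frames = 200 ms at a 10 ms shift)
    are available, approximating the BLSTM's backward pass."""
    outputs = []
    num_frames = len(feats)
    for start in range(0, num_frames, chunk):
        end = min(start + chunk, num_frames)
        # all left context seen so far, plus the limited right context
        window = feats[:min(end + right_ctx, num_frames)]
        post = acoustic_model(window)
        outputs.append(post[start:end])  # keep only the new frames
    return np.concatenate(outputs)
```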

ericbolo commented 6 years ago

I've stumbled on this paper, which proposes a kind of BLSTM that is compatible with online decoding: http://ieeexplore.ieee.org/document/7953176/

Thought it might be of interest

fmetze commented 6 years ago

Yes, it is of interest. Are you still trying to look into this, and maybe implement this or something similar? I was hoping to do something here during the semester, but it seems we don't have enough hands as is ...


ericbolo commented 6 years ago

I'm still very much interested in online decoding, yes, but unfortunately my hands are full this month and the next.

If anyone is interested in working on this with me starting in early April, do let me know!


fmetze commented 6 years ago

yep, help wanted! probably even in April!


efosler commented 6 years ago

What's the current status on this? I'm starting the (crazy) sabbatical project and it's pretty clear that some sort of online decoding mechanism is going to be necessary. I can probably pitch in to help but I'll be shaking the rust off of my coding skills. @ericbolo, @fmetze any interest in this?

ericbolo commented 6 years ago

Hi Eric,

I'm still interested in online decoding, but company priorities have caught up with me and I can't do it single-handedly.

That said, if we can team up and brainstorm beforehand, I'd be more than happy to contribute!


ericbolo commented 6 years ago

From @fmetze's answer above, implementing online decoding requires (1) audio streaming capability (work required, but not exploratory; some open-source tools are probably available), and (2) either overhauling the model to be compatible with true real-time processing (a CNN? WaveNet?) or keeping the current BLSTM architecture and adding a ~200 ms window to capture the right context. Generally, a CNN is much faster than an LSTM, so we should take that into account as well.

Does anyone know of a good CNN architecture for acoustic modeling?
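
For illustration only (not an endorsement of a particular architecture), a minimal TensorFlow sketch of a limited-context convolutional acoustic model; every size here is a placeholder:

```python
import tensorflow as tf

def cnn_acoustic_model(feat_dim, num_labels, num_layers=5, width=3, channels=256):
    """Stack of dilated causal 1-D convolutions over the feature stream.
    Receptive field is 1 + (width - 1) * (2**num_layers - 1) frames of
    left context and no right context; a fixed lookahead could be added
    by delaying the output a few frames."""
    inp = tf.keras.Input(shape=(None, feat_dim))  # [time, features]
    x = inp
    for i in range(num_layers):
        x = tf.keras.layers.Conv1D(channels, width, dilation_rate=2 ** i,
                                   padding="causal", activation="relu")(x)
    out = tf.keras.layers.Dense(num_labels)(x)    # per-frame CTC logits
    return tf.keras.Model(inp, out)
```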


efosler commented 6 years ago

@fmetze's comment was that CTC seems to be relatively dependent on the LSTM recurrent connections (which makes sense when I think about it). A CNN with a wide enough window would probably do OK, though, although @fmetze might have some thoughts on that.

The idea of having a forward LSTM plus a DNN-based representation of the future (à la the paper @ericbolo pointed out) would probably not be difficult to implement.

The real question is what branch to target. I think, in talking with @fmetze, that the tf branch is the future for Eesen. That does make it easier to integrate different acoustic models. However, for the eventual application I'm looking at I'm not sure how I feel about "python in the loop".

ericbolo commented 6 years ago

Regarding the branch, I for one am a lot more comfortable with tf.

And sorry, but what do you mean by "python in the loop"? And what would be the broad specs of your eventual application?


ramonsanabria commented 6 years ago

Hi,

In the speech course last year, some students did a good analysis of CNN architectures for ASR. I can try to look for that.

I have seen good results using architectures like VGG, although this may come at some expense on the computation side.

One possibly interesting solution for using a unidirectional LSTM is to train a BLSTM and then minimize the KL divergence between the outputs of a unidirectional (untrained) model and the fully trained bidirectional LSTM (https://arxiv.org/pdf/1711.02212.pdf).

Regarding having Python in the loop (if this is a concern), there is a TensorFlow C++ API that you could maybe use.

Thanks!


ericbolo commented 6 years ago

I find the method you mention from https://arxiv.org/pdf/1711.02212.pdf appealing. For the benefit of others in this thread, I'll outline the system:

1/ Train a BLSTM with the CTC loss, which Eesen already does. Call this the teacher model.

2/ Next, train the student model: a forward-only (unidirectional) LSTM compatible with online decoding, with the loss being the KL divergence between the teacher and student networks' output distributions. In the paper, they report that this method reduces the WER by a large margin relative to randomly initialized unidirectional LSTMs.

The system still benefits from CTC, and the model code should be easy to write.
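
A minimal sketch of that student-teacher loss, assuming teacher and student both emit per-frame logits of shape `[batch, time, labels]` (whether the paper uses exactly this frame-level KL or a variant is worth double-checking):

```python
import tensorflow as tf

def student_teacher_loss(teacher_logits, student_logits):
    """Frame-level KL(teacher || student) over the CTC label set,
    including the blank. The teacher is frozen: no gradient flows
    through its posteriors."""
    t_post = tf.stop_gradient(tf.nn.softmax(teacher_logits))
    t_logp = tf.stop_gradient(tf.nn.log_softmax(teacher_logits))
    s_logp = tf.nn.log_softmax(student_logits)
    kl = tf.reduce_sum(t_post * (t_logp - s_logp), axis=-1)  # [batch, time]
    return tf.reduce_mean(kl)
```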


efosler commented 6 years ago

This seems like a reasonable step (student-teacher) and simple to implement.

Re: python in the loop: I'm working on an eesen-in-the-browser project in order to enable some other stuff I want to work on (read: want my students to work on...). It's a bit of a crazy lark: compiling C++ into JavaScript (asm.js) via emscripten, which would be problematic for Python (although @ramonsanabria is right that I could reimplement it in C++). It could be a big fail, but could be interesting if it works (and I need the online decoding for it). I think I'm going to retract what I said about tf, though: thinking it through, the Google folks have already made a JavaScript version available at js.tensorflow.org, which could probably run the net models more efficiently than any attempt I make.

ericbolo commented 6 years ago

As I was looking into the tf_clean branch, these other design questions came to mind:

ericbolo commented 6 years ago

#193, #155: before delving into online decoding, it might be a good idea to have a fully working example of the TensorFlow acoustic model with WFST decoding.

Do you agree? @efosler, I believe you have started working on that; how may I contribute?

fmetze commented 6 years ago

@efosler and @ramonsanabria have "full working examples" of WFST decoding for the TensorFlow code base that Ramon created. I think they are checked in, but maybe not in the same branch?


ericbolo commented 5 years ago

Update: I'm currently training a forward LSTM with TensorFlow, with different loss functions (CTC only, student-teacher loss, etc.).

Looking ahead, I'm studying examples of online decoding. Kaldi has online feature extractors for MFCCs and PLPs, but not for filterbanks. The EESEN paper, as well as the tedlium example, uses filterbanks rather than MFCCs.

Any reason for choosing fbanks over MFCCs? Is it simply extraction speed, since MFCCs are just filterbanks with post-processing? If I train with MFCC features, do you expect I'll get similar results?

If using MFCCs is OK, I'll train with those to avoid having to implement my own online extractor.

efosler commented 5 years ago

Our experience is that log filterbanks do work somewhat better than MFCCs, but if you're mostly working on the pipeline at the moment, the hit you'll take from MFCCs will not be large (and it should be easy to sub in online log-mel filterbanks later).

FWIW, rolling your own log-mel filterbank can also be a bit treacherous (although it shouldn't be). We found that using scikit to build features, rather than Kaldi, was giving us suboptimal performance. We ended up tracing it down to windowing differences, IIRC (we needed the window length to be a multiple of the frame shift for our application, which isn't the default in Kaldi).
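
For reference, a minimal numpy sketch of a log-mel front end whose window length is an exact multiple of the frame shift (the constraint mentioned above); the sizes are illustrative, and Kaldi applies extra steps (dithering, pre-emphasis, its own mel warping options) not reproduced here:

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular mel filters mapping an rfft power spectrum to n_mels bands."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def logmel(signal, sr=16000, shift=160, length=480, n_mels=40, n_fft=512):
    # 30 ms window = 3 x the 10 ms shift (Kaldi's default window is 25 ms)
    frames = [signal[s:s + length] * np.hamming(length)
              for s in range(0, len(signal) - length + 1, shift)]
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    return np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
```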

ericbolo commented 5 years ago

Great, MFCCs it is then (for now).

Thank you!