k2-fsa / snowfall

Moved to https://github.com/k2-fsa/icefall

Plan for dealing with phonetic (or graphemic) context #120

Open · danpovey opened this issue 3 years ago

danpovey commented 3 years ago

I have come up with a plan for how to deal with phonetic/graphemic context.

For now I'll use the label "phonetic", but this is without loss of generality.

There's no reason to be limited by historical constraints etc., so this is a little radical, but also not super hard to implement. It's within-word context: we use the entire within-word phone sequence, plus a neural net, to decide the symbol sequence to use. [Note: as an extension, there is no reason why we can't also add in an embedding of the written form of the word.] The symbols themselves can just be written as the phone's name plus some arbitrary postfix, e.g. "c ae t" -> "c8 ae3 t5" or something like that. (In order to add words to the lexicon, you'd need access to this neural net plus some codebook entries, as I'll describe.)
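
(Just to illustrate the naming scheme; the indices below are made up and would in practice come from the clustering described next:)

```python
def cd_symbols(phones, cluster_ids):
    """E.g. cd_symbols(["c", "ae", "t"], [8, 3, 5]) -> ["c8", "ae3", "t5"]."""
    return [f"{p}{i}" for p, i in zip(phones, cluster_ids)]
```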

The basic idea is:

  1. Using alignments of a trained system (preferably a relatively simple system like TDNN), obtain some kind of statistics for each phone position of each pronunciation of each word. I'll describe some of the details below, of how we could do this.
  2. Train a not-too-big neural net to predict those statistics given those phone sequences (of all the pronunciations in our lexicon). This can be viewed as a kind of smoothing+backoff step.
  3. Given a "budget" of how many symbols we are allowed to use, do some kind of k-means-like algorithm to allocate context-dependent phones. (In general each phone will have a different number of context-dependent versions).
  4. Use the neural net and the k-means centers to decide the context-dependent symbol sequence for each pronunciation in the lexicon. The output of this stage is just a lexicon.txt (or lexiconp.txt) file, associated with a larger phone-set (phones.txt), and as far as our system is concerned it is still context-independent.

The statistics can probably be phone posterior statistics, i.e. outputs after softmax (we'd have to take log). The "objective function" (maximized in things like k-means) would be the average log-likelihood, i.e. sum_i count_i log(p_i), where p_i = count_i / sum_j count_j. [An alternative would be to take some kind of activations and use sum-squared loss, but I think what I described may be more robust, since it's related to the objective function.]
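
To spell that out, here is a minimal sketch (the function name is mine, not anything in snowfall) of the per-cluster objective, computed from a vector of accumulated posterior counts:

```python
import torch

def cluster_objf(counts: torch.Tensor) -> torch.Tensor:
    """sum_i count_i * log(p_i), where p_i = count_i / sum_j count_j.
    `counts` is a 1-D vector of accumulated (soft) posterior counts for one
    cluster; zero counts are treated as contributing zero."""
    p = counts / counts.sum().clamp(min=1e-20)
    return (counts * p.clamp(min=1e-20).log()).sum()
```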

More detail:

  1. In order to accumulate these statistics, I suggest the following. First, assign a number to each line in lexicon.txt (or lexiconp.txt, but I'll call it lexicon.txt), i.e. to each pronunciation; this can just be the line number itself. Then assign a number to each phone instance in lexicon.txt, starting from the first line. We'll write out these symbols when we convert lexicon.txt to an FST; the script that creates L.fst can add these symbols. Instead of "src dest ilabel olabel cost", we'll have a format like "src dest label aux_label1 aux_label2... cost". We can just use "aux_labels2", "aux_labels3" and so on for the names of these by default when we read such things.

We can write a program that will do the alignment and accumulate these statistics and write them out.

  2. Training a neural net to predict the statistics: note, the objective function is the one I described above, which is an expected log-likelihood (we can negate that and call it a loss function).

  3. After running the neural net to get "expected stats" for each position of each word in our lexicon, we run some kind of k-means-type algorithm. Note, it's not exactly k-means, as we are measuring this special objective, and also because there are multiple baskets with an overall budget... anyway, I can write this, I have done these kinds of things before.

  4. This step is fairly straightforward: for each word we run the neural net, and for each position we test how likely its expected stats are under the distributions of each of that phone's k-means centers, picking the center that gives the best expected likelihood. The output of this is a lexicon.txt and a phones.txt.
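
To make step 4 concrete, here is a minimal sketch, assuming both the predicted stats and the cluster centers are stored as posterior-count vectors (the names here are mine, not snowfall's):

```python
import torch

def pick_center(expected_stats: torch.Tensor, centers: torch.Tensor) -> int:
    """expected_stats: (num_classes,) predicted posterior counts for one phone
    position; centers: (num_centers, num_classes) accumulated counts for that
    phone's clusters.  Returns the index of the center whose distribution gives
    the best expected log-likelihood sum_i e_i * log(q_i)."""
    q = centers / centers.sum(dim=1, keepdim=True)   # per-center distributions
    scores = expected_stats @ torch.log(q.clamp(min=1e-20)).t()
    return int(scores.argmax())
```

The chosen index would then become the postfix of that phone's name (e.g. "ae3") when writing out the new lexicon.txt.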

qindazhu commented 3 years ago

I would like to try to do this, though there are many details I still need to figure out for now.

danpovey commented 3 years ago

Great!

danpovey commented 3 years ago

For the algorithm in 3.: a fairly easy way to do the clustering in the presence of a global constraint on num-clusters would be to initialize each phone with k-means to a number of clusters greater than what we expect to end up with (based on some heuristic, e.g. allocate clusters to phones according to count^{0.33} and then double the number), and then progressively merge those clusters, always taking the merge that (globally) gives the least total objective-function change.

In terms of data structures: for each phone we could probably have a matrix of scores "objf-change-from-merging [a,b]", and when we merge a and b we could assign them to a, setting b's row/column to infinity and then recomputing a's row/column. We'd also keep track of (for each phone) the minimum value in the matrix; possibly doable with a heapq, or for reasonable configurations we can just do torch.min or something like that (no need to spend too much energy pre-optimizing things). We might have to write our own k-means code because this type of stats is a little different from the normal sum-squared-difference thing. The thing we're measuring is always "objf of this cluster after merging" vs. "sum of objf of both clusters, un-merged"... there isn't a concept of distance here as such.
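
Roughly, a sketch of that greedy merging loop using the objective described earlier in the thread (untested, the names are mine, and for simplicity it recomputes the whole loss matrix of the affected phone after each merge rather than only a's row/column):

```python
import torch

def objf(counts):
    """Per-row objective sum_i c_i * log(c_i / sum_j c_j); zero counts contribute zero."""
    p = counts / counts.sum(dim=-1, keepdim=True).clamp(min=1e-20)
    return (counts * p.clamp(min=1e-20).log()).sum(dim=-1)

def merge_loss(counts):
    """(K, K) matrix: objective lost by merging clusters a and b of one phone."""
    merged = counts.unsqueeze(0) + counts.unsqueeze(1)        # (K, K, C)
    loss = objf(counts).unsqueeze(0) + objf(counts).unsqueeze(1) - objf(merged)
    loss.fill_diagonal_(float("inf"))                         # no self-merges
    return loss

def greedy_merge(counts_per_phone, target_total):
    """counts_per_phone: list of (K_p, C) posterior-count matrices, one per phone.
    Repeatedly merges, across all phones, the cluster pair whose merge loses the
    least objective, until the total cluster count reaches target_total."""
    counts = [c.clone() for c in counts_per_phone]
    alive = [torch.ones(c.shape[0], dtype=torch.bool) for c in counts]
    losses = [merge_loss(c) for c in counts]
    total = sum(c.shape[0] for c in counts)
    while total > target_total:
        # Globally cheapest merge, over all phones.
        p = min(range(len(counts)), key=lambda i: float(losses[i].min()))
        a, b = divmod(int(losses[p].argmin()), losses[p].shape[1])
        counts[p][a] += counts[p][b]       # merge b into a
        alive[p][b] = False
        # Retire b and recompute this phone's loss matrix (a's entries changed).
        new = merge_loss(counts[p])
        new[~alive[p], :] = float("inf")
        new[:, ~alive[p]] = float("inf")
        losses[p] = new
        total -= 1
    return [c[m] for c, m in zip(counts, alive)]
```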

I'm thinking that for this code, most of it will be moderately special purpose python code, e.g. we can assume the lexicon.txt (or lexiconp.txt) format and so on. At this stage the main point is to get it working and not worry excessively about code style and so on.

nshmyrev commented 3 years ago

Hi.

I have always favored multi-phone clusters (syllables, BPE, etc.). Any objection to that, or is there a reason to prefer a single phone with context?

danpovey commented 3 years ago

We can investigate those types of things later, for sure.

francisr commented 3 years ago

What's the rationale behind the need to explicitly model the context? Can't we just let the neural net learn these things implicitly?

danpovey commented 3 years ago

The neural net doesn't "know" what the context is; there is implicitly an independence assumption between frames. So without having context in the phone set, the neural net is limited in what it can predict.

francisr commented 3 years ago

Oh, I was under the impression that LF-MMI doesn't assume independence between frames.

danpovey commented 3 years ago

In a sense it doesn't, but the number of symbols does limit the amount of information that the acoustic model can transfer to the graph search.

rezame commented 3 years ago

@danpovey I think if we change the phones and lexicon based on the training data (something like the greedy lexicon learning in Kaldi), the generalization of the model decreases, although results on the training and in-domain test sets may (perhaps) improve. How can we add new words to the lexicon? How can we produce phones for new words based on the new phones and lexicon? I think it would be better to change the neural network or the decision trees, etc., instead of changing the phones and lexicon based on the training set. [Some papers describe grapheme-based models as better than phone-based models, because a grapheme is trained on more varied pronunciations and so is robust to pronunciation differences, e.g. accents.] Best regards

danpovey commented 3 years ago

The idea was that the neural net allows it to generalize to unseen words. Unfortunately I tried this and didn't get any improvement from it, at least not with the settings I tried.

francisr commented 3 years ago

> In a sense it doesn't, but the number of symbols does limit the amount of information that the acoustic model can transfer to the graph search.

Interesting, I hadn't considered that angle.

One thing I've seen, though, is that even simple monophone TDNNs can end up learning complex pronunciations: I had an issue in my transcript processing, which didn't remove metadata such as [applause] and [laughter] but converted them to regular words.
The model ended up outputting the words "applause" and "laughter" on audio where people applaud or laugh.

I think in terms of accelerating the decoding, it's definitely interesting to make sure there's more load in the neural net. Though the recent literature seems to be moving towards end-to-end approaches such as RNN-T.
What's your opinion on how this fits with k2?

danpovey commented 3 years ago

It's possible to use k2 with things like RNN-T (or at least a similar flavor) as well, although pure RNN-T wouldn't see much benefit from k2. I'm thinking about using lattices from a 1st-pass decoding and rescoring the paths with models that have recurrence on the input, like RNN-T.
