transcript to phone sequence

jinserk commented 6 years ago

Hi, I am wondering how to convert the target transcript text to its corresponding phone sequence when a word has multiple lexicon entries, in order to train the AM. As I understand it, CTC cannot handle multiple pronunciations per word, but we clearly need them, e.g. for the present and past tense of "read". Of course we could manually check the target phone sequences of all utterances, but that is far too time-consuming. From a quick check, it seems that EESEN uses only the last of the multiple definitions. Is this correct? If so, is there a good alternative strategy for getting a correct context-independent phone sequence? Thank you!

jinserk commented 6 years ago

According to the following code in utils/prep_ctc_trans.py:

    # read the lexicon into a dictionary data structure
    fread = open(dict_file,'r')
    dict = {}
    for line in fread.readlines():
        line = line.replace('\n','')
        splits = line.split(' ')  # assume there are no multiple spaces
        word = splits[0]
        letters = ''
        for n in range(1, len(splits)):
            letters += splits[n] + ' '
        dict[word] = letters.strip()
    fread.close()

dict[word] is overwritten whenever the same word appears again, so only the last lexicon entry for a word remains for the transcript-to-phone conversion. Is this reasonable?
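
To make the overwrite behaviour concrete, here is a minimal sketch (not the EESEN code) of a loader that keeps every pronunciation in file order, so that the first or the last entry can be chosen explicitly. The function name `load_lexicon` and the three-line sample lexicon are assumptions made up for illustration.

```python
from collections import defaultdict


def load_lexicon(lines):
    """Map each word to the list of all its pronunciations, in file order."""
    lexicon = defaultdict(list)
    for line in lines:
        parts = line.split()  # tolerates tabs and repeated spaces
        if not parts:
            continue  # skip empty lines
        word, phones = parts[0], " ".join(parts[1:])
        if phones not in lexicon[word]:  # drop exact duplicate entries
            lexicon[word].append(phones)
    return dict(lexicon)


# Hypothetical lexicon with two entries for "read".
sample = ["read R IY D", "read R EH D", "book B UH K"]
lexicon = load_lexicon(sample)

# A plain dict assignment per line is equivalent to last_only below.
first_only = {word: prons[0] for word, prons in lexicon.items()}
last_only = {word: prons[-1] for word, prons in lexicon.items()}
```

With all variants kept, the choice between first and last entry becomes a visible decision rather than a side effect of dictionary assignment.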

riebling commented 6 years ago

Possibly not: what if the dictionary lists the most frequent pronunciation first? If the compromise is to choose only one pronunciation, it should be the most common one! So I guess it depends on the design of the dictionary.

jinserk commented 6 years ago

Thank you for replying, @riebling. Then, for CI-phone-based CTC training, do we have to assume that the dictionary is ordered by frequency? For example, the tedlium recipe in EESEN uses the cantab-TEDLIUM dictionary, which contains many words with duplicate definitions. Is it okay to assume that the multiple definitions in this dictionary are sorted by frequency?

fmetze commented 6 years ago

Yes, right now, we are only using one pronunciation, whichever one happens to be last (I believe).

This is clearly not the best thing we can do, but dictionary learning has never led to consistent improvements. Yes, for a few words, having the correct pronunciations in the dictionary helps, but CTC (with LSTMs in any case?) works surprisingly well with English character based dictionaries, which are extremely noisy.

If you want, it should be possible to “align” multiple transcriptions to the training data, and see which pronunciation variant is preferred for a given utterance. With the current Eesen code, this is not very elegant, but it should be possible. If it leads to improvements, one could think about implementing a more elegant scheme that aligns lattices rather than phone strings.
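
One way to read the "align and compare" suggestion is to score each candidate phone sequence with the CTC forward algorithm against the per-frame posteriors of an already trained model, and keep the best-scoring variant. The sketch below is a plain-Python illustration of that idea, not Eesen code; the label ids, the three hand-written frames and the function names are all assumptions.

```python
import math

NEG_INF = float("-inf")


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log values."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def ctc_log_likelihood(log_probs, labels, blank=0):
    """log P(labels | frames); log_probs[t][k] is the log posterior of label k at frame t."""
    # Interleave blanks: [blank, l1, blank, l2, ..., blank]
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    size = len(ext)
    alpha = [NEG_INF] * size
    alpha[0] = log_probs[0][ext[0]]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, len(log_probs)):
        new = [NEG_INF] * size
        for s in range(size):
            cands = [alpha[s]]
            if s >= 1:
                cands.append(alpha[s - 1])
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])
            new[s] = logsumexp(cands) + log_probs[t][ext[s]]
        alpha = new
    if size == 1:
        return alpha[0]
    return logsumexp([alpha[-1], alpha[-2]])


# Hypothetical label ids: 0=blank, 1=R, 2=IY, 3=EH, 4=D.
frames = [
    [0.10, 0.70, 0.05, 0.05, 0.10],  # frame favouring R
    [0.10, 0.05, 0.10, 0.70, 0.05],  # frame favouring EH
    [0.10, 0.05, 0.05, 0.10, 0.70],  # frame favouring D
]
log_probs = [[math.log(p) for p in frame] for frame in frames]
variants = {"R IY D": [1, 2, 4], "R EH D": [1, 3, 4]}
best = max(variants, key=lambda name: ctc_log_likelihood(log_probs, variants[name]))
print(best)  # R EH D
```

In a real setup the posteriors would come from a forward pass of the network over each training utterance, and the comparison would run over whole-utterance phone sequences built from every combination of word variants, which grows quickly; that cost is presumably part of why the lattice-based scheme is mentioned.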

fmetze commented 6 years ago

No, I don’t think so.

jinserk commented 6 years ago

Thank you @fmetze! I just wonder whether this is a constraint of the LSTM-CTC model, or whether a better solution exists that is simply not used. Does "English character based dictionaries" in your comment mean a char-based AM? Would a char-based AM be a better choice than a phone-based AM?

fmetze commented 6 years ago

Yeah, what I am saying is that BLSTM CTC models work remarkably well with a dictionary that consists simply of the English characters. A character-based AM is generally not better than a phone-based AM, but it works remarkably well and is much simpler to handle.

jinserk commented 6 years ago

I agree that a char-based AM is much simpler to train, but it requires retraining when we want to support new words. A phone-based AM, on the other hand, is relatively free of this issue, since we can handle additional words with G2P, the dictionary and the graphs. Of course this is somewhat removed from the end-to-end concept; however, retraining is not so simple in a real ASR application. Could you advise on this point for the choice of AM?

fmetze commented 6 years ago

Not sure I understand. The whole point of a character based network is that if you want to add a new word, you know what the representation will be. No need to retrain the network, and no need to derive a lexicon entry. Or are you thinking about a word-based model?

jinserk commented 6 years ago

Ah, I thought I had to retrain the AM whenever the dictionary changes. So you mean that the char-based AM has some intrinsic phone-to-grapheme conversion capability, so it can handle new words even if they are not in the training lexicon? If so, I misunderstood! Sorry for the confusion!

fmetze commented 6 years ago

Yes, this is exactly what is happening. The character AM is learning “typical” G2P rules.
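
For comparison with the phone case, a character-level target needs no lexicon lookup at all: the token sequence for any word, seen or unseen, follows directly from its spelling. A minimal sketch follows; the function name and the `<space>` word-boundary token are assumptions for illustration, and an actual recipe may use a different symbol set.

```python
def text_to_chars(text, space_token="<space>"):
    """Turn a transcript into a character token sequence with explicit word boundaries."""
    tokens = []
    for word in text.lower().split():
        if tokens:
            tokens.append(space_token)  # mark the boundary between words
        tokens.extend(word)  # one token per character
    return tokens


# A new word needs no dictionary entry: its target is just its spelling.
targets = text_to_chars("I read books")
```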

srvk / eesen

transcript to phone sequence #157