kensho-technologies / pyctcdecode

A fast and lightweight python-based CTC beam search decoder for speech recognition.
Apache License 2.0

BPE vocabulary alternative format #22

Closed loquela-dev closed 2 years ago

loquela-dev commented 2 years ago

Hi, first of all, thanks for this great work; I did not expect Python to be this fast for such tasks. I am trying to use the decoder with the logits of a BPE vocabulary, but my BPE notation is different from yours. Example: "I am hap py to be he re". The implementation you provide seems to handle notations with leading-space subwords; mine is the inverse, marking subwords with a trailing space. I tried modifying the code to make it work, but I have had no success so far.

Is this something you could consider adding as a feature? If not, could you please help me make it work (which parts of the code should be modified to make this possible, mainly in the decoder.py file)? Thanks in advance for your help.
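To make the difference concrete, a small illustration of the same sentence under the two conventions (the token strings are invented for this example, not taken from either vocabulary):

```python
# Invented token strings, just to contrast the two subword conventions.

# pyctcdecode / sentencepiece style: word-initial pieces carry a leading "▁".
prefix_style = ["▁I", "▁am", "▁hap", "py", "▁to", "▁be", "▁he", "re"]

# Suffix style (this issue): word-final pieces carry a trailing "</w>".
suffix_style = ["I</w>", "am</w>", "hap", "py</w>", "to</w>", "be</w>", "he", "re</w>"]

# Both detokenize to the same sentence:
assert "".join(prefix_style).replace("▁", " ").strip() == "I am happy to be here"
assert "".join(suffix_style).replace("</w>", " ").strip() == "I am happy to be here"
```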

poneill commented 2 years ago

Hi, thanks for this. Just to make sure I understand you: if you preprocessed your BPE alphabet to agree with our notation, would that get you to a working solution? It might help to see an MWE, along with a reference to where your byte pairs came from. We'd like to support as many BPE conventions as possible, but there are a lot of them.

loquela-dev commented 2 years ago

Thanks for your answer. Indeed, changing the output tokens of my model would solve it, but that would require retraining the system, which gets very good results with this notation (trained on thousands of hours of data), and I won't have access to a GPU machine for a while :/ I used the BPE from the Hugging Face tokenizers library. It's the same notation as in this article: https://leimao.github.io/blog/Byte-Pair-Encoding/ , e.g. ["the</w>", "high", "est</w>", "moun", "tain</w>"]: the suffix is appended to each subword that ends a word (i.e. precedes a space). My BPE uses "</w>" as the suffix. I have included my vocabulary file, which is composed of French tokens. train-fr-bpe-1000-vocab.txt
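For context, a sketch of how such a suffix-style vocabulary is typically trained with the Hugging Face tokenizers library (the corpus path and sizes are placeholders, not the actual setup from this thread):

```python
# Sketch only: training a suffix-style BPE vocabulary with the Hugging Face
# `tokenizers` library. The corpus file name below is hypothetical.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="<unk>", end_of_word_suffix="</w>"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=1000, special_tokens=["<unk>"],
                     end_of_word_suffix="</w>")
tokenizer.train(files=["train-fr.txt"], trainer=trainer)  # hypothetical corpus

# Word-final pieces now carry the suffix, e.g. ["the</w>", "high", "est</w>", ...]
print(tokenizer.encode("the highest mountain").tokens)
```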

poneill commented 2 years ago

Sorry, yes, I see now. Naturally, you wouldn't want to retrain the acoustic model. And it's not, in general, possible to convert the BPE alphabet from huggingface-style into the format pyctcdecode expects without making statistical guesses as to whether a given HF-style token should be construed as word-initial or not. And it's not totally clear that you'd get good performance on logit matrices produced by such a model, even if you did make reasonable guesses about translating the stop tokens into start tokens. (In your vocab, ~1% of tokens differ only in whether they're word-final or not, so you'd have at least that much irreducible error.)
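For what it's worth, here is a sketch (invented function names) of why the conversion is exact at the level of a decoded token sequence, where the boundary marker just moves from the end of one token to the start of the next, while the vocabulary-level mapping is what forces guesses:

```python
# Converting a decoded *token sequence* between conventions is lossless.
def suffix_to_prefix(tokens, suffix="</w>", prefix="▁"):
    out = []
    word_initial = True  # the first token always starts a word
    for tok in tokens:
        ends_word = tok.endswith(suffix)
        core = tok[: -len(suffix)] if ends_word else tok
        out.append(prefix + core if word_initial else core)
        word_initial = ends_word  # next token starts a word iff this one ended one
    return out

print(suffix_to_prefix(["the</w>", "high", "est</w>", "moun", "tain</w>"]))
# -> ['▁the', '▁high', 'est', '▁moun', '▁tain']

# The irreducible ambiguity mentioned above: tokens whose suffixed and
# unsuffixed forms both occur in the vocabulary.
def count_collisions(vocab, suffix="</w>"):
    bare = {t[: -len(suffix)] for t in vocab if t.endswith(suffix)}
    return sum(1 for t in vocab if not t.endswith(suffix) and t in bare)
```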

So this would need to be addressed inside the decoder. https://github.com/kensho-technologies/pyctcdecode/blob/50fd623fdcfd526778a95dfd3ebf5d6603504457/pyctcdecode/decoder.py#L365 is probably a reasonable place to start.

loquela-dev commented 2 years ago

Thanks for your help. I have already tried, but no success so far: I always end up with spaces where I shouldn't have them in the final result, or missing spaces where I should. Is there a way to simply add a space after each token that carries the BPE marker? Any ideas are welcome.
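For plain greedy decoding, that rule does work; a minimal sketch, assuming logits is a NumPy array of shape (time, vocab):

```python
# Minimal greedy CTC decode for a suffix-style vocabulary: collapse repeats,
# drop blanks, then turn each "</w>" into a space.
def greedy_decode(logits, labels, blank_id, suffix="</w>"):
    best = logits.argmax(axis=1)          # best label id per time frame
    tokens, prev = [], None
    for i in best:
        if i != prev and i != blank_id:   # CTC collapse: repeats, then blanks
            tokens.append(labels[i])
        prev = i
    return "".join(tokens).replace(suffix, " ").strip()
```

Inside the beam search, though, the word boundary is also where the language model gets applied, which is presumably why the change is less local there.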

poneill commented 2 years ago

Would it be possible to share a sample logit matrix for the sake of experimentation?

gkucsko commented 2 years ago

Hmm, yeah, we didn't really plan for that; however, most of the functionality should actually already be handled via force_next_break. A logit matrix would probably help, yeah. Also, how do you instantiate the alphabet?
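For reference, alphabet instantiation normally goes through build_ctcdecoder; a sketch with placeholder labels:

```python
# The usual pyctcdecode entry point; the labels list below is a placeholder.
# pyctcdecode detects a BPE-style alphabet when the labels carry the "▁" marker.
from pyctcdecode import build_ctcdecoder

labels = ["▁the", "▁high", "est", "▁moun", "▁tain", ""]  # "" = CTC blank
decoder = build_ctcdecoder(labels)        # optionally kenlm_model_path=...
# text = decoder.decode(logits)           # logits: (time, vocab) array
```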

devSPEECH commented 2 years ago

Hi, sorry, I have been locked out of my old account temporarily. I have attached the output logit matrix of an English model that follows the same convention, for ease of testing (the other model was in French). The test audio is from LibriSpeech, and the alphabet used is included in the attachments as well. The correct output is: "before he had time to answer a much encumbered vera burst into the room with the question i say can i leave these here these were a small black pig and a lusty specimen of black red gamecock". Greedy decoding by the model in question gives the following output: "before he had time to answer a much encumbered verera burst into the room with the question i say can i give these here these were a small black pig and a lusty specimen of black red game cock"

As for the alphabet, I replace the "</w>" suffix with the notation the decoder uses ("▁") and replace the unknown symbol with "▁⁇▁". An empty label ("") is added at the end for the blank symbol; the index of the blank symbol is 1000 in this setting. audio_logits_alphabet.zip Thanks a lot for your help.
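Here is a sketch of that preprocessing (the file handling and token names are assumptions, not taken from the attachment):

```python
# Sketch of the alphabet preprocessing described above. Note that a plain
# suffix replacement leaves the "▁" marker at the *end* of each token,
# i.e. still the suffix convention rather than pyctcdecode's word-initial one.
def load_labels(vocab_path, unk_token="<unk>"):
    with open(vocab_path, encoding="utf-8") as f:
        labels = [line.strip() for line in f]
    labels = ["▁⁇▁" if tok == unk_token else tok.replace("</w>", "▁")
              for tok in labels]
    labels.append("")  # empty label for the CTC blank, here at index 1000
    return labels
```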

gkucsko commented 2 years ago

Hi, did you make any progress in adjusting the code? I haven't had time to look yet, sorry. I think the modifications would not be hard; however, it's probably too specific to your application to include it in the general code base. If other people have similar needs, though, I'd be happy to look into finding a more proper solution.

devSPEECH commented 2 years ago

I changed my acoustic model to use your BPE format, and it works well now. I did not have any luck changing the code to handle the other format.

gkucsko commented 2 years ago

great to hear!