Open iskaj opened 2 years ago
strange, could you post code to reproduce? Thanks!
Hi, I'm having similar issues when trying to construct a Swedish LM model!
I still have the problem, which was also mentioned a lot on the Hugging Face challenge Discord earlier, that pyctcdecode doesn't really like putting spaces in the transcription, e.g.:

`hetcontenenschip lagaangemeerd indehaze`

while it should be:

`het contenen schip lag angemerd in de haze`

The first transcription is without an LM, and the second one is with.
Also, do you mean that the first transcription is with an LM (and lacking spaces), while the second is without an LM (with correct spacing)?
Adding `unigrams.txt` did help, but was far from resolving the issue for me. I fine-tuned alpha and beta for OpenSeq2Seq and got very good results (alpha: 1.5, beta: 0.7). The problem with OpenSeq2Seq is that we can't restore timestamps for the prediction.
I tried to recreate the result using this library with no success. The transcript is mildly better than greedy decoding, but far from OpenSeq2Seq's output (and the alpha/beta values that give the best output are completely different from those for OpenSeq2Seq). It's worth mentioning that OpenSeq2Seq operates without unigrams and produces excellent results. Without unigrams, I get far worse output than with greedy decoding (as mentioned before, words are merged together without whitespace).
The point is, there seems to be some issue. Let me know if I can provide more info to help you find the root cause.
I believe that #5 #25 might be related.
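For intuition about what alpha and beta do, here is a toy sketch of how they typically enter a shallow-fusion beam score. This is not pyctcdecode's or OpenSeq2Seq's actual code, conventions differ between libraries, and `fused_score` and the numbers are invented for illustration:

```python
# Toy sketch of shallow-fusion scoring in CTC beam search with an n-gram LM:
# alpha scales the LM log-probability, beta adds a per-word insertion bonus.
def fused_score(acoustic_logp, lm_logp, n_words, alpha=1.5, beta=0.7):
    return acoustic_logp + alpha * lm_logp + beta * n_words

# With the OpenSeq2Seq-style values mentioned above (alpha=1.5, beta=0.7):
print(fused_score(-12.0, -4.0, 5))  # -12.0 + 1.5*(-4.0) + 0.7*5
```

A higher beta makes the decoder prefer hypotheses with more words, which is one reason tuning it can influence whether spaces survive.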
Hey, happy to help look into it if you can find a way to let me reproduce the issue. Do you have a model and code snippet I can use?
I have the same problem with a Vietnamese w2v2 ASR model and a 3-gram LM. I solved it by passing a known word list (unigrams) and pruning the beam results during decoding, so the output no longer includes stuck-together words. Should I make a PR so you can take a look? @gkucsko
Sorry, @lopez86 might be able to help. I no longer have maintainer privileges since I departed Kensho, I apologize.
Hi, after some experimentation my colleague discovered that the lack of spaces is significantly alleviated by lowering the parameter `token_min_logp` (the default is -5.0, which corresponds to a probability of e^(-5) ≈ 0.0067) in the `decode` or `decode_beams` call. Perhaps it helps you too, @iskaj or @minhnq97. From what I understand, this parameter removes from the logit matrix any token whose log-probability is below the threshold (unless it is the argmax), so that it (the blank token in this case) is not even an option for the beam search part. Perhaps @gkucsko can confirm or correct this intuition. Thanks!
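To make the effect concrete, here is a minimal sketch of that pruning behavior as described above. `candidate_tokens` is a hypothetical helper for illustration, not pyctcdecode's actual implementation:

```python
# Sketch: drop every token whose log-probability at a frame is below
# token_min_logp, unless it is the frame's argmax (the argmax is always kept).
def candidate_tokens(frame_logps, token_min_logp=-5.0):
    best = max(range(len(frame_logps)), key=lambda i: frame_logps[i])
    return [i for i, lp in enumerate(frame_logps)
            if lp >= token_min_logp or i == best]

# One frame where the space token (index 2) has log-prob -6.2:
frame = [-0.1, -4.0, -6.2]
print(candidate_tokens(frame))                       # default -5.0: space is pruned
print(candidate_tokens(frame, token_min_logp=-7.0))  # lowered: space survives
```

With the default threshold the space token never even enters the beam search at that frame, which matches the merged-words symptom.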
Yup, that sounds right. The parameter can be used as a trade-off between speed and thoroughness of exploring the solution space. The default worked decently on a bunch of examples, but it's very possible that it needs adjusting for your data.
So I tried a grid search on my custom dataset with almost every parameter, including `token_min_logp`, as @mpierrau and @gkucsko suggested. But in every setting the result is no better (in terms of WER) than simple greedy decoding. On the other hand, when I use BeamSearchDecoderWithLM with the same n-gram LM, the results are significantly better. Am I missing something?
Based on #50, it turns out that KenLM language models produced by NeMo scripts are not compatible with pyctcdecode. The reason is that pyctcdecode works with word-level language models, but NeMo works at the char or BPE level. So to use pyctcdecode you need to train KenLM models with word-level tokenization.
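To illustrate the difference, here is a hypothetical sketch of how the two kinds of training corpus for `lmplz` would look. This is not NeMo's or KenLM's actual preprocessing, and the `|` space symbol is an assumed convention:

```python
# A word-level LM corpus feeds lmplz whitespace-separated words;
# a char-level corpus feeds it individual characters, with some symbol
# (here '|', an assumption) standing in for the space.
def word_level_line(text):
    return " ".join(text.split())

def char_level_line(text):
    return " ".join(text.replace(" ", "|"))

s = "jot down a note"
print(word_level_line(s))  # jot down a note
print(char_level_line(s))  # j o t | d o w n | a | n o t e
```

pyctcdecode's scorer expects n-grams over whole words (the first form); a model trained on the second form scores word hypotheses meaninglessly.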
> I do have the same problem with Vietnamese w2v2 asr model and a 3-gram LM model. I solved the problem by passing a known word list (unigram) and prune the beam result during the decode process so the result will not include sticky words anymore. Should I make a PR so you guys can take a look ? @gkucsko
@minhnq97 Could you include some code snippets showing how you solved the problem? I'm facing the same issue with Vietnamese w2v2 when using an LM, especially when the text is a bit long.
I have a similar issue where this library does not always use the longer n-gram in decoding.
Example: `Now you feel free to jockdown a quick note, but know that you definitely don't have to`
The language model has multiple counts of `jot down`:

```
-0.849238 JOT DOWN -0.3045721
-1.7628679 JOT DOWN </s> 0
-1.8796688 JOT DOWN AND 0
-0.9790521 JOT DOWN THE 0
-1.7186692 JOT DOWN ON 0
-1.1179233 JOT DOWN A -0.18426295
-1.5213368 JOT DOWN WHAT -0.1246664
-1.9819487 JOT DOWN AN 0
-1.820359 JOT DOWN IN 0
-0.45964497 <s> JOT DOWN 0
-0.18280518 AND JOT DOWN -0.2909406
-0.26692805 YOU JOT DOWN 0
-0.39027536 CAN JOT DOWN 0
-0.21837124 TO JOT DOWN -0.2846523
-0.2526344 I JOT DOWN 0
-0.28782836 COULD JOT DOWN 0
-0.27650002 JUST JOT DOWN 0
-0.28586072 WOULD JOT DOWN 0
-0.2257118 OR JOT DOWN 0
-0.105598606 THEN JOT DOWN 0
-0.1537214 QUICKLY JOT DOWN 0
-1.6262659 JOT DOWN ANY 0
-1.935197 JOT DOWN MY 0
-2.0036335 JOT DOWN AS 0
-1.993876 JOT DOWN WHATEVER 0
-1.4112498 JOT DOWN SOME -0.13173665
-1.6525354 JOT DOWN ALL -0.15345488
-1.811944 JOT DOWN THEIR 0
-1.314982 JOT DOWN YOUR -0.034303058
-2.030131 JOT DOWN THINGS 0
-1.6722193 JOT DOWN IDEAS 0
-1.9152358 JOT DOWN EVERYTHING 0
-2.0745046 JOT DOWN HIS 0
-2.1090858 JOT DOWN ANYTHING 0
-1.4251913 JOT DOWN NOTES -0.12113714
-0.869065 JOT DOWN NOTES AND 0
-0.5701695 JOT DOWN WHAT YOU 0
-0.8723011 JOT DOWN SOME OF 0
-0.63709545 AND JOT DOWN THE 0
-0.87691945 TO JOT DOWN THE 0
-0.42619425 JOT DOWN ALL THE 0
-0.9393999 JOT DOWN NOTES ON 0
-0.8654265 AND JOT DOWN A 0
-0.8437338 TO JOT DOWN A -0.13133523
-1.0707138 JOT DOWN NOTES ABOUT 0
-1.4106797 TO JOT DOWN WHAT 0
-1.180675 JOT DOWN A LIST -0.9640336
-0.13003723 YOU CAN JOT DOWN 0
-0.05869481 YOU TO JOT DOWN 0
-0.11509626 WANTED TO JOT DOWN -0.66709435
-0.07675678 TIME TO JOT DOWN 0
-0.10920388 NEED TO JOT DOWN 0
-0.128804 WANT TO JOT DOWN 0
-0.0986982 ABLE TO JOT DOWN 0
-0.071095005 NOTEBOOK TO JOT DOWN 0
-0.025673026 TO QUICKLY JOT DOWN 0
-1.4761158 TO JOT DOWN ANY 0
-1.4763142 TO JOT DOWN MY 0
-1.0507182 AND JOT DOWN SOME 0
-1.1756775 TO JOT DOWN SOME 0
-1.4996119 TO JOT DOWN ALL 0
-1.4513228 TO JOT DOWN THEIR 0
-1.0509561 AND JOT DOWN YOUR 0
-1.2234412 TO JOT DOWN YOUR 0
-0.6993264 JOT DOWN A FEW -0.12375361
-1.30697 TO JOT DOWN IDEAS 0
-1.3485825 JOT DOWN A NOTE 0
```
and there is no count of `jockdown` or `jock own` or `jock down`.
This is what I have to init the decoder:

```python
ctc_decoder = build_ctcdecoder(
    labels=['<pad>', '<s>', '</s>', '<unk>', '|', 'E', 'T', 'A', 'O', 'N', 'I', 'H', 'S', 'R', 'D', 'L', 'U', 'M', 'W', 'C', 'F', 'G', 'Y', 'P', 'B', 'V', 'K', "'", 'X', 'J', 'Q', 'Z'],
    kenlm_model_path="/kenlm.bin",
)
```
So it seems like if the file extension is `.bin` we need to pass in `unigrams`, but that actually makes the output skip more spaces.
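One way to obtain a unigram list, assuming the original ARPA file is still available, is to read its `\1-grams:` section. `read_unigrams` below is a hypothetical helper for illustration; pyctcdecode does not ship it:

```python
# Parse the vocabulary out of an ARPA file's \1-grams: section.
# Each data line there is: log10_prob <TAB> token [<TAB> backoff].
def read_unigrams(arpa_lines):
    unigrams, in_1grams = [], False
    for line in arpa_lines:
        line = line.rstrip("\n")
        if line.strip() == "\\1-grams:":
            in_1grams = True
            continue
        if in_1grams:
            if not line.strip() or line.startswith("\\"):
                break  # end of the 1-grams section
            parts = line.split("\t")
            if len(parts) >= 2:
                unigrams.append(parts[1])
    return unigrams

arpa = ["\\data\\", "ngram 1=3", "", "\\1-grams:",
        "-1.0\tJOT\t-0.3", "-1.2\tDOWN\t-0.2", "-2.0\t<unk>",
        "", "\\2-grams:"]
print(read_unigrams(arpa))  # ['JOT', 'DOWN', '<unk>']
```

The resulting list could then be passed as `unigrams=` to `build_ctcdecoder` alongside the `.bin` model.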
I think there might be a bug in selecting n-grams where the space is not taken into account. For example, if the first word in the LM is `jot` and the second is `down`, we need to look up `p(jot down)`, where the space is required, instead of `p(jot) + p(down)`.
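A toy illustration of why that lookup matters (the unigram numbers below are invented; the bigram value is the one from the ARPA listing earlier in the thread):

```python
# Scoring "JOT DOWN" via the LM's joint bigram entry versus (incorrectly)
# treating the two words as independent unigrams. All values are log10 probs.
log10_unigram = {"JOT": -3.2, "DOWN": -2.9}   # invented for illustration
log10_bigram = {("JOT", "DOWN"): -0.849238}   # from the ARPA dump above

independent = log10_unigram["JOT"] + log10_unigram["DOWN"]
joint = log10_unigram["JOT"] + log10_bigram[("JOT", "DOWN")]

print(independent)  # about -6.1
print(joint)        # about -4.05: the frequent collocation is rewarded
```

If the decoder never forms the spaced key `("JOT", "DOWN")`, it can only fall back to the much weaker independent scores, so well-attested multi-word n-grams go unused.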
```python
new_beams.append(
    Beam(
        beam.text,
        beam.next_word,
        beam.partial_word + " " + char,  # modified line: space inserted here
        char,
        beam.text_frames,
        new_part_frames,
        beam.logit_score + p_char,
    )
)
```
At line 528 of decoder.py, I added the space in `beam.partial_word + " " + char`, and then it works. I don't know whether there are other problems.