kensho-technologies / pyctcdecode

A fast and lightweight python-based CTC beam search decoder for speech recognition.
Apache License 2.0

Lack of spaces in output #55

Open iskaj opened 2 years ago

iskaj commented 2 years ago

I still have the problem, which was also mentioned a lot on the Hugging Face challenge Discord earlier, that pyctcdecode doesn't really like putting spaces in the transcription, e.g. `hetcontenenschip lagaangemeerd indehaze` while it should be `het contenen schip lag angemerd in de haze`.

The first transcription is without an LM, and the second one is with.

gkucsko commented 2 years ago

Strange, could you post code to reproduce? Thanks!

mpierrau commented 2 years ago

Hi, I'm having similar issues when trying to construct a Swedish LM model!

mpierrau commented 2 years ago

> I still have the problem, which was also mentioned a lot on the Hugging Face challenge Discord earlier, that pyctcdecode doesn't really like putting spaces in the transcription, e.g. `hetcontenenschip lagaangemeerd indehaze` while it should be `het contenen schip lag angemerd in de haze`.
>
> The first transcription is without an LM, and the second one is with.

Also, do you mean that the first transcription is with an LM (and lacking spaces), while the second is without an LM (with correct spacing)?

vidklopcic commented 2 years ago

Adding unigrams.txt did help with the problem but was far from resolving the issues for me. I fine-tuned alpha and beta for OpenSeq2Seq and got very good results (alpha: 1.5, beta: 0.7). The problem with OpenSeq2Seq is that we can't restore timestamps for the prediction.

I tried to recreate the result using this library with no success. The transcript is mildly better than greedy decoding but far from OpenSeq2Seq's output (and the alpha/beta values that give the best output are completely different from those for OpenSeq2Seq). It's worth mentioning that OpenSeq2Seq operates without unigrams and produces excellent results. Without unigrams, I get far worse output than with greedy decoding (as mentioned before, words are merged together without whitespace).

The point is, there seems to be some issue. Let me know if I can provide more info to help you find the root cause.

I believe #5 and #25 might be related.
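
For anyone trying to reproduce this kind of alpha/beta tuning directly in pyctcdecode, a minimal sketch; the label set, LM path, and parameter values below are placeholders, not a claim that the OpenSeq2Seq values transfer:

```python
from pyctcdecode import build_ctcdecoder

# Placeholder wav2vec2-style character vocabulary (index order must match the acoustic model).
labels = ["<pad>", "<s>", "</s>", "<unk>", "|"] + list("abcdefghijklmnopqrstuvwxyz'")

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="lm.bin",  # placeholder path to a word-level KenLM model
    alpha=1.5,                  # LM weight (placeholder value)
    beta=0.7,                   # word/length insertion bonus (placeholder value)
)
```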

gkucsko commented 2 years ago

Hey, happy to help look into it if you can find a way for me to reproduce the issue. Do you have a model and code snippet I can use?

minhnq97 commented 2 years ago

I have the same problem with a Vietnamese w2v2 ASR model and a 3-gram LM. I solved it by passing a known word list (unigrams) and pruning the beam results during decoding so the output no longer includes stuck-together words. Should I make a PR so you guys can take a look? @gkucsko
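
The beam pruning described above would need a code change, but passing a known-word list is already supported via the `unigrams` argument; a minimal sketch with placeholder label set and file names:

```python
from pyctcdecode import build_ctcdecoder

# Placeholder character vocabulary (a real Vietnamese model would have its own token set).
labels = ["<pad>", "<s>", "</s>", "<unk>", "|"] + list("abcdefghijklmnopqrstuvwxyz'")

# Placeholder word list: one known word per line.
with open("unigrams.txt", encoding="utf-8") as f:
    unigrams = [line.strip() for line in f if line.strip()]

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="lm.bin",  # placeholder path to the 3-gram KenLM model
    unigrams=unigrams,
)
```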

gkucsko commented 2 years ago

Sorry, @lopez86 might be able to help. I no longer have maintainer privileges since I departed Kensho; I apologize.

mpierrau commented 2 years ago

Hi, after some experimentation my colleague discovered that the lack of spaces is significantly alleviated by lowering the parameter token_min_logp (the default is -5.0, which corresponds to e^(-5) ≈ 0.0067) in the decode or decode_beams call. Perhaps it helps you too, @iskaj or @minhnq97. From what I understand, this parameter removes from the logit matrix any token whose log-probability falls below the threshold (unless it is the argmax), so that it (the blank token in this case) is not even an option for the beam search. Perhaps @gkucsko can confirm or correct this intuition. Thanks!
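
A minimal sketch of this suggestion; the label set and LM path are placeholders, and the random log-probabilities are only there so the snippet runs end to end:

```python
import numpy as np
from pyctcdecode import build_ctcdecoder

# Placeholder wav2vec2-style character vocabulary and LM path.
labels = ["<pad>", "<s>", "</s>", "<unk>", "|"] + list("abcdefghijklmnopqrstuvwxyz'")
decoder = build_ctcdecoder(labels, kenlm_model_path="lm.bin")

# Fake (time, vocab) log-probabilities standing in for the acoustic model output.
rng = np.random.default_rng(0)
log_probs = np.log(rng.dirichlet(np.ones(len(labels)), size=80)).astype(np.float32)

# Default token_min_logp is -5.0; lowering it keeps more low-probability tokens
# (including the word delimiter) as candidates during the beam search.
text = decoder.decode(log_probs, token_min_logp=-10.0)
print(text)
```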

gkucsko commented 2 years ago

Yup, that sounds right. The parameter can be used as a trade-off between speed and thoroughness of exploring the solution space. The default worked decently on a bunch of examples, but it's very possible that it needs adjusting for your data.

xro7 commented 2 years ago

So I tried a grid search on my custom dataset over almost every parameter, including token_min_logp as @mpierrau and @gkucsko suggested, but in every setting the result is no better (in terms of WER) than simple greedy decoding. On the other hand, when I use BeamSearchDecoderWithLM with the same n-gram LM, the results are significantly better. Am I missing something?
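
For reference, a sketch of what such a grid search could look like; it uses jiwer for WER and assumes dev-set samples as (log_probs, reference) pairs, and it is not the exact script used above:

```python
import itertools

import jiwer
from pyctcdecode import build_ctcdecoder


def grid_search(labels, lm_path, samples):
    """samples: list of (log_probs, reference_text) pairs from a dev set."""
    best = None
    for alpha, beta, min_logp in itertools.product(
        [0.5, 1.0, 1.5, 2.0],   # LM weight candidates
        [0.0, 0.5, 1.0, 1.5],   # word bonus candidates
        [-5.0, -10.0, -20.0],   # token_min_logp candidates
    ):
        decoder = build_ctcdecoder(labels, kenlm_model_path=lm_path, alpha=alpha, beta=beta)
        hyps = [decoder.decode(lp, token_min_logp=min_logp) for lp, _ in samples]
        refs = [ref for _, ref in samples]
        wer = jiwer.wer(refs, hyps)
        if best is None or wer < best[0]:
            best = (wer, alpha, beta, min_logp)
    return best  # (best_wer, alpha, beta, token_min_logp)
```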

xro7 commented 2 years ago

Based on #50, it turns out that KenLM language models produced by the NeMo scripts are not compatible with pyctcdecode. The reason is that pyctcdecode works with word-level language models, but NeMo works at the char or BPE level. So to use pyctcdecode, you need to train KenLM models on word-level (whitespace-tokenized) text.
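
A sketch of preparing such a word-level corpus; the normalization is a placeholder (match it to whatever the acoustic model's vocabulary expects), and the usual KenLM commands are noted in comments:

```python
import re


def normalize(line: str) -> str:
    """Placeholder cleanup for an English-style character vocabulary."""
    line = line.lower()
    line = re.sub(r"[^a-z' ]", " ", line)
    return " ".join(line.split())


# Write one whitespace-tokenized, word-level sentence per line.
with open("raw_text.txt", encoding="utf-8") as fin, \
        open("corpus_word_level.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        norm = normalize(line)
        if norm:
            fout.write(norm + "\n")

# Then train a word-level KenLM model (KenLM binaries assumed to be on PATH):
#   lmplz -o 3 < corpus_word_level.txt > lm.arpa
#   build_binary lm.arpa lm.bin
```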

thangld201 commented 2 years ago

> I have the same problem with a Vietnamese w2v2 ASR model and a 3-gram LM. I solved it by passing a known word list (unigrams) and pruning the beam results during decoding so the output no longer includes stuck-together words. Should I make a PR so you guys can take a look? @gkucsko

@minhnq97 Could you include some code snippets showing how you solved the problem? I'm facing the same issue with Vietnamese w2v2 when using an LM, especially when the text is a bit long.

taylorchu commented 2 years ago

I have a similar issue: this library does not always use the longer n-gram in decoding.

Example: `Now you feel free to jockdown a quick note, but know that you definitely don't have to`

The language model has multiple n-gram entries containing `jot down`:

-0.849238   JOT DOWN    -0.3045721

-1.7628679  JOT DOWN </s>   0
-1.8796688  JOT DOWN AND    0
-0.9790521  JOT DOWN THE    0
-1.7186692  JOT DOWN ON 0
-1.1179233  JOT DOWN A  -0.18426295
-1.5213368  JOT DOWN WHAT   -0.1246664
-1.9819487  JOT DOWN AN 0
-1.820359   JOT DOWN IN 0
-0.45964497 <s> JOT DOWN    0
-0.18280518 AND JOT DOWN    -0.2909406
-0.26692805 YOU JOT DOWN    0
-0.39027536 CAN JOT DOWN    0
-0.21837124 TO JOT DOWN -0.2846523
-0.2526344  I JOT DOWN  0
-0.28782836 COULD JOT DOWN  0
-0.27650002 JUST JOT DOWN   0
-0.28586072 WOULD JOT DOWN  0
-0.2257118  OR JOT DOWN 0
-0.105598606    THEN JOT DOWN   0
-0.1537214  QUICKLY JOT DOWN    0
-1.6262659  JOT DOWN ANY    0
-1.935197   JOT DOWN MY 0
-2.0036335  JOT DOWN AS 0
-1.993876   JOT DOWN WHATEVER   0
-1.4112498  JOT DOWN SOME   -0.13173665
-1.6525354  JOT DOWN ALL    -0.15345488
-1.811944   JOT DOWN THEIR  0
-1.314982   JOT DOWN YOUR   -0.034303058
-2.030131   JOT DOWN THINGS 0
-1.6722193  JOT DOWN IDEAS  0
-1.9152358  JOT DOWN EVERYTHING 0
-2.0745046  JOT DOWN HIS    0
-2.1090858  JOT DOWN ANYTHING   0
-1.4251913  JOT DOWN NOTES  -0.12113714
-0.869065   JOT DOWN NOTES AND  0
-0.5701695  JOT DOWN WHAT YOU   0
-0.8723011  JOT DOWN SOME OF    0
-0.63709545 AND JOT DOWN THE    0
-0.87691945 TO JOT DOWN THE 0
-0.42619425 JOT DOWN ALL THE    0
-0.9393999  JOT DOWN NOTES ON   0
-0.8654265  AND JOT DOWN A  0
-0.8437338  TO JOT DOWN A   -0.13133523
-1.0707138  JOT DOWN NOTES ABOUT    0
-1.4106797  TO JOT DOWN WHAT    0
-1.180675   JOT DOWN A LIST -0.9640336
-0.13003723 YOU CAN JOT DOWN    0
-0.05869481 YOU TO JOT DOWN 0
-0.11509626 WANTED TO JOT DOWN  -0.66709435
-0.07675678 TIME TO JOT DOWN    0
-0.10920388 NEED TO JOT DOWN    0
-0.128804   WANT TO JOT DOWN    0
-0.0986982  ABLE TO JOT DOWN    0
-0.071095005    NOTEBOOK TO JOT DOWN    0
-0.025673026    TO QUICKLY JOT DOWN 0
-1.4761158  TO JOT DOWN ANY 0
-1.4763142  TO JOT DOWN MY  0
-1.0507182  AND JOT DOWN SOME   0
-1.1756775  TO JOT DOWN SOME    0
-1.4996119  TO JOT DOWN ALL 0
-1.4513228  TO JOT DOWN THEIR   0
-1.0509561  AND JOT DOWN YOUR   0
-1.2234412  TO JOT DOWN YOUR    0
-0.6993264  JOT DOWN A FEW  -0.12375361
-1.30697    TO JOT DOWN IDEAS   0
-1.3485825  JOT DOWN A NOTE 0

and there are no entries for jockdown, jock own, or jock down.

This is what I use to init the decoder:

```python
ctc_decoder = build_ctcdecoder(
    labels=['<pad>', '<s>', '</s>', '<unk>', '|', 'E', 'T', 'A', 'O', 'N', 'I', 'H', 'S', 'R', 'D', 'L', 'U', 'M', 'W', 'C', 'F', 'G', 'Y', 'P', 'B', 'V', 'K', "'", 'X', 'J', 'Q', 'Z'],
    kenlm_model_path="/kenlm.bin",
)
```

taylorchu commented 2 years ago

So it seems that if the file extension is .bin, we need to pass in unigrams, but that actually makes the output skip more spaces.

I think there might be a bug in selecting n-grams where the space is not taken into account. For example, if the first word in the LM is jot and the second is down, we need to look up `p(jot down)`, where the space is required, instead of `p(jot) + p(down)`.
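
One way to sanity-check what the LM itself assigns (using the kenlm Python bindings; the file name is a placeholder) is to compare the joint bigram score against the sum of the individual word scores:

```python
import kenlm

lm = kenlm.Model("lm.bin")  # placeholder path to the same KenLM model

# Log10 scores, without adding sentence-boundary tokens.
joint = lm.score("JOT DOWN", bos=False, eos=False)
separate = lm.score("JOT", bos=False, eos=False) + lm.score("DOWN", bos=False, eos=False)

# If the decoder effectively scored the words independently, it would see something
# like `separate` rather than the joint score, which is typically much better for
# a common bigram like this one.
print(joint, separate)
```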

changsha2999 commented 8 months ago

```python
                    new_beams.append(
                        Beam(
                            beam.text,
                            beam.next_word,
                            beam.partial_word + " " + char,  # modified line: space added here
                            char,
                            beam.text_frames,
                            new_part_frames,
                            beam.logit_score + p_char,
                        )
                    )
```

At line 528 of decoder.py, I added a space: `beam.partial_word + " " + char`, and then it works. I don't know whether this causes other problems.