Closed: mohamad-hasan-sohan-ajini closed this issue 2 years ago
@hajix as a starting point, can you try re-running decoding with a ctc_decoder created with log_add=True? If you're smearing with max (the default strategy), you'll choose the most likely path, in which case the results you're seeing are correct.
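For reference, a minimal sketch of how that flag is passed. The token set and the blank/space characters below are placeholders, not the ones from the original script:

```python
from torchaudio.models.decoder import ctc_decoder

# Placeholder token set: "_" stands in for the custom blank, "|" for the space/sil token.
tokens = ["_", "|", "a"]

decoder = ctc_decoder(
    lexicon=None,   # lexicon-free decoding
    tokens=tokens,
    blank_token="_",
    sil_token="|",
    log_add=True,   # merge collapsing paths by log-adding instead of the default max smearing
)
```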
It leads to the correct answer! Now the decoded string is a:
[[CTCHypothesis(tokens=tensor([1, 2, 1]), words=[], score=2.214183238519948, timesteps=tensor([0, 2, 3], dtype=torch.int32))]]
Is there any advice about the preds? They are character probabilities at the moment; let me know if log probabilities or raw scores would lead to a more stable decoding process. Thanks @jacobkahn
Glad that worked. In your case, using log_add=True actually changed the smearing strategy: when merging hypotheses that collapse to the same label, the probabilities along their respective paths are log-added. The default strategy is max smearing, which keeps only the maximum-probability path. Based on your predicted distributions, _ is the most probable token, so the path containing it twice is the most probable single path; and since _ is the blank, it is collapsed away and should never appear in any predicted output.
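To make the difference concrete, here is a small worked example using the toy distribution from the bug report below (two timesteps, blank probability 0.6, a probability 0.4). It only redoes the arithmetic by hand and does not call the decoder:

```python
# Two timesteps; at each one, P(blank "_") = 0.6 and P("a") = 0.4.
paths = {
    "__": 0.6 * 0.6,  # 0.36, collapses to ""
    "_a": 0.6 * 0.4,  # 0.24, collapses to "a"
    "a_": 0.4 * 0.6,  # 0.24, collapses to "a"
    "aa": 0.4 * 0.4,  # 0.16, collapses to "a"
}

# Max smearing: each label is scored by its single best path.
best_empty = paths["__"]                              # 0.36
best_a = max(paths["_a"], paths["a_"], paths["aa"])   # 0.24
print("max smearing winner:", "<empty>" if best_empty > best_a else "a")     # <empty>

# Logadd smearing: each label is scored by the sum over all of its paths
# (summing in probability space is equivalent to log-adding in log space).
sum_empty = paths["__"]                               # 0.36
sum_a = paths["_a"] + paths["a_"] + paths["aa"]       # 0.64
print("logadd smearing winner:", "<empty>" if sum_empty > sum_a else "a")    # a
```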
Regarding the emissions ("preds"): they only need to be scores; they don't even need to be normalized as probabilities. You can use log scores/preds equivalently for decoding, but you may need to adjust your hyperparameters, since a language model used for scoring may emit unnormalized scores that need to be scaled differently.
🐛 Describe the bug
I tried to decode my OCR model using ctc_decoder from torchaudio.models.decoder. I noticed that the decoded string is totally irrelevant to the given predictions! As the decoder works fine in the tutorial, the only source that may cause the error is the custom blank/space characters (which have been set to something different from the default values).

Here comes a handy example to show the malfunction of ctc_decoder. Assume we have two time steps in which the blank and the character a have probabilities 0.6 and 0.4, respectively. The probability of each path is shown in the following table (blank is shown by the _ character):

Path    Probability    Collapsed label
__      0.36           (empty)
_a      0.24           a
a_      0.24           a
aa      0.16           a

The above table implies that the most probable label is a. But running this snippet shows "  " (two spaces) as the most probable label. Ignoring the extra spaces at the start and the end of the decoded string (which is itself also undesirable behavior), the decoded string is simply wrong. I'm wondering what causes this error?

Thanks in advance
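The snippet itself did not survive in this thread; the following is a hedged reconstruction of what such a repro could look like, using the probabilities from the table above and placeholder blank/space tokens (the issue author's actual characters and arguments are not known):

```python
import torch
from torchaudio.models.decoder import ctc_decoder

# Placeholder tokens: index 0 = blank "_", index 1 = space "|", index 2 = "a".
tokens = ["_", "|", "a"]

decoder = ctc_decoder(
    lexicon=None,
    tokens=tokens,
    blank_token="_",
    sil_token="|",
    # log_add defaults to False (max smearing), which reproduces the behavior described above
)

# Two timesteps: P(blank) = 0.6, P(space) = 0.0, P(a) = 0.4.
preds = torch.tensor([[[0.6, 0.0, 0.4],
                       [0.6, 0.0, 0.4]]], dtype=torch.float32)

hypotheses = decoder(preds)
best = hypotheses[0][0]
print(best.tokens, "".join(tokens[int(i)] for i in best.tokens))
```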
Versions
(venv) aj@pc:/media/aj/ssd/compare-CTC-loss-functions$ python /tmp/collect_env.py
Collecting environment information...
PyTorch version: 1.12.0+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.9.5 (default, Nov 23 2021, 15:27:38) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-41-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.6.124
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.4.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.4.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.4.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.4.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.4.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.4.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.4.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.1
[pip3] pytorch-lightning==1.6.5
[pip3] torch==1.12.0+cu116
[pip3] torchaudio==0.12.0+cu116
[pip3] torchmetrics==0.9.2
[pip3] torchvision==0.13.0+cu116
[conda] Could not collect