pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License

ctc_decoder with custom blank/space token #2558

Closed mohamad-hasan-sohan-ajini closed 2 years ago

mohamad-hasan-sohan-ajini commented 2 years ago

🐛 Describe the bug

I tried to decode the output of my OCR model using ctc_decoder from torchaudio.models.decoder. I noticed that the decoded string is totally unrelated to the given predictions! Since the decoder works fine in the tutorial, the only likely source of the error is the custom blank/space tokens, which have been set to something different from the default values.

Here is a minimal example showing the ctc_decoder misbehaving. Assume we have two time steps in which blank and the character a have probabilities 0.6 and 0.4, respectively. The probability of each path is shown in the following table (blank is shown as the _ character):

| Path | Prob |
| --- | --- |
| `__` | 0.36 |
| `_a` | 0.24 |
| `a_` | 0.24 |
| `aa` | 0.16 |
| other paths | 0.0 |
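The table can be double-checked by brute force. The sketch below (plain Python, no torchaudio needed) enumerates every length-2 path, collapses each with the CTC rules, and sums the probabilities per collapsed label:

```python
from collections import defaultdict
from itertools import groupby, product

# Per-timestep token probabilities from the example ('_' is the blank)
probs = {'_': 0.6, ' ': 0.0, 'a': 0.4}

totals = defaultdict(float)
for path in product(probs, repeat=2):
    p = probs[path[0]] * probs[path[1]]
    if p == 0.0:
        continue
    # CTC collapse: merge consecutive repeats, then drop blanks
    label = ''.join(tok for tok, _ in groupby(path) if tok != '_')
    totals[label] += p

# totals is approximately {'': 0.36, 'a': 0.64}: once all paths that
# collapse to 'a' are summed, 'a' beats the empty (all-blank) label.
```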

The table above implies that the most probable label is a. But running the snippet below prints "  " (two spaces) as the most probable label. Even ignoring the extra spaces at the start and end of the decoded string (which is itself undesirable behavior), the decoded string is wrong. I'm wondering what causes this error?

```python
import torch
from torchaudio.models.decoder import ctc_decoder

alphabet = ['_', ' ', 'a']
blank_token = alphabet[0]
sil_token = alphabet[1]

ctc_decoder_ = ctc_decoder(
    lexicon=None,
    tokens=alphabet,
    blank_token=blank_token,
    sil_token=sil_token,
)
# Two time steps; P(blank) = 0.6, P(' ') = 0.0, P('a') = 0.4 at each step
preds = torch.FloatTensor(
    [
        [0.6, 0.0, 0.4],
        [0.6, 0.0, 0.4],
    ]
)
preds.unsqueeze_(0)  # add batch dimension: (1, 2, 3)
ctc_decoder_(preds)
```

Thanks in advance

Versions

```
(venv) aj@pc:/media/aj/ssd/compare-CTC-loss-functions$ python /tmp/collect_env.py
Collecting environment information...
PyTorch version: 1.12.0+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.9.5 (default, Nov 23 2021, 15:27:38) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-41-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.6.124
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.4.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.4.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.4.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.4.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.4.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.4.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.4.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.1
[pip3] pytorch-lightning==1.6.5
[pip3] torch==1.12.0+cu116
[pip3] torchaudio==0.12.0+cu116
[pip3] torchmetrics==0.9.2
[pip3] torchvision==0.13.0+cu116
[conda] Could not collect
```

jacobkahn commented 2 years ago

@hajix as a starting point, can you try re-running decoding with a ctc_decoder created with log_add=True? If you're smearing with max (the default strategy), you'll choose the most likely path, in which case the results you're seeing are correct.

mohamad-hasan-sohan-ajini commented 2 years ago

It leads to the correct answer! Now the decoded string is a.

[[CTCHypothesis(tokens=tensor([1, 2, 1]), words=[], score=2.214183238519948, timesteps=tensor([0, 2, 3], dtype=torch.int32))]]

Is there any advice regarding the preds? They are character probabilities at the moment; tell me if log probs or raw scores would lead to a more stable decoding process.

Thanks @jacobkahn

jacobkahn commented 2 years ago

Glad that worked. In your case, using log_add=True actually changed the smearing strategy: when merging hypotheses that collapse to the same label, the probabilities along their respective paths are log-added. The default strategy is max smearing, which keeps only the single maximum-probability path. Based on your predicted distributions, because _ is the most probable token at each frame, emitting it twice is the most probable path; and since the space has zero probability, it should never appear in any predicted output.
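As an illustration (plain Python, not the torchaudio internals), here is how the two strategies score the example from this thread:

```python
import math
from itertools import groupby

# All nonzero-probability length-2 paths from the example
# (P(_) = 0.6, P(a) = 0.4 at each of the two frames).
paths = {
    ('_', '_'): 0.36,
    ('_', 'a'): 0.24,
    ('a', '_'): 0.24,
    ('a', 'a'): 0.16,
}

def collapse(path):
    # CTC collapse: merge consecutive repeats, then drop blanks
    return ''.join(tok for tok, _ in groupby(path) if tok != '_')

by_label = {}
for path, p in paths.items():
    by_label.setdefault(collapse(path), []).append(p)

# max smearing: a label's score is its single best path
max_scores = {label: max(ps) for label, ps in by_label.items()}
# log-add smearing: sum (in log space) over all paths for the label
logadd_scores = {label: math.log(sum(ps)) for label, ps in by_label.items()}

# Under max smearing, '' wins (0.36 > 0.24); under log-add smearing,
# 'a' wins (log 0.64 > log 0.36), matching the result above.
```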

Regarding the emissions ("preds"): they only need to be scores; they don't even need to be normalized probabilities. You can equivalently use log scores/preds for decoding, but you may need to adjust your hyperparameters, since a language model doing scoring may emit unnormalized scores that need to be scaled differently.