kensho-technologies / pyctcdecode

A fast and lightweight python-based CTC beam search decoder for speech recognition.

PyPI version 0.5.0 yields different results to PyPI version 0.4.0 #107

Closed sanchit-gandhi closed 1 year ago

sanchit-gandhi commented 1 year ago

Hey pyctcdecode team 👋

Thanks for your awesome work on this library! Loving the easy integration with HF Transformers 🤗

Upgrading from PyPI version 0.4.0 to 0.5.0 yields quite different results with BeamSearchDecoderCTC. This is currently causing a failing assertion in a Transformers test: https://github.com/huggingface/transformers/pull/21226

Is this difference expected? If so we can update the test, but differences like this between versions ring alarm bells about a possible silent regression!

Code snippet to repro:

from pyctcdecode import BeamSearchDecoderCTC
import numpy as np
from multiprocessing import get_context

# load a dummy n-gram LM from the HF Hub
decoder_name = "hf-internal-testing/ngram-beam-search-decoder"
decoder = BeamSearchDecoderCTC.load_from_hf_hub(decoder_name)

def get_dummy_logits(shape=(1, 10, 16), seed=77):
    # reproducible random values of shape (batch, time, vocab), just to exercise the decoder
    np.random.seed(seed)
    return np.random.rand(*shape)

logits = get_dummy_logits()
# split the batch into a list of per-utterance (time, vocab) arrays
logits_list = [array for array in logits]

beam_width = 20
beam_prune_logp = -20.0
token_min_logp = -4.0

with get_context("fork").Pool() as pool:
    decoded_decoder_out = decoder.decode_beams_batch(
        pool,
        logits_list,
        beam_width=beam_width,
        beam_prune_logp=beam_prune_logp,
        token_min_logp=token_min_logp,
    )

# each batch element is a list of beams; for the top beam, index 0 is the decoded text,
# index 2 the logit score, and index 3 the combined LM score
decoded_decoder = [d[0][0] for d in decoded_decoder_out]
print("Decoded outputs: ", decoded_decoder)

logit_scores = [d[0][2] for d in decoded_decoder_out]
print("Logit scores: ", logit_scores)

lm_scores = [d[0][3] for d in decoded_decoder_out]
print("LM scores: ", lm_scores)

With v0.4.0:

Decoded outputs:  ['<s> </s> </s>']
Logit scores:  [-19.08310264225205]
LM scores:  [-14.58310264225205]

With v0.5.0:

Decoded outputs:  ['<s> <s> </s>']
Logit scores:  [-19.195725378349803]
LM scores:  [-14.695725378349803]

Both the decoded text and the logit/LM scores differ, and the score differences are well outside anything we could attribute to numerical precision.
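
For context, the check that fails in the Transformers test is roughly of the following form (just a sketch, not the actual test; the expected values are simply the v0.4.0 numbers above and the tolerance is arbitrary):

import numpy as np

# expected values observed with pyctcdecode 0.4.0 (copied from the output above)
expected_texts = ["<s> </s> </s>"]
expected_logit_scores = [-19.08310264225205]
expected_lm_scores = [-14.58310264225205]

# decoded_decoder, logit_scores and lm_scores come from the snippet above
assert decoded_decoder == expected_texts
# an atol this tight flags the ~0.11 shift seen after upgrading to 0.5.0
np.testing.assert_allclose(logit_scores, expected_logit_scores, atol=1e-3)
np.testing.assert_allclose(lm_scores, expected_lm_scores, atol=1e-3)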

sanchit-gandhi commented 1 year ago

Gently pinging @lopez86 :)

lopez86 commented 1 year ago

Hi @sanchit-gandhi, thanks for reporting this. Between 0.4.0 and 0.5.0 a couple of bugs were fixed that can affect scoring. Since this is a fairly contrived short test case, I'm not surprised there's a noticeable difference in scores. For more realistic inputs you should see results that are very similar, though not necessarily identical; my expectation is that in most cases the final text is more or less the same with only a small difference in scoring. If there are large differences in outputs on realistic cases, then that would indicate a problem. See https://github.com/kensho-technologies/pyctcdecode/pull/96 and https://github.com/kensho-technologies/pyctcdecode/pull/98

sanchit-gandhi commented 1 year ago

Hey @lopez86, thanks very much for getting back to me here. Indeed, this test case is deliberately contrived so that it 1) runs quickly and 2) accentuates the numerical differences that appear after upgrading to v0.5.0. Good to know that these differences should be less of a problem for more realistic use cases. I'll observe the scores for larger models and more realistic inputs and report back here if there are large numerical differences 🙌 Thanks for highlighting the PRs that contain the bug fixes, much appreciated!
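
For what it's worth, the comparison I have in mind for the realistic case is roughly the sketch below: feed the same saved log-probabilities from a real acoustic model through both pinned versions (pip install pyctcdecode==0.4.0 / ==0.5.0) and check that the decoded text matches and the scores agree within a small tolerance. The logits file name is a placeholder, not something shipped with pyctcdecode:

import numpy as np
from pyctcdecode import BeamSearchDecoderCTC

# placeholder: log-probabilities saved from a real acoustic model, shape (time, vocab)
logits = np.load("realistic_logits.npy")

decoder = BeamSearchDecoderCTC.load_from_hf_hub("hf-internal-testing/ngram-beam-search-decoder")

# decode_beams returns (text, last_lm_state, text_frames, logit_score, lm_score) per beam
text, _, _, logit_score, lm_score = decoder.decode_beams(logits, beam_width=20)[0]

# run once per pinned version and compare the three values across the two runs
print(text, logit_score, lm_score)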

sanchit-gandhi commented 1 year ago

Going to close this as the differences are insignificant for real-world use cases. Thanks for your help here, @lopez86!