From my understanding of CTC, it produces alignment-free outputs: a sequence in the transcribed output does not correspond to a unique segment of the input logits. Yet the beams produced by `_decode_logits()` (and hence `decode_beams()`) include `text_frames` for each token of output text. These seem to correspond to an arbitrary alignment, because when beams are merged in https://github.com/kensho-technologies/pyctcdecode/blob/81514695348f25e71577cb191d2a77f6b1d7f884/pyctcdecode/decoder.py#L118 the `text_frames` from the last beam/alignment processed overwrite those from any previous alignment for a given prefix.
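To make the concern concrete, here is a toy sketch of the behaviour I mean. It is not pyctcdecode's actual code: `merge_beams` and the `(text, text_frames, logit_score)` tuple layout are simplified stand-ins I made up for illustration, assuming that scores of beams with the same text are combined while the frames of the later beam replace the earlier ones.

```python
import math


def merge_beams(beams):
    """Toy merge of (text, text_frames, logit_score) tuples (hypothetical layout).

    Scores of beams sharing the same text are combined with log-sum-exp,
    while the frames of the most recently seen beam replace earlier ones.
    """
    merged = {}
    for text, frames, score in beams:
        if text in merged:
            _, prev_score = merged[text]
            hi, lo = max(prev_score, score), min(prev_score, score)
            # Numerically stable log(exp(prev_score) + exp(score)).
            score = hi + math.log1p(math.exp(lo - hi))
        # The frames stored here always come from the current (latest) beam.
        merged[text] = (frames, score)
    return merged


beams = [
    ("hi", [(0, 2)], -1.0),  # one alignment of "hi"
    ("hi", [(1, 3)], -1.5),  # a different alignment of the same text
]
merged = merge_beams(beams)
print(merged["hi"][0])                      # frames taken from the second beam
print(merge_beams(beams[::-1])["hi"][0])    # reversed order: frames from the other beam
```

In this sketch the merged score is the same whichever order the beams arrive in, but the reported frames depend entirely on processing order, which is what I mean by "arbitrary".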
Can someone confirm that these `text_frames` are indeed arbitrary in this sense? The only alternative I can imagine is that the alignments/beams are added and processed in an order such that the lowest or highest frame indices always come last and are the ones ultimately reported, but that seems unlikely to me.
And, if they are arbitrary, has anyone tried to determine how variable they are? I suppose the variance would increase with the beam width, but I'm curious whether anyone has looked into this more concretely, either as part of this repo or elsewhere.