parlance / ctcdecode

PyTorch CTC Decoder bindings
MIT License

Invalid UTF-8 #108

Open cathoderaymission opened 5 years ago

cathoderaymission commented 5 years ago
from ctcdecode import CTCBeamDecoder

vocab = ["A", "B", "C", "D", " "]
decoder = CTCBeamDecoder(vocab, beam_width=5, blank_id=vocab.index(' '), log_probs_input=True)
decoder.decode(out)
py3/lib/python3.7/site-packages/ctcdecode/__init__.py in decode(self, probs, seq_lens)
     38             ctc_decode.paddle_beam_decode(probs, seq_lens, self._labels, self._num_labels, self._beam_width, self._num_processes,
     39                                           self._cutoff_prob, self.cutoff_top_n, self._blank_id, self._log_probs,
---> 40                                           output, timesteps, scores, out_seq_len)
     41 
     42         return output, scores, timesteps, out_seq_len

RuntimeError: Invalid UTF-8

Here, out is a tensor of shape [batch, seq, probs], e.g. torch.Size([400, 300, 5]).
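
For reference, a minimal sketch of a pre-flight check on the shape described above. It only compares the last dimension of the input with the vocabulary size and confirms that the blank index is in range. The helper name `check_decoder_inputs` is hypothetical and not part of ctcdecode, and passing this check does not mean the decoder will accept the input.

```python
def check_decoder_inputs(vocab, shape, blank_id):
    """Return a list of problems found; an empty list means the basic checks passed.

    Hypothetical helper, not part of ctcdecode.
    `shape` is any tuple-like object such as (batch, seq, probs).
    """
    problems = []
    if len(shape) != 3:
        problems.append(
            f"expected a 3-D [batch, seq, probs] shape, got {len(shape)} dims"
        )
    elif shape[-1] != len(vocab):
        problems.append(
            f"last dimension is {shape[-1]} but vocab has {len(vocab)} labels"
        )
    if not 0 <= blank_id < len(vocab):
        problems.append(f"blank_id {blank_id} is outside the vocab range")
    return problems


vocab = ["A", "B", "C", "D", " "]
print(check_decoder_inputs(vocab, (400, 300, 5), vocab.index(" ")))
```

Since torch.Size is a tuple subclass, out.shape can be passed directly in place of the literal tuple.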

I've tried smaller beam widths and using one sample instead of an entire batch, and I still can't get this to work.

Nothing in the code indicates why I'm getting this error.
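
Since the message mentions UTF-8, one thing that can be ruled out on the Python side is the label list itself. Below is a minimal sketch that flags labels that are not strings, are not exactly one character long, or cannot be encoded as UTF-8. The helper name `find_bad_labels` is hypothetical, and the single-character requirement is an assumption about what the decoder expects, not something confirmed here. The vocab in this report passes all three checks, so this does not explain the error by itself.

```python
def find_bad_labels(vocab):
    """Return (index, label, reason) for every label that fails a basic check.

    Hypothetical helper, not part of ctcdecode. The one-character rule is an
    assumption about the decoder, not a documented requirement.
    """
    bad = []
    for i, label in enumerate(vocab):
        if not isinstance(label, str):
            bad.append((i, label, "not a str"))
            continue
        if len(label) != 1:
            bad.append((i, label, "not exactly one character"))
            continue
        try:
            # Lone surrogates, for example, cannot be encoded as UTF-8.
            label.encode("utf-8")
        except UnicodeEncodeError:
            bad.append((i, label, "cannot be encoded as UTF-8"))
    return bad


vocab = ["A", "B", "C", "D", " "]
print(find_bad_labels(vocab))
```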