tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

UnicodeDecodeError Librispeech #1351

Open w4-artychen opened 5 years ago

w4-artychen commented 5 years ago

Description

I have been trying to train an ASR model on Librispeech using the Transformer model. When I try to view the results with t2t-decoder, specifically with this command:

t2t-decoder --data_dir=$DECODE_FILE --problem=$PROBLEM --model=$MODEL --hparams_set=$HPARAMS --output_dir=$TRAIN_DIR --decode_from_file=$DECODE_FILE --decode_to_file=translation.txt

I get the following error:

Traceback (most recent call last):
  File "/anaconda3/envs/doodle-mac-cpu/bin/t2t-decoder", line 17, in <module>
    tf.app.run()
  File "/anaconda3/envs/doodle-mac-cpu/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/anaconda3/envs/doodle-mac-cpu/bin/t2t-decoder", line 12, in main
    t2t_decoder.main(argv)
  File "/anaconda3/envs/doodle-mac-cpu/lib/python3.6/site-packages/tensor2tensor/bin/t2t_decoder.py", line 193, in main
    decode(estimator, hp, decode_hp)
  File "/anaconda3/envs/doodle-mac-cpu/lib/python3.6/site-packages/tensor2tensor/bin/t2t_decoder.py", line 93, in decode
    checkpoint_path=FLAGS.checkpoint_path)
  File "/anaconda3/envs/doodle-mac-cpu/lib/python3.6/site-packages/tensor2tensor/utils/decoding.py", line 365, in decode_from_file
    sorted_inputs, sorted_keys = _get_sorted_inputs(filename, decode_hp.delimiter)
  File "/anaconda3/envs/doodle-mac-cpu/lib/python3.6/site-packages/tensor2tensor/utils/decoding.py", line 733, in _get_sorted_inputs
    text = f.read()
  File "/anaconda3/envs/doodle-mac-cpu/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 132, in read
    pywrap_tensorflow.ReadFromStream(self._read_buf, length, status))
  File "/anaconda3/envs/doodle-mac-cpu/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 100, in _prepare_value
    return compat.as_str_any(val)
  File "/anaconda3/envs/doodle-mac-cpu/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 107, in as_str_any
    return as_str(value)
  File "/anaconda3/envs/doodle-mac-cpu/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 80, in as_text
    return bytes_or_text.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 19: invalid continuation byte

Could anyone tell me what I am doing wrong here? TIA.

Environment:

mesh-tensorflow==0.0.5
tensor2tensor==1.11.0
tensorboard==1.12.1
tensorflow==1.12.0
tensorflow-metadata==0.9.0
tensorflow-probability==0.5.0
Python 3.6.2 :: Continuum Analytics, Inc.

Souis-41 commented 5 years ago

It turns out that decoding.decode_from_file works in a rather troublesome way.

The best approach might be to write a transcribe function yourself, similar to the one in the following tutorial:

https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/asr_transformer.ipynb

Hope it helps.
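
For context, the traceback in the original report shows _get_sorted_inputs calling f.read() and the result being decoded as UTF-8. This suggests the file passed to --decode_from_file is read as text, so a file containing non-UTF-8 bytes (for example binary audio data, or text in another encoding) would likely fail in this way. A minimal sketch for checking a file before decoding is below. It uses only the Python standard library, and the helper names first_invalid_utf8_offset and check_file are hypothetical, not part of tensor2tensor.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset where UTF-8 decoding fails, or None if it succeeds."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        return err.start
    return None


def check_file(path: str) -> Optional[int]:
    """Read a file as raw bytes and report the first invalid UTF-8 offset, if any."""
    with open(path, "rb") as f:
        return first_invalid_utf8_offset(f.read())


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python check_utf8.py FILE [FILE ...]
    for path in sys.argv[1:]:
        offset = check_file(path)
        if offset is None:
            print(f"{path}: valid UTF-8")
        else:
            print(f"{path}: invalid UTF-8 at byte offset {offset}")
```

If the check reports an offset (the report above failed at position 19), the file is probably not plain UTF-8 text, which may explain the UnicodeDecodeError.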

hl312 commented 4 years ago

Hi, have you solved it? When I run t2t-decoder, I get an error at decoding:586, inputs_ids=vocabulary.encode(inputs): "'NoneType' object has no attribute 'encode'". Should the value of decode_from_file be "the path to the real test data file" or the "test file" itself?