mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

train-student failed with UnicodeDecodeError #831

Open eu9ene opened 1 month ago

eu9ene commented 1 month ago

Is it possible that the Taskcluster fetches got corrupted?

https://firefox-ci-tc.services.mozilla.com/tasks/FC2YNEIiS0mPBWnLTpDQEw/runs/0/logs/public/logs/live.log

[task 2024-09-04T10:32:25.194Z] Traceback (most recent call last):
[task 2024-09-04T10:32:25.194Z]   File "/home/ubuntu/.local/bin/opustrainer-train", line 8, in <module>
[task 2024-09-04T10:32:25.194Z]     sys.exit(main())
[task 2024-09-04T10:32:25.194Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/opustrainer/trainer.py", line 862, in main
[task 2024-09-04T10:32:25.206Z]     for batch in state_tracker.run(trainer, batch_size=args.batch_size, chunk_size=args.chunk_size, processes=args.workers):
[task 2024-09-04T10:32:25.206Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/opustrainer/trainer.py", line 783, in run
[task 2024-09-04T10:32:25.206Z]     for batch in trainer.run(*args, **kwargs):
[task 2024-09-04T10:32:25.206Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/opustrainer/trainer.py", line 718, in run
[task 2024-09-04T10:32:25.206Z]     batch.extend(
[task 2024-09-04T10:32:25.206Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/opustrainer/trainer.py", line 718, in <genexpr>
[task 2024-09-04T10:32:25.206Z]     batch.extend(
[task 2024-09-04T10:32:25.206Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/opustrainer/trainer.py", line 261, in __next__
[task 2024-09-04T10:32:25.206Z]     self._read_line()
[task 2024-09-04T10:32:25.206Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/opustrainer/trainer.py", line 212, in _read_line
[task 2024-09-04T10:32:25.206Z]     self._next_line = self._fh.readline() # type: ignore # _fh can't be none.
[task 2024-09-04T10:32:25.206Z]   File "/usr/lib/python3.10/codecs.py", line 322, in decode
[task 2024-09-04T10:32:25.221Z]     (result, consumed) = self._buffer_decode(data, self.errors, final)
[task 2024-09-04T10:32:25.221Z] UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc5 in position 389: invalid continuation byte

gregtatum commented 1 month ago

Do we validate anywhere that our input data is valid UTF-8?
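
For context on the error itself: 0xc5 is a legal lead byte of a two-byte UTF-8 sequence, so "invalid continuation byte" means the byte that followed it was not in the range 0x80 to 0xBF. A check along the following lines could locate the offending lines before training starts. This is only a minimal sketch, not code from the pipeline: `find_invalid_utf8_lines` is a hypothetical helper, and it assumes the corpus is an uncompressed, newline-delimited text file (a compressed fetch would need to be decompressed first).

```python
import sys
from pathlib import Path
from typing import Iterator, Tuple


def find_invalid_utf8_lines(path: Path) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (line_number, offset_in_line, raw_line) for every line that is not valid UTF-8.

    Hypothetical helper for illustration; assumes an uncompressed text file.
    """
    # Open in binary mode so that Python does not try to decode the data for us.
    with open(path, "rb") as fh:
        for line_number, raw in enumerate(fh, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as err:
                # err.start is the index of the first undecodable byte within this line.
                yield line_number, err.start, raw


if __name__ == "__main__":
    # Usage: python check_utf8.py <corpus file>
    found = False
    for line_number, offset, raw in find_invalid_utf8_lines(Path(sys.argv[1])):
        found = True
        print(f"line {line_number}, byte {offset}: {raw!r}")
    sys.exit(1 if found else 0)
```

Reading in binary and decoding one line at a time keeps memory use flat on large corpora and reports which line is broken, which the traceback above does not.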