jordiclive / ControlPrefixes

Apache License 2.0

Inconsistent UnicodeEncodeError for each Config #6

Closed Charmelink closed 2 years ago

Charmelink commented 2 years ago

Hi, I am trying to run this project as described in the readme. I completed the installation and tried to run a config, but every config I have tried stops with a UnicodeEncodeError. Each traceback is slightly different; e2e_clean is the only one that makes it to training, but it also crashes with a UnicodeEncodeError after epoch 0.

Here are a couple of tracebacks as examples. For webnlg17:

Traceback (most recent call last):
  File "finetune.py", line 932, in <module>
    model = main(args)
  File "finetune.py", line 902, in main
    logger=logger,
  File "/workspace/ControlPrefixes-main/src/datatotext/lightning_base.py", line 634, in generic_train
    trainer.fit(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 607, in run_train
    self.run_sanity_check(self.lightning_module)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 864, in run_sanity_check
    _, eval_results = self.run_evaluation(max_batches=self.num_sanity_val_batches)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 742, in run_evaluation
    deprecated_eval_results = self.evaluation_loop.evaluation_epoch_end()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 189, in evaluation_epoch_end
    deprecated_results = self.run_eval_epoch_end(self.num_dataloaders)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 227, in run_eval_epoch_end
    eval_results = model.validation_epoch_end(eval_results)
  File "finetune.py", line 345, in validation_epoch_end
    convert_text(s) + "\n" for s in output_batch["target"]
UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in position 9: ordinal not in range(128)
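This first traceback is an encode error on the write side: validation_epoch_end converts the decoded targets to text and writes them out, and the destination is being written with the ASCII codec, so any accented character such as '\xe1' ('á') raises. A minimal sketch of the failure mode and the workaround (the file name here is hypothetical, not taken from the repo):

```python
# The 'ascii' codec cannot represent accented characters such as '\xe1' ('á').
target = "Málaga"

try:
    target.encode("ascii")
except UnicodeEncodeError as err:
    print(err)

# Writing with an explicit encoding sidesteps the default codec entirely:
with open("val_targets.txt", "w", encoding="utf-8") as f:
    f.write(target + "\n")
```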

For DART:

Traceback (most recent call last):
  File "finetune.py", line 932, in <module>
    model = main(args)
  File "finetune.py", line 902, in main
    logger=logger,
  File "/workspace/ControlPrefixes-main/src/datatotext/lightning_base.py", line 634, in generic_train
    trainer.fit(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
    self.call_setup_hook(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 1066, in call_setup_hook
    model.setup(stage_name)
  File "/workspace/ControlPrefixes-main/src/datatotext/lightning_base.py", line 286, in setup
    "train", self.hparams.train_batch_size, shuffle=True
  File "finetune.py", line 610, in get_dataloader
    dataset = self.get_dataset(type_path)
  File "finetune.py", line 603, in get_dataset
    **self.dataset_kwargs,
  File "/workspace/ControlPrefixes-main/src/datatotext/utils.py", line 610, in __init__
    self.src_lens = self.get_char_lens(self.src_file)
  File "/workspace/ControlPrefixes-main/src/datatotext/utils.py", line 633, in get_char_lens
    return [len(x) for x in Path(data_file).open().readlines()]
  File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 6422: ordinal not in range(128)
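The DART traceback is the mirror image on the read side: get_char_lens calls Path(data_file).open() without an encoding argument, so Python falls back to the locale's preferred encoding (ASCII in this container) and fails on the UTF-8 lead byte 0xc2. A sketch of a fix, written here as a standalone function rather than the method in utils.py:

```python
from pathlib import Path

def get_char_lens(data_file):
    # An explicit encoding makes the read independent of the container's
    # locale settings (LANG/LC_ALL), which default to ASCII in many slim
    # Docker images.
    with Path(data_file).open(encoding="utf-8") as f:
        return [len(x) for x in f.readlines()]
```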

Every config does this at some point; let me know if you need more information. I tried moving the data around, unzipping it differently, and rolling pytorch-lightning back and forward to older and newer versions, but nothing seems to work. Is there some undocumented data-processing step that needs to be done before training? Thanks, CH

jordiclive commented 2 years ago

No. It looks like UTF-8 needs to be specified on your machine when using open. I will look into it. In the meantime, can you check:

import sys
sys.getdefaultencoding()

and, if it is not utf-8, change it.
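One caveat worth checking alongside this: on Python 3, sys.getdefaultencoding() is effectively always 'utf-8'; the codec that open() falls back to when no encoding= argument is passed comes from locale.getpreferredencoding(False), and that is the value that often reports ASCII ('ANSI_X3.4-1968') inside minimal containers. A quick check of both:

```python
import locale
import sys

# The str/bytes default — 'utf-8' on any modern CPython 3 build.
print(sys.getdefaultencoding())

# What open() actually uses when no encoding= argument is given; in a
# container without a UTF-8 locale this prints 'ANSI_X3.4-1968' (ASCII).
print(locale.getpreferredencoding(False))
```

If the second value is ASCII, exporting LANG=C.UTF-8 and LC_ALL=C.UTF-8 before launching training forces UTF-8 without touching the code (PYTHONUTF8=1 also works, but only from Python 3.7 on, and the tracebacks here show Python 3.6).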

Charmelink commented 2 years ago

This seems to have been an issue with how I downloaded the project: using git clone fixed it, and it only happened when downloading directly from GitHub. Thanks!