togethercomputer / together-python

The Official Python Client for Together's API
https://pypi.org/project/together/
Apache License 2.0

ENG-1594 : Job failed due to bad user input #69

Closed by justusc 7 months ago

justusc commented 7 months ago

This PR adds a check to verify that fine-tune training files are UTF-8 compatible.
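A minimal sketch of such a check (the function name `check_utf8` is illustrative, not the actual together-python API): read the file as bytes and attempt a strict UTF-8 decode, returning whether it succeeds.

```python
def check_utf8(path: str) -> bool:
    """Return True if the file at `path` is valid UTF-8.

    Hypothetical helper for illustration; reads the whole file into
    memory, so very large training files may warrant a streaming
    variant (e.g. an incremental decoder from the codecs module).
    """
    try:
        with open(path, "rb") as f:
            # Strict decoding raises UnicodeDecodeError on any invalid byte.
            f.read().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Running this client-side before upload turns a late, opaque training-job failure into an immediate, actionable error for the user.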

linear[bot] commented 7 months ago
ENG-1594 Job failed due to bad user input

Fine-tune job failed due to an invalid Unicode character. We should modify the file uploader in the CLI to verify that the file is UTF-8 compatible before uploading.

```
Traceback (most recent call last):
  File "/app/nebula/train.py", line 600, in <module>
    main()
  File "/app/nebula/train.py", line 547, in main
    steps_per_epoch = calculate_training_steps(args, train_data_loader)
  File "/app/nebula/train.py", line 311, in calculate_training_steps
    token_count = train_data_loader.dataset.get_dataset_token_count()
  File "/app/nebula/tasks/data_loaders/data_utils.py", line 102, in get_dataset_token_count
    tokenized_datasets = raw_datasets.map(
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3097, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3474, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3353, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/app/nebula/tasks/data_loaders/data_utils.py", line 79, in tokenize_function
    examples["text"], padding=False, truncation=True, max_length=self.tokenizer.model_max_length,
  File "/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py", line 272, in __getitem__
    value = self.format(key)
  File "/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py", line 375, in format
    return self.formatter.format_column(self.pa_table.select([key]))
  File "/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py", line 441, in format_column
    column = self.python_arrow_extractor().extract_column(pa_table)
  File "/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py", line 147, in extract_column
    return pa_table.column(0).to_pylist()
  File "pyarrow/table.pxi", line 1326, in pyarrow.lib.ChunkedArray.to_pylist
  File "pyarrow/array.pxi", line 1604, in pyarrow.lib.Array.to_pylist
  File "pyarrow/scalar.pxi", line 661, in pyarrow.lib.StringScalar.as_py
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 227: invalid start byte
```
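For debugging files that hit this error, a hedged diagnostic sketch (not part of together-python; `find_invalid_utf8` is a hypothetical helper) can pinpoint the offending byte by decoding line by line, using the `start` attribute that `UnicodeDecodeError` carries:

```python
def find_invalid_utf8(path: str):
    """Return (line_number, byte_offset) of the first byte that is not
    valid UTF-8, or None if the whole file decodes cleanly.

    Illustrative only: assumes a line-oriented file (e.g. JSONL), so
    each line can be decoded independently.
    """
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as e:
                # e.start is the byte offset within this line where
                # decoding failed (cf. "position 227" in the traceback).
                return lineno, e.start
    return None
```

Reporting the line and offset lets the user open the training file and repair the exact record, instead of re-encoding blindly.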