huggingface / autotrain-advanced

🤗 AutoTrain Advanced
https://huggingface.co/autotrain
Apache License 2.0

Enough samples error: Make sure that your dataset has enough samples to at least yield one packed sequence. #764

Closed apple-1 closed 2 weeks ago

apple-1 commented 2 weeks ago

I am running a test training with a small CSV file of only 10 entries.

I tried to resolve the error by adding the following to the params:

Packing: False
Padding: Left

I also set train_split: null in yaml.config and added max sequence = 128.

Error:

ERROR | 2024-09-17 12:54:20 | autotrain.trainers.common:wrapper:120 - train has failed due to an exception: Traceback (most recent call last):
  File "C:\Users\sharm\anaconda3\lib\site-packages\datasets\builder.py", line 1775, in _prepare_split_single
    num_examples, num_bytes = writer.finalize()
  File "C:\Users\sharm\anaconda3\lib\site-packages\datasets\arrow_writer.py", line 611, in finalize
    raise SchemaInferenceError("Please pass features or at least one example when writing data")
datasets.arrow_writer.SchemaInferenceError: Please pass features or at least one example when writing data

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\sharm\anaconda3\lib\site-packages\trl\trainer\sft_trainer.py", line 642, in _prepare_packed_dataloader
    packed_dataset = Dataset.from_generator(
  File "C:\Users\sharm\anaconda3\lib\site-packages\datasets\arrow_dataset.py", line 1117, in from_generator
    return GeneratorDatasetInputStream(
  File "C:\Users\sharm\anaconda3\lib\site-packages\datasets\io\generator.py", line 47, in read
    self.builder.download_and_prepare(
  File "C:\Users\sharm\anaconda3\lib\site-packages\datasets\builder.py", line 1027, in download_and_prepare
    self._download_and_prepare(
  File "C:\Users\sharm\anaconda3\lib\site-packages\datasets\builder.py", line 1789, in _download_and_prepare
    super()._download_and_prepare(
  File "C:\Users\sharm\anaconda3\lib\site-packages\datasets\builder.py", line 1122, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "C:\Users\sharm\anaconda3\lib\site-packages\datasets\builder.py", line 1627, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "C:\Users\sharm\anaconda3\lib\site-packages\datasets\builder.py", line 1784, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\sharm\anaconda3\lib\site-packages\autotrain\trainers\common.py", line 117, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\sharm\anaconda3\lib\site-packages\autotrain\trainers\clm\__main__.py", line 28, in train
    train_sft(config)
  File "C:\Users\sharm\anaconda3\lib\site-packages\autotrain\trainers\clm\train_clm_sft.py", line 46, in train
    trainer = SFTTrainer(
  File "C:\Users\sharm\anaconda3\lib\site-packages\huggingface_hub\utils\_deprecation.py", line 101, in inner_f
    return f(*args, **kwargs)
  File "C:\Users\sharm\anaconda3\lib\site-packages\trl\trainer\sft_trainer.py", line 372, in __init__
    train_dataset = self._prepare_dataset(
  File "C:\Users\sharm\anaconda3\lib\site-packages\trl\trainer\sft_trainer.py", line 534, in _prepare_dataset
    return self._prepare_packed_dataloader(
  File "C:\Users\sharm\anaconda3\lib\site-packages\trl\trainer\sft_trainer.py", line 646, in _prepare_packed_dataloader
    raise ValueError(
ValueError: Error occurred while packing the dataset. Make sure that your dataset has enough samples to at least yield one packed sequence.

ERROR | 2024-09-17 12:54:20 | autotrain.trainers.common:wrapper:121 - Error occurred while packing the dataset. Make sure that your dataset has enough samples to at least yield one packed sequence.
INFO | 2024-09-17 12:54:21 | autotrain.parser:run:217 - Job ID: 12572

apple-1 commented 2 weeks ago

data.csv contains email subject lines in the text column and the complete email text in the next column.

apple-1 commented 2 weeks ago

Reducing the block_size to 128 or 64 does the trick.
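
For anyone wondering why a smaller block_size helps, here is a rough sketch of the arithmetic. It assumes a simplified model of packing: all samples are tokenized, joined with one separator token each, and cut into chunks of exactly block_size tokens, with any incomplete chunk dropped. The function name count_packed_sequences and the token counts are made up for illustration; this is not the actual TRL code, and real numbers depend on your tokenizer and data.

```python
def count_packed_sequences(token_counts, block_size, separator_tokens=1):
    """Estimate how many full packed sequences a dataset can yield.

    Simplified model (an assumption, not the library's implementation):
    samples are concatenated with `separator_tokens` tokens after each one,
    and only complete chunks of `block_size` tokens are kept.
    """
    total_tokens = sum(n + separator_tokens for n in token_counts)
    return total_tokens // block_size


# Hypothetical dataset: 10 short samples of about 60 tokens each.
token_counts = [60] * 10

for block_size in (1024, 128, 64):
    n = count_packed_sequences(token_counts, block_size)
    print(f"block_size={block_size}: {n} packed sequence(s)")
```

Under these assumptions, the 10 samples hold 610 tokens in total, which is not enough to fill even one 1024-token block, so the packed dataset would be empty. A block_size of 128 or 64 leaves room for several full blocks.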