Open mrbesher opened 2 weeks ago
I encountered a `TypeError` when using streaming datasets: `num_proc` does not exist in `IterableDataset.map()`.
Error logs:

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[9], line 12
      5 dataset = load_dataset("Trelis/tiny-shakespeare", streaming=True)
      7 sft_config = SFTConfig(output_dir="output",
      8                        report_to="none",
      9                        dataset_text_field="Text",
     10                        max_seq_length=8)
---> 12 trainer = SFTTrainer(
     13     model_id,
     14     args=sft_config,
     15     train_dataset=dataset["train"],
     16     eval_dataset=dataset["test"]
     17 )
     18 trainer.train()

File /opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py:101, in _deprecate_arguments.<locals>._inner_deprecate_positional_args.<locals>.inner_f(*args, **kwargs)
     99 message += "\n\n" + custom_message
    100 warnings.warn(message, FutureWarning)
--> 101 return f(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:362, in SFTTrainer.__init__(self, model, args, data_collator, train_dataset, eval_dataset, tokenizer, model_init, compute_metrics, callbacks, optimizers, preprocess_logits_for_metrics, peft_config, dataset_text_field, packing, formatting_func, max_seq_length, infinite, num_of_sequences, chars_per_token, dataset_num_proc, dataset_batch_size, neftune_noise_alpha, model_init_kwargs, dataset_kwargs, eval_packing)
    360     args.dataset_kwargs = {}
    361 if train_dataset is not None:
--> 362     train_dataset = self._prepare_dataset(
    363         train_dataset,
    364         tokenizer,
    365         args.packing,
    366         args.dataset_text_field,
    367         args.max_seq_length,
    368         formatting_func,
    369         args.num_of_sequences,
    370         args.chars_per_token,
    371         remove_unused_columns=args.remove_unused_columns if args is not None else True,
    372         **args.dataset_kwargs,
    373     )
    374 if eval_dataset is not None:
    375     _multiple = isinstance(eval_dataset, dict)

File /opt/conda/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:508, in SFTTrainer._prepare_dataset(self, dataset, tokenizer, packing, dataset_text_field, max_seq_length, formatting_func, num_of_sequences, chars_per_token, remove_unused_columns, append_concat_token, add_special_tokens, skip_prepare_dataset)
    505     return dataset
    507 if not packing:
--> 508     return self._prepare_non_packed_dataloader(
    509         tokenizer,
    510         dataset,
    511         dataset_text_field,
    512         max_seq_length,
    513         formatting_func,
    514         add_special_tokens,
    515         remove_unused_columns,
    516     )
    518 else:
    519     return self._prepare_packed_dataloader(
    520         tokenizer,
    521         dataset,
    (...)
    528         add_special_tokens,
    529     )

File /opt/conda/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:576, in SFTTrainer._prepare_non_packed_dataloader(self, tokenizer, dataset, dataset_text_field, max_seq_length, formatting_func, add_special_tokens, remove_unused_columns)
    570 if not remove_unused_columns and len(extra_columns) > 0:
    571     warnings.warn(
    572         "You passed `remove_unused_columns=False` on a non-packed dataset. This might create some issues with the default collator and yield to errors. If you want to "
    573         f"inspect dataset other columns (in this case {extra_columns}), you can subclass `DataCollatorForLanguageModeling` in case you used the default collator and create your own data collator in order to inspect the unused dataset columns."
    574     )
--> 576 tokenized_dataset = dataset.map(
    577     tokenize,
    578     batched=True,
    579     remove_columns=dataset.column_names if remove_unused_columns else None,
    580     num_proc=self.dataset_num_proc,
    581     batch_size=self.dataset_batch_size,
    582 )
    584 return tokenized_dataset

TypeError: IterableDataset.map() got an unexpected keyword argument 'num_proc'
```
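The failing call is the `dataset.map(...)` at `sft_trainer.py:576`, which forwards `num_proc` unconditionally; `Dataset.map()` accepts that keyword, but `IterableDataset.map()` in `datasets` 2.20.0 does not. A minimal sketch of the kind of guard that would avoid this, using `inspect` to check the target's signature before forwarding the keyword (the helper name and the two stand-in signatures below are illustrative, not TRL code):

```python
import inspect

def safe_map_kwargs(map_fn, num_proc=None, batch_size=1000):
    """Build kwargs for a .map() call, dropping num_proc when the
    target signature does not accept it (as with IterableDataset.map
    in datasets 2.20)."""
    kwargs = {"batched": True}
    params = inspect.signature(map_fn).parameters
    if num_proc is not None and "num_proc" in params:
        kwargs["num_proc"] = num_proc
    if "batch_size" in params:
        kwargs["batch_size"] = batch_size
    return kwargs

# Stand-ins mimicking the two signatures involved in this issue:
def dataset_map(function, batched=False, num_proc=None, batch_size=1000):
    pass  # Dataset.map accepts num_proc

def iterable_map(function, batched=False, batch_size=1000):
    pass  # IterableDataset.map (datasets 2.20) does not

print(safe_map_kwargs(dataset_map, num_proc=4))   # num_proc kept
print(safe_map_kwargs(iterable_map, num_proc=4))  # num_proc dropped
```

With a guard like this, `_prepare_non_packed_dataloader` could serve both dataset types from the same call site instead of raising the `TypeError` above.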
Reproduction Steps:

Using `trl`, `transformers`, `accelerate`, `bitsandbytes`, and `datasets` with the following versions:

```
trl==0.9.4
transformers==4.41.2
accelerate==0.31.0
bitsandbytes==0.43.1
datasets==2.20.0
```
```python
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

model_id = "/kaggle/working/toyllama"
dataset = load_dataset("Trelis/tiny-shakespeare", streaming=True)

sft_config = SFTConfig(output_dir="output",
                       report_to="none",
                       dataset_text_field="Text",
                       max_seq_length=8,
                       max_steps=10)

trainer = SFTTrainer(
    model_id,
    args=sft_config,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"]
)
trainer.train()
```
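Until the `map` call is guarded, one workaround is to pre-tokenize the streaming dataset yourself (an `IterableDataset.map()` call *without* `num_proc` works fine) and pass `dataset_kwargs={"skip_prepare_dataset": True}` so that `_prepare_dataset` never issues the failing call; `skip_prepare_dataset` appears in the `_prepare_dataset` signature in the traceback above, so it exists in trl 0.9.4. A sketch of the batch-level tokenize step, with a toy whitespace tokenizer standing in for the real `transformers` tokenizer so the shape of the transform is clear:

```python
# Illustration of the pre-tokenization workaround; the toy tokenizer
# and plain dicts are stand-ins, not real transformers/datasets calls.
MAX_SEQ_LENGTH = 8
VOCAB = {}

def toy_tokenize(text):
    """Whitespace 'tokenizer' assigning incremental ids (illustration only)."""
    return [VOCAB.setdefault(word, len(VOCAB)) for word in text.split()]

def tokenize_batch(batch):
    """Batch-level map function mirroring what SFTTrainer's internal
    tokenize produces for a non-packed dataset: input_ids truncated to
    max_seq_length, plus a matching attention_mask."""
    input_ids = [toy_tokenize(t)[:MAX_SEQ_LENGTH] for t in batch["Text"]]
    return {
        "input_ids": input_ids,
        "attention_mask": [[1] * len(ids) for ids in input_ids],
    }

batch = {"Text": ["First Citizen: Before we proceed any further, hear me speak."]}
out = tokenize_batch(batch)
print(len(out["input_ids"][0]))  # capped at MAX_SEQ_LENGTH
```

With the real libraries this would look like `tokenized = dataset["train"].map(tokenize_batch, batched=True, remove_columns=["Text"])` (note: no `num_proc`), followed by `SFTConfig(..., dataset_kwargs={"skip_prepare_dataset": True})`.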
**Environment (probably not relevant):**

* `Accelerate` version: 0.31.0
* Platform: Linux-5.15.133+-x86_64-with-glibc2.31
* Python version: 3.10.13
* Numpy version: 1.26.4
* PyTorch version (GPU?): 2.1.2 (True)
* PyTorch XPU available: False
* PyTorch NPU available: False
* PyTorch MLU available: False
* System RAM: 31.36 GB
* GPU type: Tesla T4
Same issue. Is there any solution for this?
@younesbelkada