huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl

TypeError: IterableDataset.map() got an unexpected keyword argument 'num_proc' with streaming datasets #1741

Open · mrbesher opened this issue 2 weeks ago

mrbesher commented 2 weeks ago

I encountered a `TypeError` when using streaming datasets: `IterableDataset.map()` does not accept a `num_proc` keyword argument, but `SFTTrainer` passes one unconditionally when preparing the dataset.
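
The underlying incompatibility can be demonstrated with `datasets` alone; the no-op lambda below is just a placeholder:

```python
from datasets import load_dataset

# streaming=True returns an IterableDataset, whose .map() signature
# has no num_proc parameter (unlike the map-style Dataset.map())
stream = load_dataset("Trelis/tiny-shakespeare", streaming=True)["train"]
stream.map(lambda batch: batch, batched=True, num_proc=4)  # raises this TypeError
```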

Error logs:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[9], line 12
      5 dataset = load_dataset("Trelis/tiny-shakespeare", streaming=True)
      7 sft_config = SFTConfig(output_dir="output",
      8                        report_to="none",
      9                        dataset_text_field="Text",
     10                        max_seq_length=8)
---> 12 trainer = SFTTrainer(
     13         model_id,
     14         args=sft_config,
     15         train_dataset=dataset["train"],
     16         eval_dataset=dataset["test"]
     17     )
     18 trainer.train()

File /opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py:101, in _deprecate_arguments.<locals>._inner_deprecate_positional_args.<locals>.inner_f(*args, **kwargs)
     99         message += "\n\n" + custom_message
    100     warnings.warn(message, FutureWarning)
--> 101 return f(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:362, in SFTTrainer.__init__(self, model, args, data_collator, train_dataset, eval_dataset, tokenizer, model_init, compute_metrics, callbacks, optimizers, preprocess_logits_for_metrics, peft_config, dataset_text_field, packing, formatting_func, max_seq_length, infinite, num_of_sequences, chars_per_token, dataset_num_proc, dataset_batch_size, neftune_noise_alpha, model_init_kwargs, dataset_kwargs, eval_packing)
    360     args.dataset_kwargs = {}
    361 if train_dataset is not None:
--> 362     train_dataset = self._prepare_dataset(
    363         train_dataset,
    364         tokenizer,
    365         args.packing,
    366         args.dataset_text_field,
    367         args.max_seq_length,
    368         formatting_func,
    369         args.num_of_sequences,
    370         args.chars_per_token,
    371         remove_unused_columns=args.remove_unused_columns if args is not None else True,
    372         **args.dataset_kwargs,
    373     )
    374 if eval_dataset is not None:
    375     _multiple = isinstance(eval_dataset, dict)

File /opt/conda/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:508, in SFTTrainer._prepare_dataset(self, dataset, tokenizer, packing, dataset_text_field, max_seq_length, formatting_func, num_of_sequences, chars_per_token, remove_unused_columns, append_concat_token, add_special_tokens, skip_prepare_dataset)
    505     return dataset
    507 if not packing:
--> 508     return self._prepare_non_packed_dataloader(
    509         tokenizer,
    510         dataset,
    511         dataset_text_field,
    512         max_seq_length,
    513         formatting_func,
    514         add_special_tokens,
    515         remove_unused_columns,
    516     )
    518 else:
    519     return self._prepare_packed_dataloader(
    520         tokenizer,
    521         dataset,
   (...)
    528         add_special_tokens,
    529     )

File /opt/conda/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:576, in SFTTrainer._prepare_non_packed_dataloader(self, tokenizer, dataset, dataset_text_field, max_seq_length, formatting_func, add_special_tokens, remove_unused_columns)
    570 if not remove_unused_columns and len(extra_columns) > 0:
    571     warnings.warn(
    572         "You passed `remove_unused_columns=False` on a non-packed dataset. This might create some issues with the default collator and yield to errors. If you want to "
    573         f"inspect dataset other columns (in this case {extra_columns}), you can subclass `DataCollatorForLanguageModeling` in case you used the default collator and create your own data collator in order to inspect the unused dataset columns."
    574     )
--> 576 tokenized_dataset = dataset.map(
    577     tokenize,
    578     batched=True,
    579     remove_columns=dataset.column_names if remove_unused_columns else None,
    580     num_proc=self.dataset_num_proc,
    581     batch_size=self.dataset_batch_size,
    582 )
    584 return tokenized_dataset

TypeError: IterableDataset.map() got an unexpected keyword argument 'num_proc'
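
The call at `sft_trainer.py:576` passes `num_proc` unconditionally. One possible fix on the trl side, sketched here with the names visible in the frame above (not a tested patch), would be to forward `num_proc` only for map-style datasets:

```python
from datasets import Dataset

map_kwargs = {
    "batched": True,
    "batch_size": self.dataset_batch_size,
    "remove_columns": dataset.column_names if remove_unused_columns else None,
}
# num_proc (multiprocessing) is only accepted by the map-style Dataset.map();
# the streaming IterableDataset.map() does not take it
if isinstance(dataset, Dataset):
    map_kwargs["num_proc"] = self.dataset_num_proc

tokenized_dataset = dataset.map(tokenize, **map_kwargs)
```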

Reproduction Steps:

  1. Install trl, transformers, accelerate, bitsandbytes, and datasets using the following versions:
    trl==0.9.4
    transformers==4.41.2
    accelerate==0.31.0
    bitsandbytes==0.43.1
    datasets==2.20.0
  2. Run the following code:
    
    from datasets import load_dataset
    from trl import SFTTrainer, SFTConfig

model_id = "/kaggle/working/toyllama" dataset = load_dataset("Trelis/tiny-shakespeare", streaming=True)

sft_config = SFTConfig(output_dir="output", report_to="none", dataset_text_field="Text", max_seq_length=8, max_steps=10)

trainer = SFTTrainer( model_id, args=sft_config, train_dataset=dataset["train"], eval_dataset=dataset["test"] ) trainer.train()
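
As a workaround until this is fixed, the tokenization can be done manually and the trainer's own preparation pass skipped via `dataset_kwargs={"skip_prepare_dataset": True}` (both `dataset_kwargs` and `skip_prepare_dataset` appear in the signatures in the traceback above). This is an untested sketch:

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from trl import SFTTrainer, SFTConfig

model_id = "/kaggle/working/toyllama"
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset("Trelis/tiny-shakespeare", streaming=True)

def tokenize(batch):
    # match the max_seq_length used in the failing config
    return tokenizer(batch["Text"], truncation=True, max_length=8)

# tokenize the streams ourselves, without num_proc
train_dataset = dataset["train"].map(tokenize, batched=True, remove_columns=["Text"])
eval_dataset = dataset["test"].map(tokenize, batched=True, remove_columns=["Text"])

sft_config = SFTConfig(
    output_dir="output",
    report_to="none",
    max_steps=10,
    # skip the trainer's internal tokenization pass that triggers the error
    dataset_kwargs={"skip_prepare_dataset": True},
)

trainer = SFTTrainer(
    model_id,
    args=sft_config,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```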


**Environment (probably not relevant):**

* `Accelerate` version: 0.31.0
* Platform: Linux-5.15.133+-x86_64-with-glibc2.31
* Python version: 3.10.13
* Numpy version: 1.26.4
* PyTorch version (GPU?): 2.1.2 (True)
* PyTorch XPU available: False
* PyTorch NPU available: False
* PyTorch MLU available: False
* System RAM: 31.36 GB
* GPU type: Tesla T4
maliozer commented 6 days ago

Same issue here. Is there any solution for this?

@younesbelkada