huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Allow passing a `multiprocessing_context` for the dataset #34793

Open
ierezell commented 1 day ago

Feature request

In the Hugging Face Trainer, allow passing the `multiprocessing_context` option of the PyTorch DataLoader: https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader

Motivation

For a dataset loaded across multiple CPU cores, the default fork start method sometimes causes problems (with polars, for example, whose internal thread pool does not survive a fork), and the spawn method is better suited.
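As a minimal sketch of what is being requested: `DataLoader` already accepts `multiprocessing_context` as either a start-method name or a context object from the stdlib `multiprocessing` module; the feature request is about letting the Trainer forward such a value. The `DataLoader` call below is illustrative only (it assumes `torch` and a `dataset` in scope), so it is shown as a comment:

```python
import multiprocessing as mp

# "spawn" starts a fresh interpreter for each worker instead of forking the
# parent process, so inherited state (locks, thread pools) is not carried over.
spawn_ctx = mp.get_context("spawn")
print(spawn_ctx.get_start_method())  # spawn

# Illustrative DataLoader usage (assumes torch is installed and `dataset` exists):
# from torch.utils.data import DataLoader
# loader = DataLoader(
#     dataset,
#     num_workers=4,
#     multiprocessing_context=spawn_ctx,  # or simply "spawn"
# )
```

Today this is only reachable by constructing the DataLoader yourself; the request is to expose it through the Trainer so the value reaches the DataLoader it builds internally.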

Your contribution

I could do a PR. A possible fix is to add one more parameter to Trainer and pass it down to the DataLoader.

Rocketknight1 commented 6 hours ago

This looks like a crossover datasets/Trainer issue, so cc @lhoestq @SunMarc @muellerzr