huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Local machine/cluster Beam Datasets example/tutorial #937

Closed shangw-nvidia closed 6 months ago

shangw-nvidia commented 3 years ago

Hi,

I'm wondering whether there is a non-GCP, non-Dataflow version of the example/tutorial at https://huggingface.co/docs/datasets/beam_dataset.html. I tried to migrate it to run on the DirectRunner and the SparkRunner, but I had to fix far too many runtime errors along the way, and even then I couldn't get either runner to produce the desired output.

Thanks! Shang

lhoestq commented 3 years ago

I once tried to run it on the SparkRunner, but that runner seems to have issues when it is run locally. In my experience the DirectRunner works fine, though it is clearly not memory-efficient.
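For reference, a minimal sketch of what a local DirectRunner load might look like. It assumes an older `datasets` release that still ships Beam support and accepts a `beam_runner` keyword in `load_dataset`, with `apache_beam` installed. The helper names `direct_runner_kwargs` and `load_locally` are hypothetical and only for illustration; the dataset path and config name are left as parameters.

```python
def direct_runner_kwargs(path, name):
    """Build keyword arguments for a local, Beam-backed load (hypothetical helper).

    Assumption: the installed `datasets` version still accepts `beam_runner`.
    """
    return {"path": path, "name": name, "beam_runner": "DirectRunner"}


def load_locally(path, name):
    """Load a Beam-based dataset on the local machine with the DirectRunner."""
    # Imported lazily so that direct_runner_kwargs can be used and inspected
    # even when `datasets` is not installed.
    from datasets import load_dataset

    return load_dataset(**direct_runner_kwargs(path, name))
```

Since the DirectRunner processes everything on one machine, this is probably only practical for small configurations.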

It would be awesome to make it work locally on a SparkRunner, though! Did you manage to get your processing to work?

mariosasko commented 6 months ago

We've deprecated the Beam API in datasets. As part of that, the Beam-based datasets have been converted to regular (non-Beam) datasets so that they are straightforward to use.