Add support to create a Dataset from spark dataframe

huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

https://huggingface.co/docs/datasets

Apache License 2.0

Add support to create a Dataset from spark dataframe #5678

Closed lu-wang-dl closed 1 year ago

lu-wang-dl commented 1 year ago

Feature request

Add a new API Dataset.from_spark to create a Dataset from Spark DataFrame.
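
A minimal sketch of how the requested API might be used, assuming `Dataset.from_spark` accepts a `pyspark.sql.DataFrame` and returns a `Dataset`. The helper names `spark_df_to_dataset` and `run_local_example`, the column names, and the local single-core session are illustrative assumptions, not part of the proposal.

```python
def spark_df_to_dataset(df):
    """Convert a Spark DataFrame into a Hugging Face Dataset (sketch)."""
    # Imported lazily so the helper can be defined without `datasets` installed.
    from datasets import Dataset

    # Assumption: the requested API takes the DataFrame as its first argument.
    return Dataset.from_spark(df)


def run_local_example():
    """Build a tiny DataFrame on a local Spark session and convert it."""
    from pyspark.sql import SparkSession

    # A local session is used here only for illustration.
    spark = (
        SparkSession.builder.master("local[1]")
        .appName("from_spark_sketch")
        .getOrCreate()
    )
    try:
        df = spark.createDataFrame(
            [(1, "hello"), (2, "world")], schema=["id", "text"]
        )
        return spark_df_to_dataset(df)
    finally:
        spark.stop()
```

Calling `run_local_example()` requires both `pyspark` and `datasets` to be installed; the conversion itself is a single call, so an existing Spark pipeline would only need to hand over its final DataFrame.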

Motivation

Spark is a distributed computing framework that can handle large datasets. By supporting loading Spark DataFrames directly into Hugging Face Datasets, we can take advantage of Spark to process the data in parallel.

By providing a seamless integration between these two frameworks, we make it easier for data scientists and developers to work with both Spark and Hugging Face in the same workflow.

Your contribution

We can discuss the ideas, and I can help prepare a PR for this feature.

yanzia12138 commented 1 year ago

When I read a Spark DataFrame on a multi-node Spark cluster, I get an error. Does the API (Dataset.from_spark) support reading a DataFrame on a Spark cluster and then calling save_to_disk?

Error: _pickle.PicklingError: Could not serialize object: RuntimeError: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

23/06/16 21:17:20 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)

oakkas84 commented 1 year ago

How can I run predictions on a Dataset object in Spark with multi-node cluster parallelism?

mariosasko commented 1 year ago

Addressed in #5701

lhoestq commented 2 months ago

Hi! For your information, we are working on more documentation on how to use Spark with HF Datasets repositories (without the need for the datasets library): ~~https://github.com/huggingface/datasets/issues/5678~~ Cc @lu-wang-dl @maddiedawson, let me know what you think!

lhoestq commented 2 months ago

Sorry, wrong link: https://github.com/huggingface/hub-docs/pull/1392