Closed lu-wang-dl closed 1 year ago
When I read a Spark DataFrame on a multi-node Spark cluster, I got an error. Does the API (Dataset.from_spark) support reading a DataFrame and calling save_to_disk on a Spark cluster?
Error: _pickle.PicklingError: Could not serialize object: RuntimeError: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063. 23/06/16 21:17:20 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
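For context, the traceback points at SPARK-5063: the SparkContext exists only on the driver and cannot be pickled into closures that Spark ships to executors. A minimal pure-Python illustration of the same failure mode, with no Spark required (`FakeSparkContext` is a hypothetical stand-in, not part of PySpark):

```python
import pickle
import threading

class FakeSparkContext:
    """Stand-in for SparkContext: carries state pickle cannot serialize."""
    def __init__(self):
        # Thread locks are not picklable, much like the live connections
        # and JVM handles a real SparkContext holds.
        self._lock = threading.Lock()

ctx = FakeSparkContext()

try:
    # This is effectively what Spark must do to ship a referenced object
    # to executors; referencing the context from a task forces it.
    pickle.dumps(ctx)
except TypeError as err:
    print("cannot serialize:", err)
```

This is why any code that touches the SparkContext has to run on the driver rather than inside a transformation.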
How can I perform predictions on a Dataset object in Spark with multi-node cluster parallelism?
Addressed in #5701
Hi! For your information, we are working on more documentation on how to use Spark with HF Datasets repositories (without needing the datasets library): https://github.com/huggingface/datasets/issues/5678
Cc @lu-wang-dl @maddiedawson let me know what you think !
sorry, wrong link: https://github.com/huggingface/hub-docs/pull/1392
Feature request
Add a new API
Dataset.from_spark
to create a Dataset from a Spark DataFrame.
Motivation
Spark is a distributed computing framework that can handle large datasets. By supporting loading Spark DataFrames directly into Hugging Face Datasets, we can take advantage of Spark to process the data in parallel.
By providing a seamless integration between these two frameworks, we make it easier for data scientists and developers to work with both Spark and Hugging Face in the same workflow.
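At its core, such a conversion turns Spark's row-oriented records into the column-oriented layout that Datasets uses. A minimal pure-Python sketch of that step (`rows_to_columns` is a hypothetical helper for illustration, not the proposed API; a real implementation would process each DataFrame partition on the executors and write Arrow data):

```python
def rows_to_columns(rows):
    """Turn a list of {column: value} records, as a Spark DataFrame
    partition would yield, into a {column: [values]} mapping."""
    if not rows:
        return {}
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return columns

records = [
    {"text": "hello", "label": 0},
    {"text": "world", "label": 1},
]
print(rows_to_columns(records))
# {'text': ['hello', 'world'], 'label': [0, 1]}
```

Doing this per partition on the executors, rather than collecting everything to the driver, is what would let the conversion scale to large DataFrames.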
Your contribution
We can discuss the ideas, and I can help prepare a PR for this feature.