ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Datasets] -> Adding meta during to_dask() call #27502

Closed (arnavgarg1 closed this 2 years ago)

arnavgarg1 commented 2 years ago

Description

In the Ray Dataset implementation of to_dask(), there is currently no way to pass a custom meta object when the Dask DataFrame is created, and nothing internally infers the meta from, say, a PyArrow-style schema or a Pandas DataFrame. This can cause problems when Dask partitions contain a single row with some NaNs in it: the dtypes inferred from such a partition can differ from those of the other partitions, which leads to metadata mismatch issues.

Could we add this? I have a custom implementation that works here: https://github.com/ludwig-ai/ludwig/pull/2318/files, but I believe there's a better way to do it. Happy to discuss!

Use case

I'm trying to make sure that Dask DataFrames created from Ray carry the right meta information, so that downstream tasks don't throw metadata mismatch errors.

xwjiang2010 commented 2 years ago

cc @c21 @jianoaix @clarkzinzow