In the Ray Dataset implementation of to_dask(), there currently isn't a way to pass in a custom meta object during Dask DataFrame creation, nor is there an internal implementation currently to infer the meta via a Pyarrow-esque schema or something else like a Pandas DataFrame. This can cause problems in scenarios where the Dask Partitions have single rows with some NaNs in them, causing metadata mismatch issues.
Trying to make sure that Dask DataFrames are created from Ray with the right meta information so that downstream tasks don't throw metadata mismatch issues.
Description
In the Ray Dataset implementation of
to_dask()
, there currently isn't a way to pass in a custom meta object during Dask DataFrame creation, nor is there an internal implementation currently to infer the meta via a Pyarrow-esque schema or something else like a Pandas DataFrame. This can cause problems in scenarios where the Dask Partitions have single rows with some NaNs in them, causing metadata mismatch issues.Could we add this in? I have a custom implementation here: https://github.com/ludwig-ai/ludwig/pull/2318/files that works but I do believe there's a better way to do this. Happy to discuss!
Use case
Trying to make sure that Dask DataFrames are created from Ray with the right meta information so that downstream tasks don't throw metadata mismatch issues.