Open ayushdg opened 4 years ago
This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
Will this be closed by https://github.com/rapidsai/cudf/pull/9103 @rjzamora ?
Is your feature request related to a problem? Please describe.
Often, datasets stored in the ORC format are partitioned (using the standard Hive partitioning layout), similar to Parquet. dask_cudf (and Dask DataFrame) currently supports reading partitioned Parquet datasets, but does not support reading partitioned ORC datasets.
Describe the solution you'd like
`dask_cudf.read_orc` works without errors when provided a path to a partitioned ORC dataset (similar to how this works now for `read_parquet`). If the solution is general, this could be upstreamed to Dask DataFrame as well.
Describe alternatives you've considered
Current alternatives would involve walking through subfolders and reading the ORC files separately, while using some custom logic (like looking at folder names) to determine the values for the partitioned columns.
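For illustration, the folder-walking part of that workaround might look roughly like the sketch below. It uses only the Python standard library, and it assumes every partition directory is named `key=value`. The helper name `iter_hive_partitioned_files` is hypothetical and not part of dask_cudf or cudf.

```python
import os
from typing import Dict, Iterator, Tuple


def iter_hive_partitioned_files(
    root: str, suffix: str = ".orc"
) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (file_path, partition_values) for each matching file under root.

    Hypothetical helper: partition values are taken from directory names of
    the form ``key=value`` and are returned as plain strings (no type
    inference is attempted).
    """
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a deterministic order
        rel = os.path.relpath(dirpath, root)
        partitions: Dict[str, str] = {}
        if rel != os.curdir:
            for part in rel.split(os.sep):
                key, sep, value = part.partition("=")
                if sep:  # skip directories that are not key=value
                    partitions[key] = value
        for name in sorted(filenames):
            if name.endswith(suffix):
                yield os.path.join(dirpath, name), dict(partitions)
```

Each yielded path could then be read on its own (for example with `cudf.read_orc(path)`), with the partition values assigned as extra columns before concatenating the pieces. This is exactly the manual bookkeeping that native support in `dask_cudf.read_orc` would make unnecessary.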
Additional context
Here is an example of a partitioned ORC dataset: test_orc.zip
This is the existing output when trying to read this dataset with dask_cudf