rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Dask_cudf support to read partitioned orc datasets #5967

Open ayushdg opened 4 years ago

ayushdg commented 4 years ago

Is your feature request related to a problem? Please describe.
Datasets stored in the ORC format are often partitioned (using the standard Hive partitioning layout), similar to Parquet. Dask_cudf (and dask.dataframe) currently supports reading partitioned Parquet datasets, but does not support reading partitioned ORC datasets.

Describe the solution you'd like
dask_cudf.read_orc works when given the path to a partitioned ORC dataset, without errors (similar to how read_parquet works today). If the solution is general enough, it could also be upstreamed to dask.dataframe.
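For concreteness, the requested call pattern would look something like the sketch below (test_parquet is a hypothetical Parquet copy of the same data; the read_orc call currently fails with the IsADirectoryError shown further down):

In [1]: import dask_cudf
In [2]: ddf = dask_cudf.read_parquet("test_parquet")  # hive-partitioned Parquet already works
In [3]: ddf = dask_cudf.read_orc("test_orc")          # requested: same behavior for ORC, with the
   ...:                                               # partition column inferred from folder names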

Describe alternatives you've considered
The current alternative involves walking through the subfolders and reading the ORC files separately, using custom logic (such as parsing the folder names) to determine the values of the partition columns. A rough sketch of that workaround is shown below.
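A minimal sketch of that workaround, assuming a single level of hive-style key=value subdirectories and that dask_cudf.read_orc accepts a glob of individual files (the read_partitioned_orc helper is hypothetical):

import os
import dask.dataframe as dd
import dask_cudf

def read_partitioned_orc(root):
    # Walk the hive-style subdirectories (named <column>=<value>), read the ORC
    # files in each one, and attach the partition column value manually.
    parts = []
    for entry in sorted(os.listdir(root)):
        subdir = os.path.join(root, entry)
        if not os.path.isdir(subdir) or "=" not in entry:
            continue
        column, value = entry.split("=", 1)
        ddf = dask_cudf.read_orc(os.path.join(subdir, "*.orc"))
        parts.append(ddf.assign(**{column: value}))
    return dd.concat(parts)

# e.g. df = read_partitioned_orc("test_orc")

This only handles one level of partitioning and reads the partition values as strings; a general solution would also need nested partitions and proper dtype handling, which is why built-in support would be preferable.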

Additional context
Here is an example of a partitioned ORC dataset: test_orc.zip

This is the existing output when trying to read this dataset with dask_cudf:

In [1]: import dask_cudf                                                                                                                                               
In [2]: dask_cudf.read_orc("test_orc")                                                                                                                                 
---------------------------------------------------------------------------
IsADirectoryError                         Traceback (most recent call last)
<ipython-input-2-711da5386bc8> in <module>
----> 1 dask_cudf.read_orc("test_orc")

/opt/conda/envs/rapids/lib/python3.7/site-packages/dask_cudf/io/orc.py in read_orc(path, columns, storage_options, **kwargs)
     47     nstripes_per_file = []
     48     for path in paths:
---> 49         with fs.open(path, "rb") as f:
     50             o = orc.ORCFile(f)
     51             if schema is None:

/opt/conda/envs/rapids/lib/python3.7/site-packages/fsspec/spec.py in open(self, path, mode, block_size, cache_options, **kwargs)
    842                 autocommit=ac,
    843                 cache_options=cache_options,
--> 844                 **kwargs
    845             )
    846             if not ac:

/opt/conda/envs/rapids/lib/python3.7/site-packages/fsspec/implementations/local.py in _open(self, path, mode, block_size, **kwargs)
    113         if self.auto_mkdir and "w" in mode:
    114             self.makedirs(self._parent(path), exist_ok=True)
--> 115         return LocalFileOpener(path, mode, fs=self, **kwargs)
    116 
    117     def touch(self, path, **kwargs):

/opt/conda/envs/rapids/lib/python3.7/site-packages/fsspec/implementations/local.py in __init__(self, path, mode, autocommit, fs, **kwargs)
    195         self.autocommit = autocommit
    196         self.blocksize = io.DEFAULT_BUFFER_SIZE
--> 197         self._open()
    198 
    199     def _open(self):

/opt/conda/envs/rapids/lib/python3.7/site-packages/fsspec/implementations/local.py in _open(self)
    200         if self.f is None or self.f.closed:
    201             if self.autocommit or "w" not in self.mode:
--> 202                 self.f = open(self.path, mode=self.mode)
    203             else:
    204                 # TODO: check if path is writable?

IsADirectoryError: [Errno 21] Is a directory: '/workdir/data/test_orc'

github-actions[bot] commented 3 years ago

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

beckernick commented 3 years ago

Will this be closed by https://github.com/rapidsai/cudf/pull/9103, @rjzamora?