galipremsagar commented 1 year ago

Describe the bug When an ORC file contains index columns, the reader skips them unless their names are listed in columns. The file metadata still describes the index columns, so it no longer matches the table that was actually read.

Required to reproduce: Changes in https://github.com/rapidsai/cudf/pull/12025/

Steps/Code to reproduce bug

In [1]: import cudf

In [2]: df = cudf.DataFrame({"str_col": ["a", "abb", "abc"], "a":[10, 1, 2]}, index=[10, 11, 12])

In [3]: df.to_orc("a", index=True)

In [4]: cudf.read_orc("a")
   str_col   a
10       a  10
11     abb   1
12     abc   2

In [5]: cudf.read_orc("a", columns=['a'])  
# At this stage metadata has 1 index col + 1 col, but the actual table has only 1 col,
# which leads to the following index error.
IndexError                                Traceback (most recent call last)
Cell In [5], line 1
----> 1 cudf.read_orc("a", columns=['a'])

File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/cudf/io/orc.py:372, in read_orc(filepath_or_buffer, engine, columns, filters, stripes, skiprows, num_rows, use_index, timestamp_type, use_python_file_object, storage_options, bytes_per_thread)
    368         stripes = selected_stripes
    370 if engine == "cudf":
    371     return DataFrame._from_data(
--> 372         *liborc.read_orc(
    373             filepaths_or_buffers,
    374             columns,
    375             stripes,
    376             skiprows,
    377             num_rows,
    378             use_index,
    379             timestamp_type,
    380         )
    381     )
    382 else:
    384     def read_orc_stripe(orc_file, stripe, columns):

File orc.pyx:92, in cudf._lib.orc.read_orc()

File orc.pyx:138, in cudf._lib.orc.read_orc()

File utils.pyx:309, in cudf._lib.utils.data_from_unique_ptr()

IndexError: list index out of range

In [6]: df.to_orc("a", index=False)

In [7]: cudf.read_orc("a", columns=['a'])
0  10
1   1
2   2

Expected behavior The ORC reader should behave like the Parquet reader, which always reads the index columns regardless of whether their names appear in columns.
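
To illustrate the expected column selection, here is a minimal pure-Python sketch. It is not cudf code: the helper `columns_with_index` and the index name `"idx"` are hypothetical and only show the rule "requested columns plus any index columns recorded in the file metadata".

```python
def columns_with_index(requested, index_names):
    """Hypothetical helper: return the columns to read so that index
    columns recorded in the file metadata are never dropped."""
    if requested is None:
        # None means "read every column", so index columns are already included.
        return None
    selected = list(requested)
    for name in index_names:
        # Append each index column that the caller did not ask for explicitly.
        if name not in selected:
            selected.append(name)
    return selected


# "idx" is an assumed index column name, used only for illustration.
print(columns_with_index(["a"], ["idx"]))
print(columns_with_index(["idx", "a"], ["idx"]))
```

With a selection like this, the number of columns read would match what the metadata describes (1 index column + 1 data column in the example above), so the table and the metadata would stay consistent.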

Additional context This issue surfaced while I was working on: https://github.com/rapidsai/cudf/pull/12025/
