In [1]: import cudf
In [2]: df = cudf.DataFrame({"str_col": ["a", "abb", "abc"], "a":[10, 1, 2]}, index=[10, 11, 12])
In [3]: df.to_orc("a", index=True)
In [4]: cudf.read_orc("a")
Out[4]:
   str_col   a
10       a  10
11     abb   1
12     abc   2
In [5]: cudf.read_orc("a", columns=['a'])
# At this stage metadata has 1 index col + 1 col, but the actual table has only 1 col,
# which leads to the following index error.
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In [5], line 1
----> 1 cudf.read_orc("a", columns=['a'])
File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/cudf/io/orc.py:372, in read_orc(filepath_or_buffer, engine, columns, filters, stripes, skiprows, num_rows, use_index, timestamp_type, use_python_file_object, storage_options, bytes_per_thread)
368 stripes = selected_stripes
370 if engine == "cudf":
371 return DataFrame._from_data(
--> 372 *liborc.read_orc(
373 filepaths_or_buffers,
374 columns,
375 stripes,
376 skiprows,
377 num_rows,
378 use_index,
379 timestamp_type,
380 )
381 )
382 else:
384 def read_orc_stripe(orc_file, stripe, columns):
File orc.pyx:92, in cudf._lib.orc.read_orc()
File orc.pyx:138, in cudf._lib.orc.read_orc()
File utils.pyx:309, in cudf._lib.utils.data_from_unique_ptr()
IndexError: list index out of range
In [6]: df.to_orc("a", index=False)
In [7]: cudf.read_orc("a", columns=['a'])
Out[7]:
    a
0  10
1   1
2   2
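The mismatch described in the comment at In [5] can be modelled without a GPU. The sketch below is only a toy illustration in plain Python, not the code in cudf/_lib/utils.pyx; the helper name attach_names and the index column name "index" are hypothetical. It shows why pairing names from the metadata with fewer columns than the metadata describes ends in the same IndexError.

```python
def attach_names(column_names, index_names, table_columns):
    """Toy model: pair names taken from metadata with the columns actually read."""
    names = list(index_names) + list(column_names)
    # Walk the metadata by position and look up the matching column.
    return {name: table_columns[i] for i, name in enumerate(names)}


# Metadata describes 1 index column + 1 data column ...
metadata_index = ["index"]  # hypothetical name, for illustration only
metadata_cols = ["a"]
# ... but only the requested data column was read from the file.
table = [[10, 1, 2]]

try:
    attach_names(metadata_cols, metadata_index, table)
except IndexError as exc:
    print("IndexError:", exc)
```

With an empty index list (the index=False case from In [6]), the same call succeeds because the name count and the column count agree.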
Expected behavior
The ORC reader should behave like the Parquet reader, which always reads the index columns regardless of whether their names appear in columns.
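One way to picture the requested behavior is as a small column-selection step that runs before the file is read. This is a minimal sketch under assumptions, not how cudf or its Parquet reader is implemented: columns_to_read is a hypothetical helper, and the index column name "idx" is made up for the example.

```python
def columns_to_read(requested, index_columns):
    """Return the requested columns plus any index columns that were left out."""
    if requested is None:
        # None is taken to mean "read every column", so nothing needs adding.
        return None
    missing = [name for name in index_columns if name not in requested]
    # Index columns go first, followed by the user's selection in its original order.
    return missing + list(requested)


# "idx" is a hypothetical index column name stored in the file metadata.
print(columns_to_read(["a"], ["idx"]))
```

With a step like this, the columns read from the file would always cover what the metadata says about the index, so the counts could not drift apart.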
Environment overview (please complete the following information)
Environment location: [Bare-metal]
Method of cuDF install: [from source]
Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details
Describe the bug
When an ORC file contains index columns, the reader skips them if their names are not listed in columns.
Required to reproduce: changes in https://github.com/rapidsai/cudf/pull/12025/
Steps/Code to reproduce bug
See the IPython session at the top of this report.
Additional context
This issue surfaced while I was working on https://github.com/rapidsai/cudf/pull/12025/