Closed (stmio closed this issue 1 year ago)
@vuule If you have a moment, can you look into this? This error is being raised at the Python level: https://github.com/rapidsai/cudf/blob/deec3f8f981fd89f1aa46c6aea3714fd7c7355b9/python/cudf/cudf/_lib/csv.pyx#L347-L348
I think we need to verify whether the restriction this error states, between byte range support and skipping rows, exists at the C++ level. I took a brief glance over https://github.com/rapidsai/cudf/blob/branch-23.08/cpp/src/io/csv/csv_gpu.cu but didn't see any obvious limitations in the docs or verification in the implementation. I would prefer to raise errors like this one in C++ if it is indeed a restriction of the API.
Ah, never mind. I found there is an error raised here: https://github.com/rapidsai/cudf/blob/deec3f8f981fd89f1aa46c6aea3714fd7c7355b9/cpp/include/cudf/io/csv.hpp#L576
Perhaps we can avoid the duplicate error check between Python and C++. @vuule I'd defer to your expertise here on whether checking this at both layers is necessary.
I don't think the duplicate check is necessary, since every code path hits the C++ level check. But I don't know whether the Python layer prefers to check ASAP instead of delegating to C++ (CC @galipremsagar).
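For readers wondering why a reader might refuse to combine byte ranges with row skipping at all, here is a minimal pure-Python sketch. It is an illustration under assumptions, not cudf's implementation: `split_byte_ranges` and `read_rows` are hypothetical helpers, and the sample data is made up. It shows that applying the same `skiprows` to every byte range drops rows from each range, not just from the start of the file.

```python
def split_byte_ranges(data: bytes, n_chunks: int) -> list:
    """Hypothetical helper: split data into chunks that end on line boundaries."""
    target = max(1, len(data) // n_chunks)
    chunks, start = [], 0
    while start < len(data):
        end = min(len(data), start + target)
        # Extend the chunk to the end of the line it currently ends in.
        newline = data.find(b"\n", end - 1)
        end = len(data) if newline == -1 else newline + 1
        chunks.append(data[start:end])
        start = end
    return chunks


def read_rows(chunk: bytes, skiprows: int = 0) -> list:
    """Hypothetical stand-in for a CSV reader: return lines after skipping rows."""
    return chunk.decode().splitlines()[skiprows:]


# Made-up file: three junk rows, a header row, then four data rows.
DATA = b"junk1\njunk2\njunk3\nA,B\n1,2\n3,4\n5,6\n7,8\n"

# Reading the whole file and skipping three rows gives the intended result.
whole_file = read_rows(DATA, skiprows=3)

# Skipping three rows in every byte range silently drops data rows.
chunks = split_byte_ranges(DATA, 2)
per_chunk = [row for c in chunks for row in read_rows(c, skiprows=3)]

# Skipping only in the first range recovers the intended result.
first_only = [
    row
    for i, c in enumerate(chunks)
    for row in read_rows(c, skiprows=3 if i == 0 else 0)
]
```

In this sketch the rows to skip all fall inside the first range. A real chunked reader would also have to handle the case where they span a range boundary, which may be part of why the combination is rejected outright.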
@stmio Can you use the `header` parameter to skip the invalid rows?
Hi @vuule, just tested it with the `header` parameter and it works with cudf, but not with dask_cudf:
Traceback (most recent call last):
  File "/home/sam/dask-test/main.py", line 3, in <module>
    data = dask_cudf.read_csv("./data.csv", header=3).set_index("A")
  File "/home/sam/miniconda3/envs/rapids-23.04/lib/python3.10/site-packages/dask_cudf/io/csv.py", line 90, in read_csv
    return _internal_read_csv(path=path, blocksize=blocksize, **kwargs)
  File "/home/sam/miniconda3/envs/rapids-23.04/lib/python3.10/site-packages/dask_cudf/io/csv.py", line 139, in _internal_read_csv
    meta = dask_reader(filenames[0], **kwargs1)._meta
  File "/home/sam/miniconda3/envs/rapids-23.04/lib/python3.10/site-packages/dask/dataframe/io/csv.py", line 755, in read
    return read_pandas(
  File "/home/sam/miniconda3/envs/rapids-23.04/lib/python3.10/site-packages/dask/dataframe/io/csv.py", line 618, in read_pandas
    header = b"" if header is None else parts[firstrow] + b_lineterminator
IndexError: list index out of range
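The failing expression in the last frame, `parts[firstrow] + b_lineterminator`, raises `IndexError` whenever `parts` has `firstrow` or fewer entries. The sketch below reproduces only that mechanism in plain Python. The `header_line` function and the sample bytes are hypothetical, and it assumes `parts` comes from splitting a byte sample on the line terminator. It does not establish why the list was too short in this particular run.

```python
def header_line(sample: bytes, firstrow: int, b_lineterminator: bytes = b"\n") -> bytes:
    """Hypothetical helper mirroring the expression shown in the traceback."""
    # Assumption: the sample is split into lines before the header is picked.
    parts = sample.split(b_lineterminator)
    return parts[firstrow] + b_lineterminator


# A sample with enough lines: row index 3 is the header row.
long_sample = b"junk1\njunk2\njunk3\nA,B\n1,2\n"
found = header_line(long_sample, firstrow=3)

# A sample with too few lines: index 3 does not exist in the split result.
short_sample = b"junk1\njunk2\n"
try:
    header_line(short_sample, firstrow=3)
    error = None
except IndexError as exc:
    error = str(exc)
```

Here `found` holds the header line from the longer sample, while the shorter sample produces the same `list index out of range` message as the traceback.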
Describe the bug
When reading CSV files with dask_cudf, using the `skiprows` or `skipfooter` parameters causes the following error:
Steps/Code to reproduce bug
main.py
data.csv
Running the same code with cudf instead of dask_cudf works as expected.
Full traceback: traceback.txt
Expected behavior
CSV file is read and stored as a dask dataframe, skipping the first three rows that do not contain any valid data.
Environment overview
Environment details