rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Improve cudf.read_csv empty file error message #9676

Open randerzander opened 2 years ago

randerzander commented 2 years ago

It's embarrassingly common to accidentally produce empty "CSV" files and then have a downstream system fail when it attempts to read them.

When I try to read an empty file with pandas, I get a helpful error message that identifies the problem:

import pandas as pd

with open('test.csv', 'w') as fp:
    fp.write('')

pd.read_csv('test.csv')
---------------------------------------------------------------------------
EmptyDataError                            Traceback (most recent call last)
/tmp/ipykernel_3734411/2382990058.py in <module>
      4     fp.write('')
      5 
----> 6 pd.read_csv('test.csv')

~/conda/envs/dsql-dask-main/lib/python3.8/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper

~/conda/envs/dsql-dask-main/lib/python3.8/site-packages/pandas/io/parsers/readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    584     kwds.update(kwds_defaults)
    585 
--> 586     return _read(filepath_or_buffer, kwds)
    587 
    588 

~/conda/envs/dsql-dask-main/lib/python3.8/site-packages/pandas/io/parsers/readers.py in _read(filepath_or_buffer, kwds)
    480 
    481     # Create the parser.
--> 482     parser = TextFileReader(filepath_or_buffer, **kwds)
    483 
    484     if chunksize or iterator:

~/conda/envs/dsql-dask-main/lib/python3.8/site-packages/pandas/io/parsers/readers.py in __init__(self, f, engine, **kwds)
    809             self.options["has_index_names"] = kwds["has_index_names"]
    810 
--> 811         self._engine = self._make_engine(self.engine)
    812 
    813     def close(self):

~/conda/envs/dsql-dask-main/lib/python3.8/site-packages/pandas/io/parsers/readers.py in _make_engine(self, engine)
   1038             )
   1039         # error: Too many arguments for "ParserBase"
-> 1040         return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
   1041 
   1042     def _failover_to_python(self):

~/conda/envs/dsql-dask-main/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py in __init__(self, src, **kwds)
     67         kwds["dtype"] = ensure_dtype_objs(kwds.get("dtype", None))
     68         try:
---> 69             self._reader = parsers.TextReader(self.handles.handle, **kwds)
     70         except Exception:
     71             self.handles.close()

~/conda/envs/dsql-dask-main/lib/python3.8/site-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

EmptyDataError: No columns to parse from file

When I try to read an empty CSV file with Dask-cudf (or cudf directly), the error doesn't make clear whether I ran out of memory or hit some other problem unrelated to the input file:

~/conda/envs/dsql-dask-main/lib/python3.8/site-packages/cudf-21.12.0a0+254.g84e5a03032-py3.8-linux-x86_64.egg/cudf/io/csv.py in read_csv(filepath_or_buffer, lineterminator, quotechar, quoting, doublequote, header, mangle_dupe_cols, usecols, sep, delimiter, delim_whitespace, skipinitialspace, names, dtype, skipfooter, skiprows, dayfirst, compression, thousands, decimal, true_values, false_values, nrows, byte_range, skip_blank_lines, parse_dates, comment, na_values, keep_default_na, na_filter, prefix, index_col, use_python_file_object, **kwargs)
     78         na_values = [na_values]
     79 
---> 80     return libcudf.csv.read_csv(
     81         filepath_or_buffer,
     82         lineterminator=lineterminator,

cudf/_lib/csv.pyx in cudf._lib.csv.read_csv()

cudf/_lib/csv.pyx in cudf._lib.csv.make_csv_reader_options()

cudf/_lib/io/utils.pyx in cudf._lib.io.utils.make_source_info()

IndexError: Out of bounds on buffer access (axis 0)
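
Until the reader reports this more clearly, one possible workaround is to check the file size before handing the path to the reader. The following is only a minimal sketch: `EmptyCSVError` and `ensure_nonempty_csv` are hypothetical names invented for illustration, not part of cudf or pandas, and the check only catches zero-byte files (not files that contain only whitespace or blank lines).

```python
import os


class EmptyCSVError(ValueError):
    """Hypothetical error type for this sketch; not part of cudf or pandas."""


def ensure_nonempty_csv(path):
    """Return `path` unchanged, or raise a descriptive error if the file has zero bytes."""
    if os.path.getsize(path) == 0:
        raise EmptyCSVError(f"No columns to parse from file: {path!r} is empty")
    return path
```

A call site could then look like `cudf.read_csv(ensure_nonempty_csv('test.csv'))`, so an empty input fails with a message that points at the file rather than at a buffer access.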

galipremsagar commented 2 years ago

With the recent changes to the CSV reader, the current behavior is that we return an empty DataFrame. To me, that seems like acceptable behavior:

In [4]: import cudf
   ...: 
   ...: with open('test.csv', 'w') as fp:
   ...:     fp.write('')
   ...: 

In [5]: cudf.read_csv("test.csv")
Out[5]: 
Empty DataFrame
Columns: []
Index: []

@randerzander, is this acceptable behavior in this scenario, or do you think otherwise?

vyasr commented 5 months ago

The above is still the status quo here:

>>> import pandas as pd
>>> import cudf
>>> with open("test.csv", "w") as fp:
...     fp.write("")
... 
0
>>> pd.read_csv("test.csv")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/coder/.conda/envs/rapids/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/coder/.conda/envs/rapids/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 620, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/coder/.conda/envs/rapids/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/home/coder/.conda/envs/rapids/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1898, in _make_engine
    return mapping[engine](f, **self.options)
  File "/home/coder/.conda/envs/rapids/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 93, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "parsers.pyx", line 581, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
>>> cudf.read_csv("test.csv")
Empty DataFrame
Columns: []
Index: []

Given the growth of cudf.pandas since this issue was first created, we probably want to match the pandas behavior more closely now.
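
For illustration, pandas-like behavior could be approximated on the caller's side with a thin wrapper that raises when the reader returns a frame with no columns. This is a rough sketch under stated assumptions, not how cudf would implement it: `read_csv_strict` is a hypothetical helper, the reader is passed in as a callable (for example `cudf.read_csv`), and the local `EmptyDataError` is a dependency-free stand-in for `pandas.errors.EmptyDataError`, which is likewise a `ValueError` subclass.

```python
class EmptyDataError(ValueError):
    """Local stand-in for pandas.errors.EmptyDataError (assumption: used only in this sketch)."""


def read_csv_strict(path, reader):
    """Call `reader(path)` and raise if the result has no columns.

    `reader` is any callable that returns an object with a `.columns`
    attribute, such as cudf.read_csv or pandas.read_csv.
    """
    df = reader(path)
    if len(df.columns) == 0:
        # Same message pandas uses for an empty file.
        raise EmptyDataError("No columns to parse from file")
    return df
```

With this, `read_csv_strict("test.csv", cudf.read_csv)` would raise on the empty file from the session above instead of returning an empty DataFrame. A real fix would more likely belong inside the reader itself, where it can tell an empty file apart from other ways of ending up with zero columns.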