rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

cudf raises Error regarding skiprows/headers when no skiprow or header specified #1258

Closed larryzhang95 closed 5 years ago

larryzhang95 commented 5 years ago

I'm encountering a bug when trying to read a CSV into a cuDF DataFrame.

My environment: Ubuntu 18.04, CUDA 10.0, Python 3.6. I installed the respective libraries via pip per the RAPIDS.ai website, on my own desktop machine.

The code I execute:

import cudf
import cuml

insurance_sample = cudf.read_csv('FL_insurance_sample.csv',delimiter=',')

Error Trace:

ERROR: 19 in Number of records is too small for the specified skiprows and header parameters
Traceback (most recent call last):
  File "rapids_examples.py", line 6, in <module>
    insurance_sample = cudf.read_csv('FL_insurance_sample.csv',delimiter=',')
  File "/home/lazhang/.local/lib/python3.6/site-packages/cudf/io/csv.py", line 286, in read_csv
    libgdf.read_csv(csv_reader)
  File "/home/lazhang/.local/lib/python3.6/site-packages/libgdf_cffi/wrapper.py", line 27, in wrap
    raise GDFError(errname, msg)
libgdf_cffi.wrapper.GDFError: GDF_FILE_ERROR

I will try to create a conda installation and see if this error occurs. Will update bug if the conda version works properly.

larryzhang95 commented 5 years ago

I installed via conda and ran into the same issue.

I was able to work around it by first creating the dataframe in pandas, preprocessing the data so it contained only numeric dtypes, and then loading the dataframe into cudf (cudf apparently does not support string types yet).
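A minimal sketch of that workaround, assuming pandas is available; the column names here are made up for illustration and are not from the actual FL_insurance_sample.csv:

```python
import pandas as pd

# Hypothetical miniature of the dataset: one numeric column plus one
# string (categorical) column that cudf 0.5 could not ingest directly.
df = pd.DataFrame({
    "tiv_2011": [114.3, 197.8, 190.0],
    "county": ["CLAY", "CLAY", "DUVAL"],
})

# One-hot encode the categorical column so every dtype is numeric.
encoded = pd.get_dummies(df, columns=["county"])
print(sorted(encoded.columns))  # ['county_CLAY', 'county_DUVAL', 'tiv_2011']

# The all-numeric frame could then be moved to the GPU, e.g. via
# cudf.from_pandas(encoded), or round-tripped through
# encoded.to_csv(...) followed by cudf.read_csv(...).
```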

I also exported the post-processed CSV (with one-hot encoded categorical variables) and read it back in via cudf, and that worked as well.

I wonder if there's any reason why 81 records work in cudf whereas 19 records do not. From a user standpoint, supporting strings and any type of dataset would be ideal.

mjsamoht commented 5 years ago

@larryzhang95 would you be able to share your input file FL_insurance_sample.csv, or any other file that reproduces the problem?

Also note that string support was added to cuDF in the latest 0.6 release.

larryzhang95 commented 5 years ago

The file is an open source dataset, I found a raw version of it at the following link: https://raw.githubusercontent.com/datacamp/learn-python-fundamentals/master/datasets/FL_insurance_sample.csv

I think the main issue is that there's a minimum number of columns required to import into a cudf DataFrame, at least in cudf 0.5. I'm not sure if there's an inherent reason for that.

To upgrade from 0.5 to 0.6 would I just use a conda upgrade/update command?

kkraus14 commented 5 years ago

> The file is an open source dataset, I found a raw version of it at the following link: https://raw.githubusercontent.com/datacamp/learn-python-fundamentals/master/datasets/FL_insurance_sample.csv
>
> I think the main issue is that there's a minimum number of columns required to import into a cudf DataFrame, at least in cudf 0.5. I'm not sure if there's an inherent reason for that.
>
> To upgrade from 0.5 to 0.6 would I just use a conda upgrade/update command?

@larryzhang95 we are working on releasing 0.6 now so hopefully in the next few days we'll have a conda 0.6 package that you can update to with conda update cudf=0.6*

mjsamoht commented 5 years ago

Synced to 0.7 top of tree as of 3/28/19.

import cudf
insurance_sample = cudf.read_csv('FL_insurance_sample.csv',delimiter=',')

The file FL_insurance_sample.csv is only 4MB but it takes ~10 minutes on my system before read_csv terminates with:

ERROR:  8  in read_csv: no data available for data type inference
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/tmeier/anaconda3/envs/cudf_dev/lib/python3.7/site-packages/cudf-0.7.0.dev0+93.ge0acdc61.dirty-py3.7-linux-x86_64.egg/cudf/io/csv.py", line 337, in read_csv
    libgdf.read_csv(csv_reader)
  File "/home/tmeier/anaconda3/envs/cudf_dev/lib/python3.7/site-packages/libgdf_cffi/wrapper.py", line 27, in wrap
    raise GDFError(errname, msg)
libgdf_cffi.wrapper.GDFError: GDF_INVALID_API_CALL

mjsamoht commented 5 years ago

The problem is with the file FL_insurance_sample.csv. Lines are terminated with '\r', whereas the default terminator is '\n'.

You need to specify:

insurance_sample = cudf.read_csv('FL_insurance_sample.csv',delimiter=',',lineterminator='\r')

mjsamoht commented 5 years ago

This also explains the error message seen by @larryzhang95 that the number of records is too small. Because lines are terminated with '\r' the entire file is essentially one very long row.
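That effect can be reproduced with plain Python string splitting; the data below is a made-up miniature of a '\r'-terminated file like the one in this issue:

```python
# Hypothetical file content whose lines end in '\r' only
# (classic Mac style), as in FL_insurance_sample.csv.
data = "policyID,county\r119736,CLAY\r448094,CLAY\r"

# A parser splitting records on the default '\n' terminator sees the
# whole file as a single very long record...
newline_records = [r for r in data.split("\n") if r]
print(len(newline_records))  # 1

# ...while splitting on '\r' recovers the actual rows.
cr_records = [r for r in data.split("\r") if r]
print(len(cr_records))  # 3: header + 2 data rows
```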

larryzhang95 commented 5 years ago

That makes sense. It also explains why my pandas workaround succeeded: I one-hot encoded the data, wrote the file out using .to_csv(), and then read it via cudf. The to_csv() function in pandas defaults to separator=',' (with '\n' line endings), which cudf handles well.

larryzhang95 commented 5 years ago

I wonder if this will be handled down the line (of course lower priority).

This is what I see being done in pandas:

sep : str, default ‘,’ Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.
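For reference, the stdlib sniffer the pandas docs mention can be tried directly; a small self-contained sketch (sample data made up):

```python
import csv

# A small sample with ';' as the column separator. csv.Sniffer can
# guess the delimiter from such a sample, which is roughly what
# pandas' Python engine does when sep=None.
sample = "policyID;county;tiv_2011\n119736;CLAY;114.3\n448094;CLAY;197.8\n"
dialect = csv.Sniffer().sniff(sample)
print(dialect.delimiter)  # ';'

# Note: Sniffer does not reliably detect the line terminator, so it
# would not by itself have caught the '\r'-only endings in this issue.
```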

mjsamoht commented 5 years ago

Note that sep is the delimiter for separating columns, not the line terminator. If we were to support auto-detecting it, that wouldn't help with the problem in this issue.