Closed galipremsagar closed 4 years ago
I've done a little digging into the reader's type inference functions, and it seems int64 is the only supported integer type ATM. I suppose it attempts to use int64 for the column and an overflow occurs.
https://github.com/rapidsai/cudf/blob/67c203435ec64762265061b0c63dca33153aee82/cpp/src/io/csv/reader_impl.cu#L590 https://github.com/rapidsai/cudf/blob/67c203435ec64762265061b0c63dca33153aee82/cpp/src/io/csv/reader_impl.cu#L591
cc: @jrhemstad @harrism
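To illustrate the suspected overflow, here is a minimal sketch in plain Python. It assumes two's-complement wraparound when a value above the int64 maximum is stored in an int64 slot; the actual cudf parser may behave differently. `as_int64` and `infer_integer_dtype` are hypothetical helper names, not cudf functions, and the inference rule shown is only one possible approach.

```python
import struct

INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


def as_int64(value):
    """Reinterpret the low 64 bits of a non-negative integer as a signed int64."""
    return struct.unpack("<q", struct.pack("<Q", value & UINT64_MAX))[0]


def infer_integer_dtype(values):
    """Hypothetical rule: use uint64 only when int64 cannot hold the column."""
    has_negative = any(v < 0 for v in values)
    exceeds_int64 = any(v > INT64_MAX for v in values)
    if has_negative and exceeds_int64:
        # Neither int64 nor uint64 can represent such a column.
        raise OverflowError("column fits neither int64 nor uint64")
    return "uint64" if exceeds_int64 else "int64"


# The largest uint64 value wraps to -1 when forced into int64.
print(as_int64(UINT64_MAX))  # -1
print(infer_integer_dtype([0, INT64_MAX + 1]))  # uint64
```

With a rule like this, a column is still read as int64 whenever it fits, so existing int64 data would be unaffected.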
Looks like this shares the root cause with https://github.com/rapidsai/cudf/issues/6314 @kaatish, should merging https://github.com/rapidsai/cudf/pull/6446 close this issue too?
Yes. It does look like the same problem. Getting PR 6446 merged should fix this issue.
Describe the bug
When there are unsigned types (say uint64) in a csv, pandas reads the series correctly, but in cudf the dtype of the series is inferred as int64, thus leading to data corruption after loading the csv file. A smaller version of the csv file, generated in a fuzz test, is attached here: short-data.csv.zip
Steps/Code to reproduce bug
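A minimal sketch of a reproduction along these lines, assuming a single-column CSV whose values exceed the int64 range. The attached short-data.csv.zip is not used here, and `make_csv` and `compare_readers` are hypothetical helper names. The pandas and cudf imports are deferred so the CSV helper runs without either library installed.

```python
import io

INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


def make_csv(values):
    """Return CSV text with a single column 'a' holding the given integers."""
    return "a\n" + "\n".join(str(v) for v in values) + "\n"


def compare_readers(text):
    """Read the same CSV with pandas and cudf and return both inferred dtypes."""
    import pandas as pd  # deferred: requires pandas
    import cudf  # deferred: requires a GPU environment with cudf

    pdf = pd.read_csv(io.StringIO(text))
    gdf = cudf.read_csv(io.StringIO(text))
    return pdf["a"].dtype, gdf["a"].dtype


# Values that fit in uint64 but not in int64 (assumed sample data).
csv_text = make_csv([0, INT64_MAX + 1, UINT64_MAX])

if __name__ == "__main__":
    print(compare_readers(csv_text))
```

If the two dtypes differ, comparing `pdf["a"]` with `gdf["a"].to_pandas()` should show which values were affected.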
Expected behavior
cudf.read_csv should be able to read data correctly and infer unsigned types as well.
Environment overview (please complete the following information)
Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details
Additional context
Surfaced while running fuzz tests #6001