Open andygrove opened 2 years ago
@davidwendt @ttnghia Do you think there is a bug in our device stod
function in convert_floats.cu
? I also noticed the bugfix in this function from PR #7410.
@vuule @nvdbaranec I suppose there also could be a bug in our parse_numeric
function in parsing_utils.cuh
.
My thinking is that we should use the device stod
function from cpp/src/strings
in cuIO, and also make sure the device stod
is solid. What do you guys think?
That's because we have different string-to-float parsers: One is from convert_floats.cu
and one from parsing_utils.cuh
. Maybe they are both correct, but they work differently and produce different results. "different" here is just due to floating-point round-off.
Making both of them producing exactly the same result would be difficult, because they are just different. Unless we can refactor them, extracting a shared function that both of them will just call for parsing string-to-float.
Even if we hack/fix one of them to produce the same result this time, it is not guaranteed to not produce different results the next time. The only way to completely solve such inconsistency is to use one utility function called from both places.
Even if we hack/fix one of them to produce the same result this time, it is not guaranteed to not produce different results the next time. The only way to completely solve such inconsistency is to use one utility function called from both places.
Yes, ideally we would have a single implementation. But, if we aligned them and added unit tests, would that not be sufficient to keep the consistency over time?
I would prefer to have a matching device version of std::stod
and use that everywhere. Does NVIDIA already have another implementation of this parser somewhere?
I would prefer to have a matching device version of
std::stod
and use that everywhere.
Agreed, though I'm not sure if a standard implementation of std::stod
would work because it doesn't understand unicode characters. @davidwendt I assume this would be a problem? Or do these functions assume to be working on ASCI characters?
Does NVIDIA already have another implementation of this parser somewhere?
Nope. Doing this right would mean porting the <string>
header in libcu++, which would likely be a huge lift, but extremely worthwhile.
The std::stod
depends mostly on ASCII for the digits and scientific notation, etc. Unfortunately, the standard implementation of std::stod
is technically locale dependent. This is one of the many problems with porting the <string>
device functions -- the locale classes and data have to be ported as well.
One thing to be aware of is that printing a floating point number involves converting it back to a string in base10 which is also inaccurate. So what is printed does not necessarily reflect how the bits are set in the actual float or double. That said, I will try to investigate why the bits appear to be different in the two implementations.
I think the only difference in capability is that the cuIO version supports hexadecimal values. Good news is that the number base is a template parameter, so stod
and parse_numeric
are equivalent in some instantiations.
FWIW, we can reorganize how the calling cuIO code is SFINAEd and use stod
when parsing decimal floating point values.
That gives us the desired alignment, unless parsing hexfloats (it's a thing).
Just realized that the cuIO version also allows for different decimal point characters and thousand separators :( We can't meaningfully switch to stod
without changing it.
This issue has been labeled inactive-30d
due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d
if there is no activity in the next 60 days.
Invalid floating pointer number has different behavior in stod
and parse_numeric<float>
.
in stod
, an invalid floating pointer number, returns a valid number until it becomes invalid. (use in string column convert to floats)
in parse_numeric<float>
, it returns NAN error_result = std::numeric_limits<T>::quiet_NaN()
(used in cuio)
Initially I modified stod
to take care of invalid characters similar to parse_numeric
. All tests passed. But io parsing is 20% slower, convert_floats is 6-10% slower
After @davidwendt suggestion, I updated parse_numeric
as is_float + stod
to just take care of extra decimal and thousands characters, similarly is_float(). All tests passed. But io 30% slower, convert_floats no slowdown
Computed difference for few testcases among pandas.read_csv, io (parse_numeric/cudf.read_csv), cudf_stod (cudf.to_numeric), and Cython std::stod.
[ i] pdf-stl_stod io-stl_stod cudf_stod-stl_stod string
[ 0] 0.0 0.0 0.0 0.1
[ 1] 0.0 0.0 0.0 0.2
[ 2] 0.0 5.551115123125783e-17 0.0 0.3
[ 3] -8.326672684688674e-17 0.0 0.0 0.123456789012345678901234567890123456780123456780
[ 4] -1e-21 1.88079096131566e-37 0.0 0.000000000000000000001
[ 5] 0.0 2.220446049250313e-16 0.0 1.4503599627370496999
[ 6] 0.0 -512.0 0.0 4503599627370496999
Thank you @karthikeyann for running these test cases. I think the floating point errors you showed above are within the expected range. 20% slower parsing for floats is a concern but not a deal-breaker. Is there no impact to parsing for non-float types? Would you please share the microbenchmark results?
Have you considered @vuule's comments above about decimal and thousands separators?
Is there no impact to parsing for non-float types? Would you please share the microbenchmark results?
Integral types are affected. The most impact is in float types.
./_deps/benchmark-src/tools/compare.py benchmarks OLD/CSV_READER_BENCH.json NEW/CSV_READER_BENCH.json
Comparing OLD/CSV_READER_BENCH.json to NEW/CSV_READER_BENCH.json
Benchmark Time CPU Time Old Time New CPU Old CPU New
-----------------------------------------------------------------------------------------------------------------------------------------------
CsvRead/integral_file_input/29/0/manual_time +0.1396 +0.1361 227 258 227 258
CsvRead/integral_buffer_input/29/1/manual_time +0.1888 +0.1862 196 233 195 232
CsvRead/floats_file_input/31/0/manual_time +0.2058 +0.2091 261 315 260 315
CsvRead/floats_buffer_input/31/1/manual_time +0.2132 +0.2089 229 277 229 277
CsvRead/decimal_file_input/35/0/manual_time +0.0259 +0.0245 149 153 149 153
CsvRead/decimal_buffer_input/35/1/manual_time +0.0347 +0.0347 118 123 118 122
CsvRead/timestamps_file_input/33/0/manual_time +0.1058 +0.1030 501 554 500 551
CsvRead/timestamps_buffer_input/33/1/manual_time +0.0875 +0.0853 441 479 441 478
CsvRead/durations_file_input/34/0/manual_time +0.0823 +0.0777 635 687 635 684
CsvRead/durations_buffer_input/34/1/manual_time +0.0890 +0.0890 559 609 559 609
CsvRead/string_file_input/23/0/manual_time +0.0006 +0.0083 149 149 148 149
CsvRead/string_buffer_input/23/1/manual_time +0.0110 +0.0147 125 126 124 126
CsvRead/column_selection/0/0/1/manual_time +0.0872 +0.0872 288 313 288 313
CsvRead/column_selection/1/0/1/manual_time +0.1099 +0.1114 207 230 206 229
CsvRead/column_selection/2/0/1/manual_time +0.1147 +0.1146 209 233 209 233
CsvRead/column_selection/3/0/1/manual_time +0.1056 +0.1083 211 234 211 234
CsvRead/row_selection/0/1/1/manual_time +0.0894 +0.0894 287 312 287 312
CsvRead/row_selection/0/2/1/manual_time +0.0862 +0.0932 289 314 287 314
CsvRead/row_selection/0/3/1/manual_time +0.0914 +0.0868 287 313 287 312
CsvRead/row_selection/0/1/8/manual_time +0.1829 +0.1284 268 317 266 300
CsvRead/row_selection/0/2/8/manual_time +0.0735 +0.0691 621 666 620 663
CsvRead/row_selection/0/3/8/manual_time +0.0475 +0.0535 960 1006 952 1002
./_deps/benchmark-src/tools/compare.py benchmarks OLD/STRINGS_BENCH.json NEW/STRINGS_BENCH.json
StringsToNumeric/strings_to_float32/1024/manual_time +0.0613 +0.0381 26 27 45 46
StringsToNumeric/strings_to_float32/4096/manual_time +0.0576 +0.0370 26 27 45 46
StringsToNumeric/strings_to_float32/16384/manual_time +0.0464 +0.0283 28 29 46 48
StringsToNumeric/strings_to_float32/65536/manual_time +0.0634 +0.0442 30 32 48 51
StringsToNumeric/strings_to_float32/131072/manual_time +0.1007 +0.0672 34 38 53 56
StringsToNumeric/strings_to_float64/1024/manual_time +0.0561 +0.0326 26 27 45 46
StringsToNumeric/strings_to_float64/4096/manual_time +0.0464 +0.0278 27 28 46 47
StringsToNumeric/strings_to_float64/16384/manual_time +0.0543 +0.0335 28 29 46 48
StringsToNumeric/strings_to_float64/65536/manual_time +0.0564 +0.0353 30 32 49 51
StringsToNumeric/strings_to_float64/131072/manual_time +0.0957 +0.0610 35 38 54 57
Have you considered @vuule's comments above about decimal and thousands separators?
Yes. I updated stod
to take care of user given decimal and thousands separators. All the unit tests passed before running benchmarks.
Thanks @karthikeyann for sharing this benchmark update. Looking into the integral and timestamp columns, are you using is_float + stod
, and do you think the is_integer + string_to_integer_fn
in convert_integers.cu could be better?
It seems weird that we would get a 30 ms slowdown in CsvRead
when we are converting from StringsToNumeric
in < 50 us per 100K elements.
Semi-tangential, but the casting work I'm doing right now involves a stod() function that is much more parallel. Instead of one-thread-per-string, it's one-warp-per-string.
@GregoryKimball integers are not affected by accuracy concerns. So, it does not need any change.
Semi-tangential, but the casting work I'm doing right now involves a stod() function that is much more parallel. Instead of one-thread-per-string, it's one-warp-per-string.
Great idea. memory coalescing on string will impact speedup. This could be useful for convert_float (cudf.to_numeric), but not for cuio code (Could be possible with shared memory).
This issue has been labeled inactive-30d
due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d
if there is no activity in the next 60 days.
Describe the bug The parsing of strings to floats has different results in different contexts, as discussed in https://github.com/NVIDIA/spark-rapids/issues/5035.
Parsing floats from CSV
Given the following input file:
The following code parses the floats as follows.
Parsing floats with to_numeric
Parsing the same values using
to_numeric
produces different results.Steps/Code to reproduce bug See above code examples.
Expected behavior These two paths should produce consistent results.
Environment overview (please complete the following information) Bare metal (desktop PC).
Environment details
Click here to see environment details
Additional context See https://github.com/NVIDIA/spark-rapids/issues/5035