Closed revans2 closed 3 years ago
This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.
@ttnghia can you take a look at this, since you are also taking a look at issue https://github.com/rapidsai/cudf/issues/5110 ?
Sure. This is supposed to be closed by PR#7094. I'll check that out.
Is your feature request related to a problem? Please describe.
In Spark we want to be able to check whether a string being cast to INT64, INT32, INT16, or INT8 matches the pattern Spark accepts and will not overflow. Up until now we have been using regular expressions, but they are very slow (slower than parsing the data out of Parquet, see https://github.com/NVIDIA/spark-rapids/issues/1432). We have tried using `is_integer` and `is_float` instead, and we see a lot of potential for speeding this up.

Describe the solution you'd like
I would love to see an `is_valid_integer` function added that is similar to `is_integer`. The difference would be an `allow_decimal` boolean value (or possibly a format enum) to say whether we want to check that the format is `[+-]?[0-9]+`, like `is_integer`, or that it matches `[+-]?[0-9]+(\.[0-9]+)?`, similar to `is_float` but without some of the special cases for float (E, Inf, -Inf, NaN).

Describe alternatives you've considered
I don't know of another good way to make this work. We could write this ourselves, but it feels like something that others could use too. We could also keep using regular expressions, but they are very slow and do not give us a way to check for overflow.

Additional context
This is very similar to #5110, and if we had this we could drop that.
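
To make the requested semantics concrete, here is a minimal CPU-side reference sketch in Python. It is not a cudf API: the function name `is_valid_integer` and the `allow_decimal` flag come from this request, while the `bits` parameter and the choice to drop the fractional part before the range check are assumptions made only for illustration.

```python
import re

# Patterns quoted from the request, with the decimal point escaped.
_INT_PATTERN = re.compile(r"[+-]?[0-9]+")
_DECIMAL_PATTERN = re.compile(r"[+-]?[0-9]+(\.[0-9]+)?")


def is_valid_integer(s: str, bits: int, allow_decimal: bool = False) -> bool:
    """Hypothetical reference check: format match plus overflow check.

    `bits` (8, 16, 32, or 64) is an assumed way of naming the target type.
    """
    pattern = _DECIMAL_PATTERN if allow_decimal else _INT_PATTERN
    # The whole string must match; no E, Inf, -Inf, or NaN forms are accepted.
    if pattern.fullmatch(s) is None:
        return False
    # Assumption: a fractional part, if present, is ignored for the range check.
    integral_part = s.split(".", 1)[0]
    value = int(integral_part)
    lowest = -(1 << (bits - 1))
    highest = (1 << (bits - 1)) - 1
    return lowest <= value <= highest
```

With this sketch, `is_valid_integer("128", 8)` is rejected because the value overflows INT8, and `is_valid_integer("12.5", 32)` is rejected unless `allow_decimal=True` is passed. A regular expression alone can give the format answer but not the overflow answer, which is the gap described above.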