rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] add something like an is_valid_integer function #7080

Closed revans2 closed 3 years ago

revans2 commented 3 years ago

Is your feature request related to a problem? Please describe. In Spark we want to check whether a string can be cast to INT64, INT32, INT16, or INT8, meaning that it matches the pattern Spark accepts and will not overflow. Until now we have been using regular expressions, but they are very slow (slower than parsing the data out of Parquet, see https://github.com/NVIDIA/spark-rapids/issues/1432). We have tried using is_integer and is_float instead, and we see a lot of potential for a speedup.

Describe the solution you'd like I would love to see an is_valid_integer function added that is similar to is_integer. The differences would be:

  1. It would take an allow_decimal boolean value (or possibly a format enum) to say whether the format to check is [+-]?[0-9]+, like is_integer, or [+-]?[0-9]+(\.[0-9]+)?, similar to is_float but without the float special cases (E, Inf, -Inf, NaN).
  2. It would take some kind of parameter so we could also do overflow checking.
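
To make the two points above concrete, here is a minimal CPU-side sketch in Python of the semantics being asked for. It is not a cuDF API: the name is_valid_integer, the bits parameter, and the choice to range-check only the integer part when allow_decimal is set are all assumptions made for illustration.

```python
import re

# Formats from the request above; the decimal point is matched literally.
_INT_PATTERN = re.compile(r"[+-]?[0-9]+")
_DEC_PATTERN = re.compile(r"[+-]?[0-9]+(\.[0-9]+)?")


def is_valid_integer(s: str, bits: int = 64, allow_decimal: bool = False) -> bool:
    """Hypothetical reference check: True if `s` has the accepted format and
    its integer part fits in a signed integer that is `bits` wide."""
    pattern = _DEC_PATTERN if allow_decimal else _INT_PATTERN
    if pattern.fullmatch(s) is None:
        return False
    # Assumption: with allow_decimal, fractional digits are ignored and only
    # the integer part is range-checked.
    integer_part = s.split(".", 1)[0]
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    # Reject obviously oversized inputs by digit count before converting.
    digits = integer_part.lstrip("+-").lstrip("0")
    if len(digits) > len(str(hi)):
        return False
    return lo <= int(integer_part) <= hi


if __name__ == "__main__":
    for text in ["127", "128", "-128", "1.5", "1e5", "NaN"]:
        print(text, is_valid_integer(text, bits=8, allow_decimal=True))
```

A GPU implementation would presumably do the range check digit by digit rather than through arbitrary-precision integers, but the accept/reject behaviour it needs to reproduce is the same.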

Describe alternatives you've considered I don't know of another good way to make this work. We could write this ourselves, but it feels like something that others could use too. We could also keep using regular expressions, but they are very slow and do not give us a way to check for overflow.

Additional context This is very similar to #5110 and if we had this we could drop that.

github-actions[bot] commented 3 years ago

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

sameerz commented 3 years ago

@ttnghia can you take a look at this, since you are also taking a look at issue https://github.com/rapidsai/cudf/issues/5110 ?

ttnghia commented 3 years ago

Sure. This is supposed to be closed by PR #7094. I'll check that out.