jorgecarleitao / arrow2

Transmute-free Rust library to work with the Arrow format
Apache License 2.0
1.06k stars 223 forks source link

JSON parser has issues with small f64 vales using scientific notation #1426

Closed DeliciousHair closed 1 year ago

DeliciousHair commented 1 year ago

I am cross referencing my original polars issue https://github.com/pola-rs/polars/issues/7312 as some digging around trying to find the source has led to finding this.

In https://github.com/jorgecarleitao/arrow2/blob/main/src/io/json/read/infer_schema.rs#L112-L117 it would seem that if the source json contains (small) floating point values like 2e-06, then this will cause problems as it attempts to set the dtype as Int64 and failure occurs. This problem only seems to occur for values of the form 2e-06 as opposed to 2.0e-06 as the presence of the decimal point keeps things legit.

popsu commented 1 year ago

Seems that the issue is possibly in the underlying crate used json-deserializer. Which actually has a test case that looks weird, it's basically asserting 1e-10 should be Number::Integer(...): https://github.com/jorgecarleitao/json-deserializer/blob/003cada719a3123ada049c11099c10d006c1e9df/tests/it/main.rs#L198-L211

So either

DeliciousHair commented 1 year ago

Will the sign check be enough though? That is, an IEEE 754 floating point is valid to 10^308 or something, which is larger than a 64 bit integer.

DeliciousHair commented 1 year ago

Actually, fixing this in json-deserializer would be the way to go IMHO as the above pattern, while effective, is very counterintuitive as 2e-03 would not be considered an integer in any system I'm aware of, while 1e234 makes perfect sense as an integer. Whether or not it actually fits in the consumers dtype is a problem for them to worry about.

I've worked out a fix in json-deserializer that fixes this issue and the solution actually propagates to fix the original polars issue as well, so I'll create a PR in that repo shortly.

DeliciousHair commented 1 year ago

fixed via https://github.com/jorgecarleitao/json-deserializer/issues/20