Closed DeliciousHair closed 1 year ago
Seems that the issue is possibly in the underlying crate used json-deserializer
. Which actually has a test case that looks weird, it's basically asserting 1e-10
should be Number::Integer(...)
:
https://github.com/jorgecarleitao/json-deserializer/blob/003cada719a3123ada049c11099c10d006c1e9df/tests/it/main.rs#L198-L211
So either
json-deserializer
should change it to be parsed to Number::Float
ORarrow2
should check for the "exponent" whether it starts with zero. So change the function in https://github.com/jorgecarleitao/arrow2/blob/fa2c51c632dafecf1032a1778f00940e7858eace/src/io/json/read/infer_schema.rs#L112-L117
to something like this:
fn infer_number(n: &Number) -> DataType {
match n {
Number::Float(..) => DataType::Float64,
Number::Integer(_, b) => {
if !b.is_empty() && b[0] == b'-' {
DataType::Float64
} else {
DataType::Int64
}
}
}
Will the sign check be enough though? That is, an IEEE 754 floating point is valid to 10^308 or something, which is larger than a 64 bit integer.
Actually, fixing this in json-deserializer
would be the way to go IMHO as the above pattern, while effective, is very counterintuitive as 2e-03
would not be considered an integer in any system I'm aware of, while 1e234
makes perfect sense as an integer. Whether or not it actually fits in the consumers dtype is a problem for them to worry about.
I've worked out a fix in json-deserializer
that fixes this issue and the solution actually propagates to fix the original polars
issue as well, so I'll create a PR in that repo shortly.
I am cross referencing my original
polars
issue https://github.com/pola-rs/polars/issues/7312 as some digging around trying to find the source has led to finding this.In https://github.com/jorgecarleitao/arrow2/blob/main/src/io/json/read/infer_schema.rs#L112-L117 it would seem that if the source json contains (small) floating point values like
2e-06
, then this will cause problems as it attempts to set the dtype asInt64
and failure occurs. This problem only seems to occur for values of the form2e-06
as opposed to2.0e-06
as the presence of the decimal point keeps things legit.