pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

ComputeError: could not append value when creating Polars DataFrame #17246

Open MTisMT opened 1 week ago

MTisMT commented 1 week ago

Reproducible example

test.json

import json

import pandas as pd
import polars as pl
import requests

# Download the sample data attached to this issue
response = requests.get('https://github.com/user-attachments/files/16026717/test.json')
res = json.loads(response.text)
# Creating DataFrame using Pandas (works fine)
df_pd = pd.DataFrame(res)
# Creating DataFrame using Polars (raises the error)
df = pl.DataFrame(res)

Log output

ComputeError: could not append value: 1.41431 of type: f64 to the builder; make sure that all rows have the same schema or consider increasing `infer_schema_length` it might also be that a value overflows the data-type's capacity

Issue description

I'm encountering a ComputeError when trying to create a Polars DataFrame from the attached data. Polars seems to have a problem with the value 1.41431. Pandas creates a DataFrame from the same data without issue, but I don't want to use pandas because it is slow. A similar error was discussed in a closed Polars issue on GitHub, but I still hit it despite using the latest version of Polars (0.20.31). I checked the types of the values in res and they appear to be consistent. I also tried rounding the floating-point numbers to 5 decimal places and increasing infer_schema_length, but neither helped and the same error was raised. I also asked this question on Stack Overflow. Has anyone faced a similar issue or have any insight on how to resolve it?

Expected behavior

I expected the pl.DataFrame(res) line to work the same way as the pd.DataFrame(res) line, but it raises the error above.

Installed versions

```
--------Version info---------
Polars:               0.20.31
Index type:           UInt32
Platform:             Linux-5.4.0-170-generic-x86_64-with-glibc2.36
Python:               3.10.14 (main, Jun 13 2024, 06:49:33) [GCC 12.2.0]
```
stinodego commented 1 week ago

What's in your res variable? Please create a reproducible example - without it, we cannot do much.

I can say that you should probably be using read_database or read_database_uri for this, rather than reading results into a cursor and plugging them into the DataFrame constructor.
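For reference, a minimal sketch of the suggested database-reading approach; the connection URI and query below are placeholders for illustration, not values from the original issue:

```python
import polars as pl

# Hypothetical connection URI and query, only to illustrate the suggested API
uri = "postgresql://user:password@localhost:5432/mydb"
query = "SELECT * FROM measurements"

# read_database_uri runs the query through connectorx (or ADBC) and returns a
# DataFrame directly, avoiding the manual cursor -> pl.DataFrame(...) round trip.
df = pl.read_database_uri(query, uri)
```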

MTisMT commented 1 week ago

I created the reproducible example in test.json and updated the issue. The error is still there.

MTisMT commented 1 week ago

The problem was solved by setting infer_schema_length to None, thanks to jqurious:

pl.DataFrame(res, infer_schema_length=None)

It seems the error was raised because the default value of infer_schema_length is 100, but in our data the correct type of that column cannot be inferred from the first 100 rows: the float value only appears in a row beyond the first 100.
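A small self-contained sketch of that failure mode with synthetic data (not the attached test.json); on 0.20.x this should reproduce the same ComputeError, though exact behavior may vary between versions:

```python
import polars as pl

# Synthetic rows: the first 150 values of column "x" are integers,
# and a float only appears in the final row.
rows = [{"x": i} for i in range(150)] + [{"x": 1.41431}]

# With the default infer_schema_length=100, "x" is inferred as Int64 from the
# first 100 rows, so appending 1.41431 later fails with the ComputeError above.
try:
    pl.DataFrame(rows)
except pl.exceptions.ComputeError as exc:
    print("failed:", exc)

# Scanning all rows (infer_schema_length=None) infers Float64 and succeeds.
df = pl.DataFrame(rows, infer_schema_length=None)
print(df.schema)  # {'x': Float64}
```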

MTisMT commented 1 week ago

The error message currently only suggests increasing infer_schema_length. It could be updated to: make sure that all rows have the same schema or consider increasing infer_schema_length or setting infer_schema_length to None; it might also be that a value overflows the data-type's capacity.