pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Infer schema when using pl.DataFrame and from_ methods #4489

Closed · laundmo closed this issue 2 years ago

laundmo commented 2 years ago

Polars version checks

Issue Description

When creating a DataFrame from Python data where a column only contains floats after the first value, polars fails to infer the schema. There is also no way to increase infer_schema_length or to use the other schema inference kwargs that are available on read_csv.

This is the case for: pl.DataFrame, pl.from_records, pl.from_dict, and pl.from_dicts.

Reproducible Example

```python
import polars as pl

df = pl.DataFrame({"vals": [0, 0.1]})

df = pl.from_dicts([{"a": 1, "b": 2}, {"a": 3.1, "b": 4.5}])

df = pl.from_dict({"a": [1, 2.1, 3], "b": [4, 5, 6.4]})

df = pl.from_records([[1, 2.1, 3], [4, 5, 6.4]])
```

These raise either TypeError: 'float' object cannot be interpreted as an integer or exceptions.ComputeError: Could not append Float64(3.1) to builder; make sure that all rows have the same schema.

Expected Behavior

Common simple cases should parse correctly without the user having to do anything. For that I recommend taking all of these steps:

  1. Using more than one row to infer the schema by default.
  2. Exposing the schema inference parameters on these functions (a rough sketch of what that could look like follows this list).
  3. Inferring the schema based on the first and last few types by default.
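
As a rough sketch of point 2 only: the infer_schema_length kwarg on pl.DataFrame and pl.from_dict below is a proposed addition and does not exist today; pl.from_dicts already exposes it.

```python
import polars as pl

# Proposal sketch, not current API: mirror the inference knobs read_csv has.
# infer_schema_length=None would mean "scan all rows before picking a dtype".
df = pl.DataFrame({"vals": [0, 0.1]}, infer_schema_length=None)     # proposed kwarg
df = pl.from_dict({"a": [1, 2.1, 3]}, infer_schema_length=None)     # proposed kwarg
df = pl.from_dicts([{"a": 1}, {"a": 3.1}], infer_schema_length=10)  # exists today
```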

Installed Versions

```
---Version info---
Polars: 0.14.2
Index type: UInt32
Platform: Windows-10-10.0.19044-SP0
Python: 3.10.2 (tags/v3.10.2:a58ebcc, Jan 17 2022, 14:12:15) [MSC v.1929 64 bit (AMD64)]
---Optional dependencies---
pyarrow:
pandas:
numpy:
fsspec:
connectorx:
xlsx2csv:
```
parafoxia commented 2 years ago

I can confirm this is still an issue on v0.14.2.

laundmo commented 2 years ago

> I can confirm this is still an issue on v0.14.2.

I accidentally got the version from the wrong interpreter: I was testing in a venv but got the version from my system install. Oops, updated now.

ritchie46 commented 2 years ago

In from_dict it is up to the caller to ensure all types in a column are of a given dtype. This has been discussed before and is something I consider good data hygiene.

I will take a look at the others.
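
For reference, the kind of normalisation meant by "up to the caller" looks roughly like this; it is just an illustration of promoting the values yourself, not a polars-provided helper:

```python
import polars as pl

records = [{"a": 1, "b": 2}, {"a": 3.1, "b": 4.5}]

# Promote every numeric value to float up front, so each column already has a
# single consistent Python type by the time it reaches polars.
clean = [{k: float(v) for k, v in row.items()} for row in records]
df = pl.from_dicts(clean)

# Alternatively, cast after construction once the frame exists:
# df = df.with_columns(pl.col("a").cast(pl.Float64))
```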

laundmo commented 2 years ago

Are you absolutely dead-set on that decision? Loading data from JSON, for example, will often lead to mixed ints and floats; requiring the user to manually adjust this before loading into Polars seems really unintuitive to me, especially since Polars already has the capability built in.

For large datasets loaded from APIs, having to convert the dtypes manually might actually be a considerable effort, and I think @parafoxia can speak to how unintuitive this is to a new Polars user, and to how much boilerplate code it causes just to make sure Polars gets the correct datatype.

The core issue is that you're trying to enforce data hygiene on the user by making Polars less useful - why?

I would really recommend re-evaluating that decision.

parafoxia commented 2 years ago

(Adding here that this issue spawned from a problem I ran into with a specific use case. The code is the subject of another issue, #4490, if you want to see it.)

YouTube only provides its data as JSON, so I've not got much choice.

I can understand automatic conversions between strings and floats, for example, not being a thing. However, I personally think mixed integers and floats should really be handled. As my use case shows, it's really quite awkward to convert it manually, and it's not exactly like I can just go to YouTube and tell them to fix their (admittedly very jank) API.

The infer schema kwarg to certain methods very much suggests that this is handled, at least to me, and the fact that it isn't is undesired behaviour. If all elements in a column must be the same dtype, why scan 50 rows by default?

I'm more enquiring than anything, as I'm a bit confused as to why the behaviour is as it is, especially from a usability standpoint.

parafoxia commented 2 years ago

For the record, that particular use case has been addressed by passing an already formed Arrow table to Polars. Arrow handles this use case automatically, so I do find it a little interesting that Polars does not, considering it's built on top of Arrow (from my understanding at least).
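
For anyone else hitting this, the workaround looks roughly like the following (assuming pyarrow is installed; Table.from_pylist needs a reasonably recent pyarrow):

```python
import polars as pl
import pyarrow as pa

records = [{"a": 1, "b": 2}, {"a": 3.1, "b": 4.5}]

# pyarrow's own inference promotes the mixed int/float columns to a
# floating-point type while building the table, so polars receives a
# consistent schema.
table = pa.Table.from_pylist(records)
df = pl.from_arrow(table)
print(df.dtypes)  # both columns should come out as Float64 here
```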

ritchie46 commented 2 years ago

> For the record, that particular use case has been addressed by passing an already formed Arrow table to Polars. Arrow handles this use case automatically, so I do find it a little interesting that Polars does not, considering it's built on top of Arrow (from my understanding at least).

Polars uses the arrow specification. We come with our own + arrow2's implementation of the arrow spec. Every implementation of the arrow spec (which is a data format) can do completely different things with regard to its API.

> I can understand automatic conversions between strings and floats, for example, not being a thing. However, I personally think mixed integers and floats should really be handled. As my use case shows, it's really quite awkward to convert it manually, and it's not exactly like I can just go to YouTube and tell them to fix their (admittedly very jank) API.

I understand. I will see if we can make that easier.

> The infer schema kwarg to certain methods very much suggests that this is handled, at least to me, and the fact that it isn't is undesired behaviour. If all elements in a column must be the same dtype, why scan 50 rows by default?

Because the scanning is expensive. We don't want to scan a 20GB csv file only to determine the schema we must use.
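
For reference, this is the knob read_csv exposes today ("data.csv" is just a placeholder path):

```python
import polars as pl

# read_csv only scans a bounded prefix of the file to infer dtypes.
df = pl.read_csv("data.csv", infer_schema_length=1000)

# infer_schema_length=None forces a full scan, which is exactly the
# "read 20GB just to find the schema" cost being avoided by default.
df_full = pl.read_csv("data.csv", infer_schema_length=None)
```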

laundmo commented 2 years ago

> Because the scanning is expensive. We don't want to scan a 20GB csv file only to determine the schema we must use.

But the CSV method is one of the only ones which supports infer_schema_length, while the others don't. Yes, scanning 20GB is a lot of work, but if you look at the solutions I proposed, I never said to scan everything, only to expose the parameters that control the scanning, so that it can be adjusted to the use case like with CSVs. That, plus defaults which go beyond just the first row, is all I'm asking for.

parafoxia commented 2 years ago

> Polars uses the arrow specification. We come with our own + arrow2's implementation of the arrow spec. Every implementation of the arrow spec (which is a data format) can do completely different things with regard to its API.

I see, that makes sense. I'm new to Polars so I'm still understanding how it all works.

> I understand. I will see if we can make that easier.

Thank you, I appreciate that! If there's any way I can be of help through the thought process, feel free to hit me up; I'd be happy to help. I appreciate that maintaining good data hygiene standards is important, and that should definitely be considered.

> Because the scanning is expensive. We don't want to scan a 20GB csv file only to determine the schema we must use.

My point was less this, and more that, from the behaviour I've observed, it tends to only scan the first row. Even when I converted the data to a format from_dicts wanted, setting the schema inference length to 50 didn't change anything, even though the first float row in that column was row 9.

laundmo commented 2 years ago

I should add: this is the full error for pl.from_dicts, the only one which includes infer_schema_length with a default of 50, but it still fails for even just 2 rows.

```
>>> df = pl.from_dicts([{"a": 1, "b": 2}, {"a": 3.1, "b": 4.5}])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "site-packages\polars\convert.py", line 109, in from_dicts
    return DataFrame._from_dicts(dicts, infer_schema_length)
  File "site-packages\polars\internals\dataframe\frame.py", line 334, in _from_dicts
    pydf = PyDataFrame.read_dicts(data, infer_schema_length)
exceptions.ComputeError: Could not append Float64(3.1) to builder; make sure that all rows have the same schema.
Or consider increasing the the 'schema_inference_length' argument.
```

ritchie46 commented 2 years ago

> I should add: this is the full error for pl.from_dicts, the only one which includes infer_schema_length with a default of 50, but it still fails for even just 2 rows.

Yeah, currently it searches for the first non-null in the N rows given. But we should update that to determine the supertype of the first N rows.
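
A toy illustration of the difference, in plain Python (not polars internals): instead of taking the type of the first non-null value, inference would fold the first N values into their common supertype:

```python
# Toy sketch only: pick a supertype over the first N values instead of
# trusting the first non-null value alone.
def infer_dtype(values, n=50):
    seen = {type(v) for v in values[:n] if v is not None}
    if seen <= {int}:
        return "Int64"
    if seen <= {int, float}:
        return "Float64"  # int + float promote to Float64
    raise TypeError(f"no common supertype for {seen}")

print(infer_dtype([1, 3.1, 4]))   # Float64
print(infer_dtype([None, 1, 2]))  # Int64
```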

laundmo commented 2 years ago

That PR doesn't seem to allow custom inference lengths from Python; are you sure this should be closed, or should I open a new one?

ritchie46 commented 2 years ago

Could add those. 👍