Closed · laundmo closed this issue 2 years ago
I can confirm this is still an issue on v0.14.2.
I accidentally got the version from the wrong interpreter - was testing in a venv but got the version from my system install. Oops, updated now.
In from_dict it is up to the caller to ensure all types in a column are of a given dtype. This has been discussed before and is something I consider good data hygiene.
I will take a look at the others.
Are you absolutely dead-set on that decision? Loading data from JSON, for example, will often lead to mixed ints and floats; requiring the user to manually adjust this before loading into Polars seems really unintuitive to me, especially since Polars already has the capability built in.
For large datasets loaded from APIs this might actually be a considerable effort, having to convert the dtypes manually, and I think @parafoxia can speak to how unintuitive this is to a new Polars user - and how much boilerplate code it causes just to make sure Polars gets the correct datatype.
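To make that concrete, the boilerplate ends up looking something like this (a rough sketch; the record data and the float_columns set are made up for illustration):

```python
import polars as pl

# Hypothetical API payload: "ratio" starts as an int and later becomes a float,
# so schema inference based on the first row picks an integer dtype and then fails.
records = [{"views": 10, "ratio": 1}, {"views": 25, "ratio": 0.4}]
float_columns = {"ratio"}  # the caller has to know these up front

normalised = [
    {key: float(val) if key in float_columns else val for key, val in row.items()}
    for row in records
]
df = pl.from_dicts(normalised)  # works, because every "ratio" value is now a float
```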
The core issue is that you're trying to enforce data hygiene on the user by making Polars less useful - why?
I would really recommend re-evaluating that decision.
(Adding here this issue has spawned from an issue I ran into with a specific use case. The code is subject to another issue, #4490, if you want to see it.)
YouTube only provides its data in a JSON, so I've not got much choice.
I can understand automatic conversions between strings and floats, for example, not being a thing. However, I personally think mixed integers and floats should really be handled. As my use case shows, it's really quite awkward to convert it manually, and it's not exactly like I can just go to YouTube and tell them to fix their (admittedly very jank) API.
The infer schema kwarg to certain methods very much suggests that this is handled, at least to me, and the fact that it isn't is undesired behaviour. If all elements in a column must be the same dtype, why scan 50 rows by default?
I'm more enquiring than anything, as I'm a bit confused as to why the behaviour is as it is, especially from a usability standpoint.
For the record, that particular use case has been addressed by passing an already formed Arrow table to Polars. Arrow handles this use case automatically, so I do find it a little interesting that Polars does not, considering its built on top of Arrow (from my understanding at least).
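Roughly, that workaround looks like this (a minimal sketch; the sample data mirrors the traceback further down):

```python
import pyarrow as pa
import polars as pl

# pyarrow's type inference promotes a column with mixed ints and floats to
# double, so building the Arrow table first avoids the Polars schema error.
table = pa.Table.from_pylist([{"a": 1, "b": 2}, {"a": 3.1, "b": 4.5}])
df = pl.from_arrow(table)  # column "a" comes through as Float64
```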
Polars uses the arrow specification. We come with our own + arrow2's implementation of the arrow spec. Every implementation of the arrow spec (which is a data format), can do completely different things with regard to its API.
I can understand automatic conversions between strings and floats, for example, not being a thing. However, I personally think mixed integers and floats should really be handled. As my use case shows, it's really quite awkward to convert it manually, and it's not exactly like I can just go to YouTube and tell them to fix their (admittedly very jank) API.
I understand. I will see if we can make that easier.
The infer schema kwarg to certain methods very much suggests that this is handled, at least to me, and the fact that it isn't is undesired behaviour. If all elements in a column must be the same dtype, why scan 50 rows by default?
Because the scanning is expensive. We don't want to scan a 20GB csv file only to determine the schema we must use.
But the CSV method is one of the only ones which supports infer_schema_length, while others don't. Yes, scanning 20GB is a lot of work, but if you look at the solutions I proposed, I never said to scan everything - only to expose the options that control the scanning, so that it can be adjusted to the use case like with CSVs. That, plus defaults which go beyond just the first row, is all I'm asking for.
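For comparison, this is the kind of control read_csv already exposes (the file name here is just an example):

```python
import polars as pl

# read_csv lets the caller tune how many rows are scanned for schema inference.
df = pl.read_csv("videos.csv", infer_schema_length=1000)

# Nothing equivalent is exposed on pl.DataFrame / from_dict / from_records,
# so the same mixed int/float data has to be cleaned up by hand first.
```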
Polars uses the arrow specification. We come with our own + arrow2's implementation of the arrow spec. Every implementation of the arrow spec (which is a data format), can do completely different things with regard to its API.
I see, that makes sense. I'm new to Polars so I'm still understanding how it all works.
I understand. I will see if we can make that easier.
Thank you, I appreciate that! If there's any way I can be of help through the thought process, feel free to hit me up -- I'd be happy to help. I appreciate that maintaining good data hygiene standards is important, and that should definitely be considered.
Because the scanning is expensive. We don't want to scan a 20GB csv file only to determine the schema we must use.
My point was less this, and more that, from behaviour I've observed, it tends to only scan the first row. Even when I converted the data to the format from_dicts wanted, setting the schema inference length to 50 didn't change anything, even though the first float row in that column was row 9.
I should add: this is the full error for pl.from_dicts, the only one which includes infer_schema_length with a default of 50, but it still fails for even just 2 rows.
>>> df = pl.from_dicts([{"a": 1, "b": 2}, {"a": 3.1, "b": 4.5}])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "site-packages\polars\convert.py", line 109, in from_dicts
return DataFrame._from_dicts(dicts, infer_schema_length)
File "site-packages\polars\internals\dataframe\frame.py", line 334, in _from_dicts
pydf = PyDataFrame.read_dicts(data, infer_schema_length)
exceptions.ComputeError: Could not append Float64(3.1) to builder; make sure that all rows have the same schema.
Or consider increasing the the 'schema_inference_length' argument.
Yeah, currently it searches for the first non-null in the N rows given. But we should update that to determine the supertype of the first N rows.
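Illustratively, determining the supertype of the first N rows would mean something like this (a plain-Python sketch, not Polars' actual Rust implementation):

```python
def infer_supertype(rows: list[dict], key: str, n: int = 50) -> type:
    """Look at the first n non-null values of a column and pick a common type."""
    seen = {type(row[key]) for row in rows[:n] if row.get(key) is not None}
    if seen <= {int}:
        return int
    if seen <= {int, float}:
        return float  # float is the supertype of int and float
    raise TypeError(f"no common supertype for {seen}")

# infer_supertype([{"a": 1}, {"a": 3.1}], "a") -> float, instead of failing on row 2
```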
That PR doesn't seem to allow custom inference lengths from Python - are you sure this should be closed, or should I open a new one?
Could add those. 👍
Polars version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of polars.
Issue Description
When creating a DataFrame from a Python type which only includes floats after the first row, Polars fails to infer the schema. There is also no way to increase the infer_schema_length or use the other schema inference kwargs which are present for read_csv.
This is the case for:
pl.DataFrame
pl.from_records
pl.from_dict
pl.from_dicts
Reproducible Example
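A minimal sketch of the failing calls on the version above (0.14.2); the sample data mirrors the traceback earlier in the thread:

```python
import polars as pl

# A column that starts with ints and switches to floats after the first row.
rows = [{"a": 1, "b": 2}, {"a": 3.1, "b": 4.5}]
cols = {"a": [1, 3.1], "b": [2, 4.5]}

pl.DataFrame(cols)
pl.from_records([[1, 3.1], [2, 4.5]])
pl.from_dict(cols)
pl.from_dicts(rows)
```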
These raise either TypeError: 'float' object cannot be interpreted as an integer or exceptions.ComputeError: Could not append Float64(3.1) to builder; make sure that all rows have the same schema.
Expected Behavior
Common simple cases should parse correctly without the user having to do anything. For that I recommend taking all of these steps:
Installed Versions