ion-elgreco closed this issue 9 months ago.
The implementation for this would be simple:
df = df.with_columns(pl.col(c).cast(dtype) for c, dtype in schema.items())
I'm not 100% convinced yet it's worth adding this as a DataFrame method.
The implementation for this would be simple:
df = df.with_columns(pl.col(c).cast(dtype) for c, dtype in schema.items())
I'm not 100% convinced yet it's worth adding this as a DataFrame method.
It's rather some syntactic sugar and makes it slightly more readable.
PyArrow also has the ability to do this on table level: https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.cast
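For comparison, a minimal sketch of the PyArrow route (table contents made up for illustration):

import pyarrow as pa

table = pa.table({"a": [1, 2, 3], "b": [1.0, 2.0, 3.0]})
# Table.cast takes a full target schema; every field must be covered.
target = pa.schema([("a", pa.int8()), ("b", pa.float32())])
table = table.cast(target)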
We already have this for pl.struct, which works much the same way:
import polars as pl

df = pl.DataFrame({
    'a': pl.Series([1, 2, 3], dtype=pl.Int32),
    'b': pl.Series([1.0, 2.0, 3.0], dtype=pl.Float64),
})

df2 = df.select(
    pl.struct(schema={'a': pl.UInt8, 'b': pl.UInt16})
).unnest('a')

print(df.dtypes)
print(df2.dtypes)
[Int32, Float64]
[UInt8, UInt16]
I like the idea of being able to apply a new schema onto an existing DataFrame, and I can see a lot of reasons where it could be useful: namely, where multiple processes are supplying you the same data, but may not know the intended schema. For example, you might get UInt64 from some places and Int32 from others, or pl.Datetime and pl.Date, etc. Having a simple df.cast(schema) using a predefined schema would make this code very clean.
Alternatively we could have a df.apply_schema or some such, but I think cast is obvious enough.
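For instance, a rough sketch of that multi-source scenario (source frames and column names are made up; assumes the proposed df.cast exists):

import polars as pl

SCHEMA = {"id": pl.Int64, "value": pl.Float64}

# Two producers send the same columns with drifting dtypes.
df_a = pl.DataFrame({"id": pl.Series([1, 2], dtype=pl.UInt64), "value": [1.0, 2.0]})
df_b = pl.DataFrame({"id": pl.Series([3, 4], dtype=pl.Int32), "value": [3.0, 4.0]})

# Normalising both onto the shared schema makes them safe to concat.
df = pl.concat([frame.cast(SCHEMA) for frame in (df_a, df_b)])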
I have that implementation myself; I call it astype because that's the name I'm used to:
df.astype({'date':pl.Date(), 'userId':pl.Int64()})
This is very useful when converting back and forth from pandas to polars, where the data type specifics can get lost.
# My implementation is a bit messy because it handles dates and datetimes to some extent
def _astype(df: pl.DataFrame, dtype_dict: dict[str, pl.DataType]) -> pl.DataFrame:
    # Pair each column's current dtype with its target dtype.
    current_types = df.select(pl.col(list(dtype_dict.keys()))).schema
    type_tuples = {k: (current_types[k], dtype_dict[k]) for k in dtype_dict}
    # Strings going to Date/Datetime need a parse (strptime), not a cast.
    conversion_dict = {
        col: pl.col(col).str.strptime(dtype)
        if (dtype in (pl.Date, pl.Datetime) and cdtype == pl.Utf8)
        else pl.col(col).cast(dtype)
        for col, (cdtype, dtype) in type_tuples.items()
    }
    return df.with_columns(**conversion_dict)
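A quick usage sketch of the helper above (contents made up; assumes strptime can infer the ISO8601 format when none is given):

import polars as pl

df = pl.DataFrame({"date": ["2021-01-01"], "userId": ["42"]})
df = _astype(df, {"date": pl.Date, "userId": pl.Int64})
print(df.schema)  # date: Date, userId: Int64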
Right, I think this makes sense. Could exist on both DataFrame and LazyFrame. I like the name cast for this.

I guess we would accept as input a dictionary of name -> dtype, and partial schemas would also be accepted.
Sample test:
from datetime import date, datetime

import polars as pl
from polars.testing import assert_frame_equal


def test_df_cast() -> None:
    df = pl.DataFrame(
        {"a": [1, 2], "b": [3.0, 4.0], "c": [date(2022, 1, 1), date(2022, 1, 2)]}
    )
    schema = {"a": pl.Int8, "c": pl.Datetime}
    result = df.cast(schema)
    expected = pl.DataFrame(
        {
            "a": [1, 2],
            "b": [3.0, 4.0],
            "c": [datetime(2022, 1, 1), datetime(2022, 1, 2)],
        },
        schema_overrides=schema,
    )
    assert_frame_equal(result, expected)
I like this, though note that support for pl.Utf8 -> Date/Datetime would be very good to have - a lot of APIs return dates as strings.
def test_df_cast() -> None:
    df = pl.DataFrame({
        "a": [1, 2],
        "b": [3.0, 4.0],
        "c": [date(2022, 1, 1), date(2022, 1, 2)],
        "d": ['2022-01-01', '2022-01-02'],
    })
    schema = {"a": pl.Int8, "c": pl.Datetime, "d": pl.Date}
    result = df.cast(schema)
    expected = pl.DataFrame(
        {
            "a": [1, 2],
            "b": [3.0, 4.0],
            "c": [datetime(2022, 1, 1), datetime(2022, 1, 2)],
            "d": [date(2022, 1, 1), date(2022, 1, 2)],
        },
        schema_overrides=schema,
    )
    assert_frame_equal(result, expected)
DataFrame.cast should not do anything that Expr.cast does not also do. We can consider expanding Expr.cast to be able to cast strings to dates, but that's a different topic altogether.
That makes sense to me.
Casting strings to dates has been discussed a lot. I think @ritchie46 is pretty strongly against it. I think supporting ISO8601 strings as dates would make using dates in polars way easier.
I like this, though note that support for pl.Utf8 -> Date/Datetime would be very good to have - a lot of APIs return dates as strings.
The problem there is that there is no one string format to cast from - it's not really a cast, it's a parse/translation (eg: strptime) with all kinds of corner cases, ambiguous formats, etc.
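For example, the explicit route looks something like this today (a minimal sketch; the format string is an assumption about the incoming data):

import polars as pl

df = pl.DataFrame({"d": ["2022-01-01", "2022-01-02"]})
# The caller states the format explicitly, so nothing has to be guessed.
df = df.with_columns(pl.col("d").str.strptime(pl.Date, "%Y-%m-%d"))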
Casting strings to dates has been discussed a lot. I think @ritchie46 is pretty strongly against it. I think supporting ISO8601 strings as dates would make using dates in polars way easier.
I'm with @ritchie46 on this one... it sounds reasonable to start with, but it usually leads to a bad place and/or unnecessarily complex internals and "guess what I mean" functions, when the answer is just "if you want to pass in dates, pass in dates" :)
but it usually leads to a bad place and/or unnecessarily complex internals
If you require ISO8601 format, there is zero ambiguity and it's fairly easy. It does introduce a lot of `if str` branches when dealing with dates, but it's not that bad, and it's such a common usage that I think we should support it. But I won't die on this hill.
Please move that discussion to the new issue opened by @mkleinbort-ic
Given some of the use cases of this, like casting to a schema you already have from another dataframe, could we also add the ability to pass in a dataframe, i.e. df1.cast(df2) as short-hand for df1.cast(df2.schema)?

I guess the only question is what you do if the column names don't match, but if we are flexible on doing a "partial cast" with passing a schema as a dict, it could be argued the same should apply for a dataframe input? That is, keep the same columns as the original, but only cast dtypes.
could we also add the ability to pass in a dataframe
@zundertj that's a good idea. I would recommend using the like keyword, as this is often used in other libraries when you want one thing to look like another: df1.cast(like=df2)
I don't like that at all. Just be explicit and pass the schema. df1.cast(like=df2) isn't any shorter or more readable than df1.cast(df2.schema).

Let's not make things more complicated than they need to be.
Hi there, could I have a go at this one?
Sure, looking forward to the PR!
Hi again. First contribution for me, so I've been going through the relevant code bit by bit and making good progress I think.
In terms of this change, I'm wondering whether

def cast(self, new_dtypes: dict[str, pl.DataType]) -> DataFrame

is more descriptive than

def cast(self, schema: dict[str, pl.DataType]) -> DataFrame

If schema is the argument name, it might give the impression that we are applying a new schema to the df which, for example, could look something like this:
>>> df = pl.DataFrame(
...     {
...         "apple": [1, 2, 3],
...         "banana": [6, 7, 8],
...         "orange": ["a", "b", "c"],
...     }
... )
>>> df
shape: (3, 3)
┌───────┬────────┬────────┐
│ apple ┆ banana ┆ orange │
│ ---   ┆ ---    ┆ ---    │
│ i64   ┆ i64    ┆ str    │
╞═══════╪════════╪════════╡
│ 1     ┆ 6      ┆ a      │
│ 2     ┆ 7      ┆ b      │
│ 3     ┆ 8      ┆ c      │
└───────┴────────┴────────┘
>>> schema = {"banana": pl.String, "pear": pl.String}
>>> df.cast(schema)
shape: (3, 2)
┌────────┬──────┐
│ banana ┆ pear │
│ ---    ┆ ---  │
│ str    ┆ str  │
╞════════╪══════╡
│ 6      ┆ null │
│ 7      ┆ null │
│ 8      ┆ null │
└────────┴──────┘
But this isn't what we're proposing that df.cast will do. It will look more like this (per tests above):
>>> new_dtypes = {"banana": pl.String, "pear": pl.String}
>>> df.cast(new_dtypes)
shape: (3, 3)
┌───────┬────────┬────────┐
│ apple ┆ banana ┆ orange │
│ ---   ┆ ---    ┆ ---    │
│ i64   ┆ str    ┆ str    │
╞═══════╪════════╪════════╡
│ 1     ┆ 6      ┆ a      │
│ 2     ┆ 7      ┆ b      │
│ 3     ┆ 8      ┆ c      │
└───────┴────────┴────────┘
So here are my questions:

1. Does using def cast(self, new_dtypes: dict[str, pl.DataType]) -> DataFrame as the signature sound ok?
2. If there are columns that don't exist in new_dtypes, do you think df.cast should error or just ignore them? (pear in the second example above is ignored)
3. If multiple columns in new_dtypes cause expr.cast to throw errors, is it preferred to present these all to the user in one go so they can fix them in one go? Or should we do the easier thing and show them the first error, let them fix it, then show them the second and so on?

Hope that makes sense, thanks!
Thoughts:
Does using def cast(self, new_dtypes: dict[str, pl.DataType]) -> DataFrame as the signature sound ok?
I don't love the argument name new_dtypes - I've never come across it and it feels unfamiliar. The pandas .astype method uses the parameter dtype, and so does the polars pl.Expr.cast method.

I think the best implementation would be:

def cast(self, dtype: pl.DataType | dict[str, pl.DataType]) -> DataFrame

where, if the argument is a dictionary, it applies to the columns as specified; if the argument is just a pl.DataType, then it applies to all columns.
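A minimal sketch of that dispatch (a hypothetical free function, not the actual polars implementation):

from __future__ import annotations

import polars as pl

def cast_frame(
    df: pl.DataFrame, dtype: pl.DataType | dict[str, pl.DataType]
) -> pl.DataFrame:
    # Dict input: cast only the listed columns, leaving the rest untouched.
    if isinstance(dtype, dict):
        return df.with_columns(pl.col(c).cast(t) for c, t in dtype.items())
    # Single dtype: apply it to every column.
    return df.select(pl.all().cast(dtype))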
If there are columns that don't exist in new_dtypes, do you think df.cast should error or just ignore them? (pear in the second example above is ignored)

Not all columns in df need to be referenced in the .cast input, but all columns referenced in the .cast input MUST exist in df.
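That check could look something like this (a hypothetical helper, not the actual implementation):

import polars as pl

def check_cast_keys(df: pl.DataFrame, dtypes: dict[str, pl.DataType]) -> None:
    # A partial mapping is fine, but unknown column names should fail loudly.
    missing = set(dtypes) - set(df.columns)
    if missing:
        raise ValueError(f"columns not found in DataFrame: {sorted(missing)}")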
If multiple columns in new_dtypes cause expr.cast to throw errors, is it preferred to present these all to the user in one go so they can fix them in one go? Or should we do the easier thing and show them the first error, let them fix it, then show them the second and so on?

Either is fine, but I think one at a time will result in simpler internal code - otherwise you might find yourself pooling errors, etc... There are also perverse cases where the errors would be super annoying (imagine a 1,000 column dataframe that results in 1,000 identical errors):

pl.DataFrame({'x1': ['A'], 'x2': ['B'], ..., 'x1000': ['ALL']}).cast(pl.Int64)

In general I'd be afraid of any code that could result in arbitrarily long error messages.
The pandas .astype method uses the parameter "dtype" and so does the polars pl.Expr.cast method
I'd make a slight amendment here; the Expr version can only cast from one dtype to another - this frame-level method will potentially cast to/from n dtypes, so I'd advocate keeping the parameter name both familiar (similar to the expression-level cast) but also accurate, by pluralising it as dtypes. (It looks pretty odd to declare a singular dtype as a dict that can contain lots of dtypes.) Can make a good case for either that or schema; I'm relatively neutral as to which ;)
@alexander-beedie - in your view, should "convert all to a given type" inputs be supported? E.g.

df = pl.DataFrame({'x1': [1], 'x2': [2.0], 'x3': ['3']})
df.cast(pl.Float64)  # Equivalent to df.select(pl.all().cast(pl.Float65))
@mkleinbort-ic I think yes, but continuing to use the parameter name dtypes works. The dtypes of all the columns will be "pl.Float65", whatever that is. I'd recommend using pl.Float64 instead though :)
It's a secret Datatypo that makes the computation happen in the network card
pl.Float66 removes the Jedi overhead
@mkleinbort-ic I think yes, but continuing to use the parameter name dtypes works. The dtypes of all the columns will be "pl.Float65", whatever that is. I'd recommend using pl.Float64 instead though :)
Float65 goes to 11 - what if you need that little extra? ;)
(But more seriously, yes, the param name dtypes still works in this case; I also think that it's a convenience shortcut worth supporting).
Landing a PR tomorrow - just finished a clean implementation (including selector support), but still need to write some tests and it's late here. Will finish up tomorrow morning.

Examples:

df.cast({"foo": pl.Float32, "bar": pl.UInt8})
df.cast({cs.numeric(): pl.UInt32})
df.cast(pl.Utf8)
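For context, a runnable version of those examples (frame contents made up; assumes a polars version with DataFrame.cast and selector support):

import polars as pl
import polars.selectors as cs

df = pl.DataFrame({"foo": [1, 2], "bar": [3, 4], "ham": ["a", "b"]})
print(df.cast({"foo": pl.Float32, "bar": pl.UInt8}).schema)  # foo: Float32, bar: UInt8
print(df.cast({cs.numeric(): pl.UInt32}).schema)             # foo and bar: UInt32
print(df.cast(pl.Utf8).schema)                               # every column: Utf8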
Ah I was too slow. I'll be quicker next time! Away with family and had less time than I thought.
Ah I was too slow. I'll be quicker next time! Away with family and had less time than I thought.
Apologies! I joined in the discussion a bit late and somehow completely missed you offering to work on this one...😅
No worries, I'm in it for the learning and looking at your PR will give me lots of that.
No worries, I'm in it for the learning and looking at your PR will give me lots of that.
Apparently I learned a few things too while refining it, haha... :))
Problem description
It would be useful to add .cast as a method on the DataFrame class which takes a dictionary as input with a full or partial schema of column names and their polars dtypes.
Example:
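Something along these lines (column names illustrative):

import polars as pl

df = pl.DataFrame({"a": [1, 2], "b": ["x", "y"]})
# Cast a subset of columns with a partial schema mapping.
df = df.cast({"a": pl.UInt8})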