pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Add `.cast` as DataFrame method #9959

Closed ion-elgreco closed 9 months ago

ion-elgreco commented 10 months ago

Problem description

It would be useful to add .cast as a method on the DataFrame class, taking a dictionary that maps column names (a full or partial schema) to their Polars dtypes.

Example:

schema = {"foo": pl.UInt64, "bar": pl.List}

df = df.cast(schema)
stinodego commented 10 months ago

The implementation for this would be simple:

df = df.with_columns(pl.col(c).cast(dtype) for c, dtype in schema.items())

I'm not 100% convinced yet it's worth adding this as a DataFrame method.

ion-elgreco commented 10 months ago

> The implementation for this would be simple:
>
> df = df.with_columns(pl.col(c).cast(dtype) for c, dtype in schema.items())
>
> I'm not 100% convinced yet it's worth adding this as a DataFrame method.

It's mostly syntactic sugar, but it makes the code slightly more readable.

PyArrow also has the ability to do this on table level: https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.cast
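For reference, a minimal sketch of the PyArrow version (illustrative; note that Table.cast takes a complete target schema rather than a partial mapping):

import pyarrow as pa

tbl = pa.table({"foo": [1, 2], "bar": [0.5, 1.5]})

# Table.cast requires the full target schema, so every column is listed
target = pa.schema([("foo", pa.uint64()), ("bar", pa.float32())])
tbl2 = tbl.cast(target)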

mcrumiller commented 10 months ago

We already have this for pl.struct, which works much the same way:

import polars as pl

df = pl.DataFrame({
    'a': pl.Series([1, 2, 3], dtype=pl.Int32),
    'b': pl.Series([1.0, 2.0, 3.0], dtype=pl.Float64)
})

df2 = df.select(
    pl.struct(schema={'a': pl.UInt8, 'b': pl.UInt16})
).unnest('a')

print(df.dtypes)   # [Int32, Float64]
print(df2.dtypes)  # [UInt8, UInt16]

I like the idea of being able to apply a new schema onto an existing DataFrame, and I can see a lot of cases where it could be useful: namely, where multiple processes are supplying you the same data but may not know the intended schema. For example, you might get UInt64 from some places and Int32 from others, or pl.Datetime and pl.Date, etc. Having a simple df.cast(schema) with a predefined schema would make this code very clean.

Alternatively we could have a df.apply_schema or some such, but I think cast is obvious enough.
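For instance, a sketch of that normalization pattern using today's with_columns workaround as a stand-in for the proposed method (the schema and helper name here are hypothetical):

import polars as pl

# Hypothetical canonical schema shared by every upstream producer
SCHEMA = {"id": pl.UInt64, "ts": pl.Datetime, "value": pl.Float64}

def normalize(df: pl.DataFrame) -> pl.DataFrame:
    # Stand-in for the proposed df.cast(SCHEMA); skips columns the frame lacks
    return df.with_columns(
        pl.col(c).cast(dtype) for c, dtype in SCHEMA.items() if c in df.columns
    )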

mkleinbort-ic commented 10 months ago

I have that implementation myself, I call it astype because that's the name I'm used to:

df.astype({'date':pl.Date(), 'userId':pl.Int64()})

This is very useful when converting back and forth from pandas to polars, where the data type specifics can get lost.

# My implementation is a bit messy because it handles dates and datetimes to some extent
def _astype(df: pl.DataFrame, dtype_dict: dict[str, pl.DataType]) -> pl.DataFrame:
    current_types = df.select(pl.col(list(dtype_dict.keys()))).schema

    type_tuples = {k: (current_types[k], dtype_dict[k]) for k in dtype_dict.keys()}

    conversion_dict = {
        col: pl.col(col).cast(dtype)
        if (dtype not in [pl.Date, pl.Datetime] or cdtype != pl.Utf8())
        else pl.col(col).str.strptime(dtype)
        for col, (cdtype, dtype) in type_tuples.items()
    }

    return df.with_columns(**conversion_dict)
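For example, applied to a hypothetical frame where everything arrived as strings:

df = pl.DataFrame({"date": ["2022-01-01"], "userId": ["1"]})
df = _astype(df, {"date": pl.Date, "userId": pl.Int64})
# "date" goes through the strptime branch; "userId" is a plain Utf8 -> Int64 cast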
stinodego commented 10 months ago

Right, I think this makes sense. Could exist on both DataFrame and LazyFrame. I like the name cast for this.

I guess we would accept a dictionary of name -> dtype as input, with partial schemas also allowed.

Sample test:

from datetime import date, datetime

import polars as pl
from polars.testing import assert_frame_equal

def test_df_cast() -> None:
    df = pl.DataFrame(
        {"a": [1, 2], "b": [3.0, 4.0], "c": [date(2022, 1, 1), date(2022, 1, 2)]}
    )
    schema = {"a": pl.Int8, "c": pl.Datetime}

    result = df.cast(schema)

    expected = pl.DataFrame(
        {
            "a": [1, 2],
            "b": [3.0, 4.0],
            "c": [datetime(2022, 1, 1), datetime(2022, 1, 2)],
        },
        schema_overrides=schema,
    )
    assert_frame_equal(result, expected)
mkleinbort-ic commented 10 months ago

I like this, though note that support for pl.Utf8 -> Date/Datetime would be very good to have - a lot of APIs return dates as strings.

def test_df_cast() -> None:
    df = pl.DataFrame({
          "a": [1, 2], 
          "b": [3.0, 4.0], 
          "c": [date(2022, 1, 1), date(2022, 1, 2)], 
          "d": ['2022-01-01', '2022-01-02']
        })

    schema = {"a": pl.Int8, "c": pl.Datetime, "d": pl.Date}

    result = df.cast(schema)

    expected = pl.DataFrame(
        {
            "a": [1, 2],
            "b": [3.0, 4.0],
            "c": [datetime(2022, 1, 1), datetime(2022, 1, 2)],
            "d": [pl.date(2022, 1, 1), pl.date(2022, 1, 2)],
        },
        schema_overrides=schema,
    )
    assert_frame_equal(result, expected)
stinodego commented 10 months ago

DataFrame.cast should not do anything that Expr.cast does not also do. We can consider expanding Expr.cast to be able to cast strings to dates, but that's a different topic altogether.

mkleinbort-ic commented 10 months ago

That makes sense to me.

mcrumiller commented 10 months ago

Casting strings to dates has been discussed a lot. I think @ritchie46 is pretty strongly against it. I think supporting ISO8601 strings as dates would make using dates in polars way easier.

alexander-beedie commented 10 months ago

> I like this, though note that support for pl.Utf8 -> Date/Datetime would be very good to have - a lot of APIs return dates as strings.

The problem there is that there is no single string format to cast from - it's not really a cast, it's a parse/translation (e.g. strptime) with all kinds of corner cases, ambiguous formats, etc.

> Casting strings to dates has been discussed a lot. I think @ritchie46 is pretty strongly against it. I think supporting ISO8601 strings as dates would make using dates in polars way easier.

I'm with @ritchie46 on this one... it sounds reasonable to start with, but it usually leads to a bad place and/or unnecessarily complex internals and "guess what I mean" functions, when the answer is just "if you want to pass in dates, pass in dates" :)

mcrumiller commented 10 months ago

> but it usually leads to a bad place and/or unnecessarily complex internals

If you require ISO8601 format, there is zero ambiguity and it's fairly easy. It does introduce a lot of string-type branching when dealing with dates, but it's not that bad, and it's such a common usage that I think we should support it. But I won't die on this hill.
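For comparison, the explicit parse being argued about here looks something like this today (assuming ISO8601 strings and an explicit %Y-%m-%d format):

import polars as pl

df = pl.DataFrame({"d": ["2022-01-01", "2022-01-02"]})
df = df.with_columns(pl.col("d").str.strptime(pl.Date, "%Y-%m-%d"))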

stinodego commented 10 months ago

Please move that discussion to the new issue opened by @mkleinbort-ic

zundertj commented 10 months ago

Given some of the use cases of this, like casting to a schema you already have from another dataframe, could we also add the ability to pass in a dataframe, i.e.

df1.cast(df2)

as short-hand for

df1.cast(df2.schema)

?

I guess the only question is what you do if the column names don't match, but if we are flexible about doing a "partial cast" when passing a schema as a dict, it could be argued the same should apply for a DataFrame input? That is, keep the same columns as the original, but only cast the dtypes.

mcrumiller commented 10 months ago

> could we also add the ability to pass in a dataframe

@zundertj that's a good idea. I would recommend using the like keyword, as this is often used in other libraries when you want one thing to look like another.

df1.cast(like=df2)
stinodego commented 10 months ago

I don't like that at all. Just be explicit and pass the schema.

df1.cast(like=df2) isn't any shorter or more readable than df1.cast(df2.schema).

Let's not make things more complicated than they need to be.

itsakettle commented 10 months ago

Hi there, could I have a go at this one?

stinodego commented 10 months ago

Sure, looking forward to the PR!

itsakettle commented 9 months ago

Hi again. First contribution for me, so I've been going through the relevant code bit by bit and making good progress I think.

In terms of this change, I'm wondering: is

def cast(self, new_dtypes: dict[str, pl.DataType]) -> DataFrame

more descriptive than

def cast(self, schema: dict[str, pl.DataType]) -> DataFrame

If schema is the argument name it might give the impression that we are applying a new schema to the df which, for example, could look something like this:

>>> df = pl.DataFrame(
...     {
...         "apple": [1, 2, 3],
...         "banana": [6, 7, 8],
...         "orange": ["a", "b", "c"],
...     }
... )
>>> df
shape: (3, 3)
┌───────┬────────┬────────┐
│ apple ┆ banana ┆ orange │
│ ---   ┆ ---    ┆ ---    │
│ i64   ┆ i64    ┆ str    │
╞═══════╪════════╪════════╡
│ 1     ┆ 6      ┆ a      │
│ 2     ┆ 7      ┆ b      │
│ 3     ┆ 8      ┆ c      │
└───────┴────────┴────────┘

>>> schema = {"banana": pl.String, "pear": pl.String}
>>> df.cast(schema)
shape: (3, 2)
┌────────┬──────┐
│ banana ┆ pear │
│ ---    ┆ ---  │
│ str    ┆ str  │
╞════════╪══════╡
│ 6      ┆ null │
│ 7      ┆ null │
│ 8      ┆ null │
└────────┴──────┘

But this isn't what we're proposing that df.cast will do. It will look more like this (per tests above):

>>> new_dtypes = {"banana": pl.String, "pear": pl.String}
>>> df.cast(new_dtypes)
shape: (3, 3)
┌───────┬────────┬────────┐
│ apple ┆ banana ┆ orange │
│ ---   ┆ ---    ┆ ---    │
│ i64   ┆ str    ┆ str    │
╞═══════╪════════╪════════╡
│ 1     ┆ 6      ┆ a      │
│ 2     ┆ 7      ┆ b      │
│ 3     ┆ 8      ┆ c      │
└───────┴────────┴────────┘

So here are my questions:

1. Does using def cast(self, new_dtypes: dict[str, pl.DataType]) -> DataFrame as the signature sound ok?
2. If there are columns that don't exist in new_dtypes, do you think df.cast should error or just ignore them? (pear in the second example above is ignored)
3. If multiple columns in new_dtypes cause expr.cast to throw errors, is it preferred to present these all to the user in one go so they can fix them in one go? Or should we do the easier thing and show them the first error, let them fix it, then show them the second and so on?

Hope that makes sense, thanks!

mkleinbort-ic commented 9 months ago

Thoughts:

> Does using def cast(self, new_dtypes: dict[str, pl.DataType]) -> DataFrame as the signature sound ok?

I don't love the argument name new_dtypes - I've never come across it and it feels unfamiliar. The pandas .astype method uses the parameter dtype, and so does the polars pl.Expr.cast method.

I think the best implementation would be:

def cast(self, dtype: pl.DataType | dict[str, pl.DataType]) -> DataFrame

where, if the argument is a dictionary, it applies to the columns as specified, and if it is just a pl.DataType, it applies to all columns.
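A minimal sketch of that dispatch, written as a free function (cast_frame is a hypothetical name; the real thing would be a DataFrame method):

import polars as pl

def cast_frame(
    df: pl.DataFrame, dtype: pl.DataType | dict[str, pl.DataType]
) -> pl.DataFrame:
    # Dict input casts only the named columns; a bare dtype casts every column
    if isinstance(dtype, dict):
        return df.with_columns(pl.col(c).cast(dt) for c, dt in dtype.items())
    return df.with_columns(pl.all().cast(dtype))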

> If there are columns that don't exist in new_dtypes, do you think df.cast should error or just ignore them? (pear in the second example above is ignored)

Not all columns in df need to be referenced in the .cast input, but all columns referenced in the .cast MUST exist in df.

> If multiple columns in new_dtypes cause expr.cast to throw errors, is it preferred to present these all to the user in one go so they can fix them in one go? Or should we do the easier thing and show them the first error, let them fix it, then show them the second and so on?

Either is fine, but I think one at a time will result in simpler internal code - otherwise you might find yourself pooling errors, etc... There are also perverse cases where the errors would be super annoying (imagine a 1,000-column dataframe that results in 1,000 identical errors):

pl.DataFrame({'x1': ['A'], 'x2': ['B'], ..., 'x1000':['ALL']}).cast(pl.Int64)

In general I'd be afraid of any code that could result in arbitrarily long error messages.

alexander-beedie commented 9 months ago

> The pandas .astype method uses the parameter dtype and so does the polars pl.Expr.cast method

I'd make a slight amendment here; the Expr version can only cast from one dtype to another - this frame-level method will potentially cast to/from n dtypes, so I'd advocate keeping the parameter name both familiar (similar to the expression-level cast) but also accurate by pluralising it as dtypes. (Looks pretty odd to declare a singular dtype as a dict that can contain lots of dtypes). Can make a good case for either that or schema; I'm relatively neutral as to which ;)

mkleinbort-ic commented 9 months ago

@alexander-beedie - in your view, should "convert all to a given type" inputs be supported? E.g.:

df = pl.DataFrame({'x1': [1], 'x2':[2.0], 'x3':['3']})
df.cast(pl.Float64) # Equivalent to df.select(pl.all().cast(pl.Float65))
mcrumiller commented 9 months ago

@mkleinbort-ic I think yes, but continuing to use the parameter name dtypes works. The dtypes of all the columns will be "pl.Float65", whatever that is. I'd recommend using pl.Float64 instead though :)

mkleinbort-ic commented 9 months ago

It's a secret Datatypo that makes the computation happen in the network card

mkleinbort-ic commented 9 months ago

pl.Float66 removes the Jedi overhead

alexander-beedie commented 9 months ago

> @mkleinbort-ic I think yes, but continuing to use the parameter name dtypes works. The dtypes of all the columns will be "pl.Float65", whatever that is. I'd recommend using pl.Float64 instead though :)

Float65 goes to 11 - what if you need that little extra? ;)

(But more seriously, yes, the param name dtypes still works in this case; I also think that it's a convenience shortcut worth supporting).

alexander-beedie commented 9 months ago

Landing a PR tomorrow - just finished a clean implementation (including selector support), but still need to write some tests and it's late here. Will finish up tomorrow morning.

Examples:
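A sketch of the kind of usage this enables, assuming the dict / single-dtype / selector inputs discussed above (illustrative, not the PR's exact examples):

import polars as pl
import polars.selectors as cs

df = pl.DataFrame({"a": [1, 2], "b": [3.0, 4.0], "c": ["x", "y"]})

df.cast({"a": pl.Int8})            # partial schema: cast a single column
df.cast({cs.numeric(): pl.Int32})  # selector support: cast all numeric columns
df.cast(pl.Utf8)                   # single dtype: cast every column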

itsakettle commented 9 months ago

Ah I was too slow. I'll be quicker next time! Away with family and had less time than I thought.

alexander-beedie commented 9 months ago

> Ah I was too slow. I'll be quicker next time! Away with family and had less time than I thought.

Apologies! I joined in the discussion a bit late and somehow completely missed you offering to work on this one...😅

itsakettle commented 9 months ago

No worries, I'm in it for the learning and looking at your PR will give me lots of that.

alexander-beedie commented 9 months ago

> No worries, I'm in it for the learning and looking at your PR will give me lots of that.

Apparently I learned a few things too while refining it, haha... :))