pola-rs / r-polars

Bring polars to R
https://pola-rs.github.io/r-polars/
Other
415 stars 36 forks source link

The `schema` argument of `as_polars_df()` is needed? #897

Open eitsupi opened 3 months ago

eitsupi commented 3 months ago

In my opinion, there is no user demand to change column names in as_polars_df(). So simply removing the schema argument and leaving only schema_override would be sufficient.

In terms of type change, the schema argument is more difficult to use than schema_override in that all columns must be specified.

_Originally posted by @eitsupi in https://github.com/pola-rs/r-polars/pull/896#discussion_r1512853370_

eitsupi commented 3 months ago

@etiennebacher What do you think about this? I was thinking of deprecating it, but I feel like we could remove it since probably very few people use this.

etiennebacher commented 3 months ago

IIUC the point of as_polars_df() is simply to have a function that is easier to find than pl$DataFrame() and that allows piping. If that's the case I don't think we should remove functionalities that are in pl$DataFrame().

eitsupi commented 3 months ago

IIUC the point of as_polars_df() is simply to have a function that is easier to find than pl$DataFrame()

I do not understand your position. This is like data.frame() and as.data.frame(), I think the purpose of the function is different. (And, pl$DataFrame() can be used after |>.)

I don't think we should remove functionalities that are in pl$DataFrame().

I mistakenly thought that you were the one who suggested the removal of this. https://github.com/pola-rs/r-polars/pull/896#discussion_r1512755195 Are you suggesting that this does not need to be removed?

etiennebacher commented 3 months ago

I mistakenly thought that you were the one who suggested the removal of this. #896 (comment)

The problem there is that the behavior of schema when column names in schema do not match column names in the data is different between pl$DataFrame() and as_polars_df():

library(polars)
options(polars.do_not_repeat_call = TRUE)

pl$DataFrame(a = 1:3, b = 4:6, schema = list(b = pl$String, y = pl$Int32))
#> Error: Execution halted with the following contexts
#>    0: In R: in $DataFrame():
#>    1: Some columns in `schema` are not in the DataFrame.

This is consistent with py-polars:

import polars as pl

pl.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [4, 5, 6],
    },
    schema={"b": pl.String, "y": pl.Int32},
)

ValueError: the given column-schema names do not match the data dictionary
library(polars)
options(polars.do_not_repeat_call = TRUE)

as_polars_df(data.frame(a = 1:3, b = 4:6), schema = list(b = pl$String, y = pl$Int32))
#> shape: (3, 2)
#> ┌─────┬─────┐
#> │ b   ┆ y   │
#> │ --- ┆ --- │
#> │ str ┆ i32 │
#> ╞═════╪═════╡
#> │ 1   ┆ 4   │
#> │ 2   ┆ 5   │
#> │ 3   ┆ 6   │
#> └─────┴─────┘

IMO this is not a good behavior and it should rather throw an error like in pl$DataFrame().

So supposing this behavior is fixed, I don't think there's a reason to remove schema from as_polars_df().

etiennebacher commented 3 months ago

By the way, the argument schema in pl$DataFrame() should also be fixed as it is equivalent to schema_overrides in py-polars for now (i.e also accepts partial schema).

This should error since not all column names are specified in schema:

library(polars)
options(polars.do_not_repeat_call = TRUE)

pl$DataFrame(a = 1:3, b = 4:6, schema = list(b = pl$String))
#> shape: (3, 2)
#> ┌─────┬─────┐
#> │ a   ┆ b   │
#> │ --- ┆ --- │
#> │ i32 ┆ str │
#> ╞═════╪═════╡
#> │ 1   ┆ 4   │
#> │ 2   ┆ 5   │
#> │ 3   ┆ 6   │
#> └─────┴─────┘
import polars as pl

pl.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [4, 5, 6],
    },
    schema={"b": pl.String}
)

ValueError: the given column-schema names do not match the data dictionary
eitsupi commented 3 months ago

This is consistent with py-polars:

I feel that is inconsistent with the behavior described in the documentation.

It says:

If you supply a list of column names that does not match the names in the underlying data, the names given here will overwrite them. The number of names given in the schema should match the underlying data dimensions.

https://github.com/pola-rs/polars/blob/4bc67a0d0f6c9a113fd6b231d0d9638e58407156/py-polars/polars/dataframe/frame.py#L214-L216

In other words, from_arrow is doing the right thing here, and the current DataFrame behavior may be a bug.

etiennebacher commented 3 months ago

I think this is already tracked here: https://github.com/pola-rs/polars/issues/14386

eitsupi commented 3 months ago

Good point.

But in any case I have no idea when to use the schema argument. Is there any reason why you can't just use schema_overrides?

etiennebacher commented 3 months ago

In my mind schema was a stricter version of schema_overrides where you needed to specify the type for all columns. But that was before you showed me this part:

If you supply a list of column names that does not match the names in the underlying data, the names given here will overwrite them. The number of names given in the schema should match the underlying data dimensions.

which I think doesn't make sense. I think the current state is quite confusing, so that's also why I think we should wait for them to clarify (hopefully quite quickly).

eitsupi commented 3 months ago

That makes sense.

If the current behavior of polars.from_arrow() in Python is as intended, I vote to remove the schema argument from as_polars_df().

etiennebacher commented 3 months ago

If the current behavior of polars.from_arrow() in Python is as intended, I vote to remove the schema argument from as_polars_df().

If so, yes. If the expected behavior is to only specify the column types, without renaming, then I don't think it should be removed.