Open eitsupi opened 3 months ago
@etiennebacher What do you think about this? I was thinking of deprecating it, but I feel like we could remove it since probably very few people use this.
IIUC the point of as_polars_df()
is simply to have a function that is easier to find than pl$DataFrame()
and that allows piping. If that's the case I don't think we should remove functionalities that are in pl$DataFrame()
.
IIUC the point of
as_polars_df()
is simply to have a function that is easier to find thanpl$DataFrame()
I do not understand your position. This is like data.frame()
and as.data.frame()
, I think the purpose of the function is different.
(And, pl$DataFrame()
can be used after |>
.)
I don't think we should remove functionalities that are in
pl$DataFrame()
.
I mistakenly thought that you were the one who suggested the removal of this. https://github.com/pola-rs/r-polars/pull/896#discussion_r1512755195 Are you suggesting that this does not need to be removed?
I mistakenly thought that you were the one who suggested the removal of this. #896 (comment)
The problem there is that the behavior of schema
when column names in schema
do not match column names in the data is different between pl$DataFrame()
and as_polars_df()
:
pl$DataFrame()
it errors:library(polars)
options(polars.do_not_repeat_call = TRUE)
pl$DataFrame(a = 1:3, b = 4:6, schema = list(b = pl$String, y = pl$Int32))
#> Error: Execution halted with the following contexts
#> 0: In R: in $DataFrame():
#> 1: Some columns in `schema` are not in the DataFrame.
This is consistent with py-polars:
import polars as pl
pl.DataFrame(
{
"a": [1, 2, 3],
"b": [4, 5, 6],
},
schema={"b": pl.String, "y": pl.Int32},
)
ValueError: the given column-schema names do not match the data dictionary
as_polars_df()
it automatically renames by position:library(polars)
options(polars.do_not_repeat_call = TRUE)
as_polars_df(data.frame(a = 1:3, b = 4:6), schema = list(b = pl$String, y = pl$Int32))
#> shape: (3, 2)
#> ┌─────┬─────┐
#> │ b ┆ y │
#> │ --- ┆ --- │
#> │ str ┆ i32 │
#> ╞═════╪═════╡
#> │ 1 ┆ 4 │
#> │ 2 ┆ 5 │
#> │ 3 ┆ 6 │
#> └─────┴─────┘
IMO this is not a good behavior and it should rather throw an error like in pl$DataFrame()
.
So supposing this behavior is fixed, I don't think there's a reason to remove schema
from as_polars_df()
.
By the way, the argument schema
in pl$DataFrame()
should also be fixed as it is equivalent to schema_overrides
in py-polars for now (i.e also accepts partial schema).
This should error since not all column names are specified in schema
:
library(polars)
options(polars.do_not_repeat_call = TRUE)
pl$DataFrame(a = 1:3, b = 4:6, schema = list(b = pl$String))
#> shape: (3, 2)
#> ┌─────┬─────┐
#> │ a ┆ b │
#> │ --- ┆ --- │
#> │ i32 ┆ str │
#> ╞═════╪═════╡
#> │ 1 ┆ 4 │
#> │ 2 ┆ 5 │
#> │ 3 ┆ 6 │
#> └─────┴─────┘
import polars as pl
pl.DataFrame(
{
"a": [1, 2, 3],
"b": [4, 5, 6],
},
schema={"b": pl.String}
)
ValueError: the given column-schema names do not match the data dictionary
This is consistent with py-polars:
I feel that is inconsistent with the behavior described in the documentation.
It says:
If you supply a list of column names that does not match the names in the underlying data, the names given here will overwrite them. The number of names given in the schema should match the underlying data dimensions.
In other words, from_arrow
is doing the right thing here, and the current DataFrame
behavior may be a bug.
I think this is already tracked here: https://github.com/pola-rs/polars/issues/14386
Good point.
But in any case I have no idea when to use the schema
argument.
Is there any reason why you can't just use schema_overrides
?
In my mind schema
was a stricter version of schema_overrides
where you needed to specify the type for all columns. But that was before you showed me this part:
If you supply a list of column names that does not match the names in the underlying data, the names given here will overwrite them. The number of names given in the schema should match the underlying data dimensions.
which I think doesn't make sense. I think the current state is quite confusing, so that's also why I think we should wait for them to clarify (hopefully quite quickly).
That makes sense.
If the current behavior of polars.from_arrow()
in Python is as intended, I vote to remove the schema
argument from as_polars_df()
.
If the current behavior of polars.from_arrow() in Python is as intended, I vote to remove the schema argument from as_polars_df().
If so, yes. If the expected behavior is to only specify the column types, without renaming, then I don't think it should be removed.
In my opinion, there is no user demand to change column names in
as_polars_df()
. So simply removing theschema
argument and leaving onlyschema_override
would be sufficient.In terms of type change, the
schema
argument is more difficult to use thanschema_override
in that all columns must be specified._Originally posted by @eitsupi in https://github.com/pola-rs/r-polars/pull/896#discussion_r1512853370_