pola-rs / r-polars

Bring polars to R
https://pola-rs.github.io/r-polars/
Other
424 stars 38 forks source link

`NA` should be automatically cast to the right type when creating a `DataFrame` with list-columns #795

Open etiennebacher opened 5 months ago

etiennebacher commented 5 months ago

Very annoying to have to specify NA_integer_ or the other variants:

library(polars)
options(polars.do_not_repeat_call = TRUE)

pl$DataFrame(x = list(1L, 2L, NA_integer_))
#> shape: (3, 1)
#> ┌───────────┐
#> │ x         │
#> │ ---       │
#> │ list[i32] │
#> ╞═══════════╡
#> │ [1]       │
#> │ [2]       │
#> │ [null]    │
#> └───────────┘

# NA should be automatically cast to the correct type
pl$DataFrame(x = list(1L, 2L, NA))
#> Error: Execution halted with the following contexts
#>    0: In R: in $DataFrame():
#>    1: When constructing polars literal from Robj
#>    2: Encountered the following error in Rust-Polars:
#>          expected Series
etiennebacher commented 4 months ago

Smaller reprex:

pl$lit(list(1L, 2L, NA))
eitsupi commented 4 months ago

Perhaps a mechanism is needed to guess the type and cast it.

I don't know how type conversion is currently done between R and Polars. Related to #425 and #592

etiennebacher commented 4 months ago

There's already a mechanism actually, @sorhawell left this comment here:

https://github.com/pola-rs/r-polars/blob/4e9160a5a3a9f87fa756e7f92932a357340ec179/src/rust/src/conversion_r_to_s.rs#L288-L305

eitsupi commented 4 months ago

I noticed that the arrow package does not support this either.

> arrow::arrow_table(x = list(1L, 2L, NA))
Error: Invalid: cannot convert

> arrow::arrow_table(x = list(1L, 2L, NA_integer_))
Table
3 rows x 1 columns
$x: list<item <int32>>
eitsupi commented 2 months ago

I think using vctrs::list_of() instead of list() is simpler solution for now. I believe this function has been widely used in the tidyverse world for years.

eitsupi commented 3 weeks ago

I was able to make this work in an implementation I am rewriting from scratch using savvy.

> as_polars_series(list(1L, 2L, NA))
shape: (3,)
Series: '' [list[i32]]
[
        [1]
        [2]
        [null]
]

I think the implementation is far cleaner now that I've made it completely branch on the S3 method.

To bring this into this repository would require a considerable rewrite of what is currently there, and I don't have the bandwidth to do it right now, but it should be possible.

as_polars_series.list <- function(x, name = NULL, ...) {
  series_list <- lapply(x, \(child) {
    if (is.null(child)) {
      NULL
    } else {
      as_polars_series(child)$`_s`
    }
  })

  PlRSeries$new_series_list(name %||% "", series_list) |>
    wrap()
}
    fn new_series_list(name: &str, values: ListSexp) -> savvy::Result<Self> {
        let series_vec: Vec<Option<Series>> = values
            .values_iter()
            .map(|value| match value.into_typed() {
                TypedSexp::Null(_) => None,
                TypedSexp::Environment(e) => {
                    let ptr = e.get(".ptr").unwrap().unwrap();
                    let r_series = <&PlRSeries>::try_from(ptr).unwrap();
                    Some(r_series.series.clone())
                }
                _ => panic!("Expected a list of Series"),
            })
            .collect();

        let dtype = series_vec
            .iter()
            .map(|s| {
                if let Some(s) = s {
                    s.dtype().clone()
                } else {
                    DataType::Null
                }
            })
            .reduce(|acc, b| try_get_supertype(&acc, &b).unwrap_or(DataType::String))
            .unwrap_or(DataType::Null);

        let casted_series_vec: Vec<Option<Series>> = series_vec
            .into_iter()
            .map(|s| {
                if let Some(s) = s {
                    Some(s.cast(&dtype).unwrap())
                } else {
                    None
                }
            })
            .collect();

        Ok(Series::new(name, casted_series_vec).into())
    }
eitsupi commented 2 weeks ago

Related change in Python Polars 1.0.0 pola-rs/polars#16939 Perhaps two functions may be needed on the Rust side for list, one to autocast and one not to autocast.

etiennebacher commented 2 weeks ago

Perhaps two functions may be needed on the Rust side for list, one to autocast and one not to autocast.

Probably but I don't think it's related to this issue. The average R user doesn't know (and I think also doesn't need to know) all the variants of NA. Passing NA in a list should work automatically, just like it does when we pass a vector.

eitsupi commented 2 weeks ago

Ultimately the problem here is that each element in R's list is not guaranteed to have the same type; Apache Arrow's list requires that all elements have the same type, so converting R's list to Arrow's list can either fail without cast or or cast.

Users who have been using dplyr for a few years would know that we had to use explicit type of NA because specifying NA in something like case_when() would fail due to a type mismatch. This has been fixed by vctrs (tidyverse/dplyr#6300), so we may be able to implement the same thing (cast NA) here by looking at the vctrs implementation.

etiennebacher commented 2 weeks ago

Users who have been using dplyr for a few years would know that we had to use explicit type of NA because specifying NA in something like case_when() would fail due to a type mismatch.

Yes but that was far from ideal and raised confusion, which is why they changed the behavior with vctrs. One way to handle that internally would probably be to get the supertype of all non-NA values and then cast NA to this supertype (don't know how to do that though).

eitsupi commented 2 weeks ago

One way to handle that internally would probably be to get the supertype of all non-NA values and then cast NA to this supertype (don't know how to do that though).

If you want to do it on the R side, you probably need to ask for help from the vctrs package.

If you want to do it on the Rust side, the following approach I took with neo-r-polars (I found the original process for this by searching in the polars repository). https://github.com/eitsupi/neo-r-polars/blob/e4f953ff1f5ea830c777af0d5211e144adeceedf/src/rust/src/series/construction.rs#L76-L97

My concern is that this is extra processing being performed to find the supertype, and that if even one string is mixed in the list, the whole thing becomes string type. I imagine cases where implicit casts are not desired.

etiennebacher commented 2 weeks ago

I was thinking doing it on the Rust side.

if even one string is mixed in the list, the whole thing becomes string type. I imagine cases where implicit casts are not desired.

That's where the strict parameter is needed. To clarify, I do think this parameter is important, I just think it should ignore NA and therefore is unrelated to this issue. Here's the current behavior with the beta of py-polars 1.0:

>>> pl.Series([[1, 2, "a"]], strict=True) # error

>>> pl.Series([[1, 2, "a"]], strict=False)
shape: (1,)
Series: '' [list[str]]
[
        ["1", "2", "a"]
]
eitsupi commented 2 weeks ago

I guess it depends on how far we extend the "ignore NA" thing. For example, is it only length 1, or is the length arbitrary and all elements are polars' null?

I can also create a vector of length 0 in R, which in Polars would be a Series of length 0 with an explicit type, how should this ideally work?

> polars::as_polars_series(list(character(), integer()))
Error: Execution halted with the following contexts
   0: In R: in as_polars_series():
   0: During function call [polars::as_polars_series(list(character(), integer()))]
   1: Encountered the following error in Rust-Polars:
        When building series from R list; some parsed sub-elements did not match: One element was str and another was i32
eitsupi commented 2 weeks ago

In my opinion NULL should be used to represent missing values in R's list, and currently NULL works properly in Polars as well, being a Series of type null. is there any reason why you would want to use NA instead of NULL?

> polars::as_polars_series(list(1, 2, NULL, 3))
polars Series: shape: (4,)
Series: '' [list[f64]]
[
        [1.0]
        [2.0]
        []
        [3.0]
]

Edit: The implementation of neo-r-polars would look like this. Perhaps this is preferable?

> as_polars_series(list(1, 2, NULL, 3))
shape: (4,)
Series: '' [list[f64]]
[
        [1.0]
        [2.0]
        null
        [3.0]
]