pola-rs / r-polars

Bring polars to R
https://pola-rs.github.io/r-polars/

feat: export/import DataFrame as raw vector #1072

Closed eitsupi closed 2 weeks ago

eitsupi commented 2 weeks ago

Functions originally implemented on the Rust side for use in inter-process communication are refactored and made available on the R side. This is useful when implementing asynchronous processing in R (use with the mirai package for example).

I am considering what function should be used on the R side.

Python Polars uses write_ipc and read_ipc for both cases, switching between in-memory data and a file depending on the first argument. I feel it is better to share a single function for reading but to have a separate function for writing, as in the R arrow package.

shikokuchuo commented 2 weeks ago

In general I prefer separate functions, as you've implemented. Otherwise, add a separate 'file' argument for writing to a file; then you just need a missing() check rather than costly type inference.

This kind of thing makes sense for generics, but when the arguments are different types, I find it cleaner not to overload.
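The 'file' argument idea above can be sketched in base R. This is only an illustration under stated assumptions: write_bytes() is a hypothetical helper, not part of polars, and base serialize() stands in for Arrow IPC encoding.

```r
# Hypothetical helper (not a polars function): encode an R object and either
# return the bytes or write them to disk, decided by whether `file` is given.
write_bytes <- function(x, file) {
  # base R: with connection = NULL, serialize() returns a raw vector
  bytes <- serialize(x, connection = NULL)

  if (missing(file)) {
    # No destination supplied: hand the raw vector back to the caller.
    return(bytes)
  }

  # Destination supplied: write the bytes and return the path invisibly.
  writeBin(bytes, file)
  invisible(file)
}

raw_out <- write_bytes(mtcars)                            # in-memory result
path <- write_bytes(mtcars, tempfile(fileext = ".bin"))   # written to a file
```

The missing() check only asks whether the argument was supplied, so the function never has to inspect the value to guess what the caller meant.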

eitsupi commented 2 weeks ago

@shikokuchuo Thank you for your comment.

I agree with you, but I think there are cases here where it is better to focus on consistency with other language APIs in Polars and with other packages in R.

  1. In Python Polars, polars.read_ipc() and others change their behavior depending on the argument type.
  2. Popular read functions in R, such as data.table::fread(), readr::read_csv(), and arrow::read_ipc_file(), can interpret a vector directly as file contents in addition to accepting file paths.

In other words, in both Polars and R, it seems that it is acceptable to have different behavior for read functions depending on the argument type (although these functions are not generic functions, of course).

So I am thinking that converting a raw vector of Arrow IPC to a DataFrame should be handled by extending pl$read_ipc(), rather than by adding a separate function.
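A read function that changes behavior with the argument type might look like the following base R sketch. read_bytes() is a hypothetical name, and base unserialize() stands in for Arrow IPC decoding; this is not how pl$read_ipc() is implemented.

```r
# Hypothetical reader (not the polars implementation): accept either a raw
# vector holding the encoded data or a path to a file containing it.
read_bytes <- function(source) {
  if (is.raw(source)) {
    # Raw vector: decode it directly.
    return(unserialize(source))
  }
  if (is.character(source) && length(source) == 1L) {
    # Single string: treat it as a file path and read the whole file first.
    bytes <- readBin(source, what = "raw", n = file.size(source))
    return(unserialize(bytes))
  }
  stop("`source` must be a raw vector or a single file path")
}

bytes <- serialize(iris, connection = NULL)
path <- tempfile(fileext = ".bin")
writeBin(bytes, path)

from_raw <- read_bytes(bytes)    # decoded from memory
from_file <- read_bytes(path)    # decoded from disk
```

Unlike a missing() check, this approach has to inspect the argument, but the test is a cheap type check, and a raw vector can never be mistaken for a path.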

shikokuchuo commented 2 weeks ago

That's fine. My comment is just 'in general' - if there are more important considerations such as API consistency as you point out, don't let me stop you!

eitsupi commented 2 weeks ago

It seems to work fine with mirai 1.0.0 (released a few hours ago!).

This example is from https://shikokuchuo.net/mirai/articles/databases.html#database-hosting---using-arrow-database-connectivity

library(mirai)

daemons(1)
#> [1] 1

everywhere({
  library(DBI)
  con <<- dbConnect(adbi::adbi("adbcsqlite"), uri = ":memory:")
})

serialization(
  refhook = list(\(x) polars::as_polars_df(x)$to_raw_ipc(future = TRUE),
                polars::pl$read_ipc),
  class = "nanoarrow_array_stream"
)

m <- mirai(dbWriteTableArrow(con, "iris", iris))
call_mirai(m)$data
#> [1] TRUE

m <- mirai(dbReadTableArrow(con, "iris"))
call_mirai(m)$data
#> shape: (150, 5)
#> ┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
#> │ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species   │
#> │ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       │
#> │ f64          ┆ f64         ┆ f64          ┆ f64         ┆ str       │
#> ╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
#> │ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa    │
#> │ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa    │
#> │ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa    │
#> │ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa    │
#> │ 5.0          ┆ 3.6         ┆ 1.4          ┆ 0.2         ┆ setosa    │
#> │ …            ┆ …           ┆ …            ┆ …           ┆ …         │
#> │ 6.7          ┆ 3.0         ┆ 5.2          ┆ 2.3         ┆ virginica │
#> │ 6.3          ┆ 2.5         ┆ 5.0          ┆ 1.9         ┆ virginica │
#> │ 6.5          ┆ 3.0         ┆ 5.2          ┆ 2.0         ┆ virginica │
#> │ 6.2          ┆ 3.4         ┆ 5.4          ┆ 2.3         ┆ virginica │
#> │ 5.9          ┆ 3.0         ┆ 5.1          ┆ 1.8         ┆ virginica │
#> └──────────────┴─────────────┴──────────────┴─────────────┴───────────┘

Created on 2024-05-03 with reprex v2.1.0

shikokuchuo commented 2 weeks ago

Great! I can confirm that the above works.

For me this also works:

serialization(
  refhook = list(\(x) polars::as_polars_df(x)$to_raw_ipc(),
                polars::pl$read_ipc),
  class = "nanoarrow_array_stream"
)

I didn't find the documentation for what that future = TRUE argument means.

Apart from this, is there scope to make it even more ergonomic to have a direct counterpart to the read_ipc() method so there doesn't need to be an anonymous function?

I think once nanoarrow implements its own serialization features, it should also behave like this i.e. not require function(x) ...

eitsupi commented 2 weeks ago

> I didn't find the documentation for what that future = TRUE argument means.

Polars introduced the StringView type earlier than other Arrow implementations and uses it as the default string type, which can cause problems when passing Arrow IPC to other Arrow implementations. For example, nanoarrow does not yet implement the StringView type (apache/arrow-nanoarrow#367).

Also, the arrow package does not support converting this to R as of version 15, so errors occur when loading it.

df <- polars::pl$DataFrame(string = "foo")

df$to_raw_ipc(future = FALSE) |>
  arrow::read_ipc_file()
#>   string
#> 1    foo

df$to_raw_ipc(future = TRUE) |>
  arrow::read_ipc_file()
#> Error: cannot handle Array of type <utf8_view>

Created on 2024-05-03 with reprex v2.1.0

This is not a problem when exchanging data between Polars sessions, and it should give a slight performance increase because no extra conversions are needed.

> Apart from this, is there scope to make it even more ergonomic to have a direct counterpart to the read_ipc() method so there doesn't need to be an anonymous function?

I too think something like that would be worth adding, but there is no consensus yet. Here is a recent discussion. etiennebacher/tidypolars#111

shikokuchuo commented 2 weeks ago

Thanks for exposing these anyway - I think they will be useful for your users if they want to work with parallel / distributed computing. I will add something to the mirai docs early next week.

Or once this is merged / released actually :)

eitsupi commented 2 weeks ago

I will merge this for now. If function names need to be changed, I believe they can be changed later.