tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/
Other
4.79k stars 2.12k forks source link

Implement own `df_n_col()` to avoid calling `ncol()` (and `dim()`) #7049

Closed krlmlr closed 3 months ago

krlmlr commented 4 months ago

This leads to premature materialization in union_all() and others with duckplyr 0.4.0 . I'll have to work around by vendoring everything that uses ncol() directly or indirectly into duckdb.

DavisVaughan commented 4 months ago

@krlmlr I've introduced our own version of df_n_col() and blocked usage of base ncol() to try and prevent this from happening in the future.

We do still need to call base::ncol() on a matrix in a single case (that's why your commit had test failures), so I also added in mat_n_col() as well to allow us to use that on matrices as required.


For future us:

ncol() was problematic because it is implemented as dim(x)[1], but calling dim() requires knowing the number of rows (retrieved through row names info, see below), which duckplyr does not know until it fully materializes the query:

> dim.data.frame
function (x) 
c(.row_names_info(x, 2L), length(x))

duckplyr does know the length of the data frame though, so we can utilize that.

df_n_col() asserts that if x inherits from "data.frame", then it must be built on a VECSXP where the length of that VECSXP corresponds to the number of columns in the data frame. This is the same assertion/assumption that vctrs makes.

Note that dim() is generic so in theory data frame subclasses could do weird things with it, but I think the assertion above is strong enough that that should not matter. (and if a subclass did make the number of columns returned by dim() inconsistent with the underlying VECSXP length, it would likely result in breaking all of the vctrs algorithms that work on data frames, so I feel good about this)

krlmlr commented 4 months ago

Thanks for taking this on, Davis. I love the "for future us" bit!