Closed krlmlr closed 3 months ago
@krlmlr I've introduced our own version of df_n_col()
and blocked usage of base ncol()
to try and prevent this from happening in the future.
We do still need to call base::ncol()
on a matrix in a single case (that's why your commit had test failures), so I also added in mat_n_col()
as well to allow us to use that on matrices as required.
For future us:
ncol()
was problematic because it is implemented as dim(x)[1]
, but calling dim()
requires knowing the number of rows (retrieved through row names info, see below), which duckplyr does not know until it fully materializes the query:
> dim.data.frame
function (x)
c(.row_names_info(x, 2L), length(x))
duckplyr does know the length of the data frame though, so we can utilize that.
df_n_col()
asserts that if x
inherits from "data.frame"
, then it must be built on a VECSXP
where the length of that VECSXP
corresponds to the number of columns in the data frame. This is the same assertion/assumption that vctrs makes.
Note that dim()
is generic so in theory data frame subclasses could do weird things with it, but I think the assertion above is strong enough that that should not matter. (and if a subclass did make the number of columns returned by dim()
inconsistent with the underlying VECSXP length, it would likely result in breaking all of the vctrs algorithms that work on data frames, so I feel good about this)
Thanks for taking this on, Davis. I love the "for future us" bit!
This leads to premature materialization in
union_all()
and others with duckplyr 0.4.0 . I'll have to work around by vendoring everything that usesncol()
directly or indirectly into duckdb.