Closed DavisVaughan closed 12 months ago
Sounds plausible! I can spend some time to double check the reasoning more thoroughly if you think that'd be helpful here.
@lionel- I'm going to try and run cloud revdeps with r-devel, if that looks good then I don't think I need a second review (unless you want to), but thanks!
We don't have an easy way to do r-devel revdep checks, but I did an r 4.3.1 revdepcheck across vctrs, dplyr, and tidyr and did not see any issues. The two failures were ggmap related false positives that I think are related to downloading things.
I have to imagine that if this was going to cause issues, then it would also show up on released versions of R with other kinds of non-list ALTREP objects, so I'm optimistic that this is working as expected
## revdepcheck results
We checked 4313 reverse dependencies (4310 from CRAN + 3 from Bioconductor), comparing R CMD check results across CRAN and dev versions of this package.
* We saw 2 new problems
* We failed to check 6 packages
Issues with CRAN packages are summarised below.
### New problems
(This reports the first line of each new failure)
* ggquiver
checking examples ... ERROR
* SWMPrExtension
checking examples ... ERROR
### Failed to check
* bayesdfa (NA)
* loon.ggplot (NA)
* loon.shiny (NA)
* TriDimRegression (NA)
* triptych (NA)
* vivid (NA)
In https://github.com/r-lib/vctrs/pull/1151, I reverted us to a simpler ownership model.
However, I added in an idea where we unconditionally shallow duplicate ALTREP objects before we try to assign to them. I left a few comments about this in the description and in the code
I now believe I was mistaken about how ALTREP worked. In particular, the idea that we need to duplicate ALTREP objects before dereferencing them (with something like
INTEGER()
) is just wrong:Rf_shallow_duplicate()
doesn't guarantee that you get a non-ALTREP object. In particular, shallow duplication is actually a nice way to generate wrapper ALTREP objects. I also think vroom ALTREP objects can be duplicated and return another ALTREP object.DATAPTR()
should be completely safe. It forces ALTREP materialization but is then required to give you a pointer to "actual" memory representing that type. And I'm not worried about the "force materialization" idea here being expensive - since we "own" the data in the path where we would force materialization at assignment time (i.e. fromvec_init()
), the only kind of ALTREP objects we should encounter are cheap-to-materialize wrapper ALTREP objects like you get from proxying the init-ed object, like withvec_proxy.vctrs_list_of()
callingunclass()
to drop off the class attribute.So that is a description of why I don't think we need to duplicate ALTREP objects unconditionally, but why has this come up all of a sudden? In R-devel (4.4.0), ALTREP list vectors were added. That has caused this memory related test to start failing: https://github.com/r-lib/vctrs/blob/8bbd8c4a69a9b3e2c42aa752c5339f949562af96/tests/testthat/test-c.R#L578-L587
With a reprex of:
Indeed this runs much slower and allocates MUCH more memory in R-devel. It is a little complicated due to the tibble in the mix, but I can try to explain it.
make_list_of()
allocates a list of very small tibbles containing list-ofs:Note that
4e3 == 4,000
sowith_memory_prof(list_unchop(make_list_of(4e3)))
allocates a list of 4000 of these tibbles and then binds them together usinglist_unchop()
. To do this:vec_init()
allocates the result data frame containing the list-of column and recursively proxies it (so the list-of column is proxied too)vec_proxy.vctrs_list_of()
is called, which only doesunclass(x)
, but this makes an ALTREP wrapper version of the list-of column in R-devel (as long as the object has "enough" elements, and 4k is enough).vec_c()
. On the first iteration we dig intodf_assign()
, extract the proxy ALTREP list-of column we are going to assign to, and then go intovec_proxy_assign_opts()
to assign into itvec_proxy_assign_opts(<proxy-altrep-list-of>, 0, <first-list-of>)
we try to assign the first list-of element into theproxy
by going throughlist_assign()
. But this guards the assignment withvec_clone_referenced(proxy, owned)
. Sinceproxy
is an ALTREP list-of, this GETS SHALLOW DUPLICATED and a fresh ALTREP wrapper is made, and then atSET_VECTOR_ELT()
time a full duplication is made so we can assign to it, even though we fully own it!proxy
returned from that iteration ofvec_proxy_assign_opts()
is also an ALTREP wrapper, we go through this on every iteration, making 4k duplicates ofproxy
in total.In this PR, we change this cycle by not doing a shallow duplication in
vec_clone_referenced()
because we own theproxy
ALTREP list-of. When we callSET_VECTOR_ELT()
the first time, this may do 1 full duplication on the first iteration (not sure) to materialize the wrapper ALTREP's data, but after that it is treated like a non-ALTREP object for every other iteration because data2 is filled out and we don't get this explosive duplication.