lionel- closed this 4 years ago
Merging #38 into master will increase coverage by 0.31%. The diff coverage is 96.00%.
```
@@            Coverage Diff             @@
##           master      #38      +/-   ##
==========================================
+ Coverage   93.99%   94.30%   +0.31%
==========================================
  Files           8       10       +2
  Lines         333      369      +36
==========================================
+ Hits          313      348      +35
- Misses         20       21       +1
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| R/misc.R | 82.92% <ø> (-0.95%) | :arrow_down: |
| R/tidyverse.R | 95.12% <95.12%> (ø) | |
| R/init.R | 100.00% <100.00%> (ø) | |
| R/ops.R | 100.00% <100.00%> (ø) | |
| R/utils.R | 100.00% <100.00%> (ø) | |
| R/errors.R | 95.45% <0.00%> (+0.45%) | :arrow_up: |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Oh, and add yourself as a contributor, please!
hmm I don't think we can use the length to figure this out. For instance if you reverse a vector it will have the same length but the correlations will be wrong if they are not recomputed properly. Maybe one thing that could help is to include some kind of identity vector in the proxy that assigns a unique identifier to the values. This seems like a hard problem though.
> Oh, and add yourself as a contributor, please!

Do you mean in DESCRIPTION? or in NEWS?
> > Oh, and add yourself as a contributor, please!
>
> Do you mean in DESCRIPTION? or in NEWS?

DESCRIPTION.
> hmm I don't think we can use the length to figure this out. For instance if you reverse a vector it will have the same length but the correlations will be wrong if they are not recomputed properly. Maybe one thing that could help is to include some kind of identity vector in the proxy that assigns a unique identifier to the values. This seems like a hard problem though.
Yeap, it was a simplification to explain the idea behind the identifier, but probably looking at the length is not a good indicator. In fact, reversing a vector currently doesn't preserve the correlations either, because `rev.default` uses `[` behind the scenes and I didn't consider these subsetting use cases.

I think I need a trace of calls to really understand how {dplyr} and {vctrs} interact. Do you know a better way of doing this than calling `base::trace` for every single function inside both packages and then running e.g. a mutate?
dplyr uses vctrs operations like `vec_match()`, `vec_c()`, `vec_unchop()`, `vec_slice()`, etc. I have started to document the internal dependencies of each operation in special "Dependencies" sections, see https://vctrs.r-lib.org/reference/vec_count.html#dependencies for an example. By following the deps links recursively, you can get an overview of how the operations work. There is also https://vctrs.r-lib.org/reference/reference-faq-compatibility.html that might be useful.
Here you go @Enchufa2
Would you like to do the {units} package? I'd be happy to send a PR as well for units (and quantities)!
One more question. I see:

```r
library(errors)
library(dplyr, warn.conflicts = FALSE)

iris.q <- head(iris)
for (i in 1:4)
  errors(iris.q[, i]) <- iris.q[, i] * 0.05

tracer <- quote({
  call <- try(match.call(), silent = TRUE)
  .trace.ind <<- .trace.ind + 1
  if (!inherits(call, "try-error")) {
    indent <- paste0(rep(" ", .trace.ind), collapse = "")
    line <- paste0(indent, as.character(call))
    cat(line, "\n")
  }
})
exit <- quote({ .trace.ind <<- .trace.ind - 1 })

pkgs <- c("dplyr", "vctrs", "errors")
for (pkg in pkgs) {
  where <- asNamespace(pkg)
  funs <- ls(where)
  funs <- funs[sapply(funs, function(x) is.function(get(x, where)))]
  trace(funs, tracer = tracer, exit = exit, print = FALSE, where = where)
  #untrace(funs, where = where)
}
#> [snip]

.trace.ind <- 0
res <- iris.q %>%
  mutate(another = Sepal.Length + Sepal.Width)
#> mutate.data.frame . Sepal.Length + Sepal.Width
#> group_rows data
#> group_data .data
#> group_data.data.frame .data
#> new_list_of list(seq_len(nrow(.data))) integer()
#> vec_is_list x
#> vec_size ptype
#> validate_names .data
#> names_all_or_nothing nms
#> vec_set_attributes .data attrib
#> new_data_frame list(.rows = rows) 1
#> local_mask self frame
#> context_local mask x frame
#> context_poke name value
#> group_keys data
#> group_keys.data.frame data
#> group_data .tbl
#> group_data.data.frame .tbl
#> new_list_of list(seq_len(nrow(.data))) integer()
#> vec_is_list x
#> vec_size ptype
#> validate_names .data
#> names_all_or_nothing nms
#> vec_set_attributes .data attrib
#> new_data_frame list(.rows = rows) 1
#> group_vars data
#> group_vars.data.frame data
#> group_data x
#> group_data.data.frame x
#> new_list_of list(seq_len(nrow(.data))) integer()
#> vec_is_list x
#> vec_size ptype
#> validate_names .data
#> names_all_or_nothing nms
#> vec_set_attributes .data attrib
#> new_data_frame list(.rows = rows) 1
#> is_grouped_df data
#> map seq_len(ncol(data)) function(.x) expr(promise_fn(!!.x))
#> map_lgl private$resolved is.null
#> map_lgl private$resolved is.null
#> cond2int !inherits(e1, "errors") !inherits(e2, "errors")
#> propagate unclass(NextMethod()) e1 e2 deriv[[1]] deriv[[2]]
#> errors x
#> errors.errors x
#> errors y
#> errors.errors y
#> covar x y
#> covar.errors x y
#> set_errors xx sqrt(var)
#> set_errors.numeric xx sqrt(var)
#> errors<- `*tmp*` c(0.309273341883842, 0.287271648444465, 0.28429737951659, 0.277353564967173, 0.308058436014987, 0.333054049667618)
#> errors<-.numeric `*tmp*` c(0.309273341883842, 0.287271648444465, 0.28429737951659, 0.277353564967173, 0.308058436014987, 0.333054049667618)
#> new_id
#> errors<-.errors x value
#> ids idx
#> ids idy
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> vec_unchop chunks rows
#> vec_proxy x
#> vec_proxy x
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> set_errors x$data x$errors
#> set_errors.numeric x$data x$errors
#> errors<- `*tmp*` numeric(0)
#> errors<-.numeric `*tmp*` numeric(0)
#> new_id
#> errors<-.errors x value
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> vec_ptype_finalise.default x
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> set_errors x$data x$errors
#> set_errors.numeric x$data x$errors
#> errors<- `*tmp*` numeric(0)
#> errors<-.numeric `*tmp*` numeric(0)
#> new_id
#> errors<-.errors x value
#> drop_errors.errors x
#> drop_errors.errors to
#> set_errors out errors(x)
#> set_errors.numeric out errors(x)
#> errors x
#> errors.errors x
#> errors<- `*tmp*` c(0.309273341883842, 0.287271648444465, 0.28429737951659, 0.277353564967173, 0.308058436014987, 0.333054049667618)
#> errors<-.numeric `*tmp*` c(0.309273341883842, 0.287271648444465, 0.28429737951659, 0.277353564967173, 0.308058436014987, 0.333054049667618)
#> new_id
#> errors<-.errors x value
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> vec_proxy x
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> set_errors x$data x$errors
#> set_errors.numeric x$data x$errors
#> errors<- `*tmp*` c(0.309273341883842, 0.287271648444465, 0.28429737951659, 0.277353564967173, 0.308058436014987, 0.333054049667618)
#> errors<-.numeric `*tmp*` c(0.309273341883842, 0.287271648444465, 0.28429737951659, 0.277353564967173, 0.308058436014987, 0.333054049667618)
#> new_id
#> errors<-.errors x value
#> map_lgl new_columns inherits rlang_zap
#> context_poke mask NULL
#> dplyr_col_modify .data cols
#> dplyr_col_modify.data.frame .data cols
#> vec_recycle_common !!!cols nrow(data)
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> dplyr_vec_data data
#> vec_data x
#> vec_assert x
#> vec_is_vector x
#> vec_proxy x
#> has_dim x
#> vec_set_attributes x list(names = names(x))
#> unset_s4 x
#> new_data_frame out nrow(x)
#> new_data_frame out nrow(data)
#> dplyr_reconstruct out data
#> dplyr_new_data_frame data
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> dplyr_reconstruct_dispatch data template
#> dplyr_reconstruct.data.frame data template
```
Is there any way to carry the "id" attribute around when `new_data_frame` is created so that it can be recovered by `vec_restore`? I think that this would be enough to keep correlations for simple cases.
IIUC I think you'd want to use the `id` of `to` in `vec_restore()`, but I think that would yield unpredictable semantics.
> IIUC I think you'd want to use the `id` of `to` in `vec_restore()`, but I think that would yield unpredictable semantics.
Yes! I'll merge this PR and I'll play a little bit before introducing that change, but it just works, and it doesn't break the summarise case. What potential problems do you see?
In theory, the object returned by `vec_proxy` should contain all the information, right? It seems that, with the current implementation, one must choose between "simple" attributes and attributes that depend on the data and have the same length as the data. This is a mixed situation in which one attribute depends on the data, but another one doesn't. Moreover, there could be attributes that depend on the data but have a different length, so a data frame would no longer be appropriate.
> Would you like to do the {units} package? I'd be happy to send a PR as well for units (and quantities)!
If you have cycles for this, yes, I would appreciate that help a lot, because it's clear that I still don't have a full grasp of this mechanism.
> This is a mixed situation in which one attribute depends on the data, but another one doesn't
If the attribute doesn't depend on the data, then in theory it's part of the type, and should be contained in the `to` argument of `vec_restore()`.
> it just works, and it doesn't break the summarise case. What potential problems do you see?
The values being restored might come from a different vector (e.g. with `vec_assign()`), might come from the same vector but with a different order, etc. In general you can't assume the restored vector contains the same data as the proxied one. The point of proxying is to manipulate the memory. So this approach will only work in the fringe case of restoring exactly the same data.

This is why I suggested including id columns in the proxy: the first column would be `rep(list(id_env), size)`, and the second would be `seq_len(size)` to identify the values. The combination would give you a robust way of detecting manipulations that invalidate the correlation. Performance-wise, this might be too onerous though.
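A rough sketch of that suggestion (hypothetical code, not what this PR implements; it assumes the existing `id` attribute plays the role of `id_env`):

```r
# Hypothetical proxy with identity columns, following the suggestion above.
vec_proxy.errors <- function(x, ...) {
  data <- drop_errors.errors(x)
  size <- length(data)
  vctrs::new_data_frame(
    list(
      data   = data,
      errors = as.double(attr(x, "errors")),
      id     = rep(list(attr(x, "id")), size), # which vector each value comes from
      loc    = seq_len(size)                   # position of each value in it
    ),
    n = size
  )
}
```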
> If you have cycles for this, yes, I would appreciate that help a lot, because it's clear that I still don't have a full grasp of this mechanism.
Sounds good. And no worries, we're still learning as well, this is research in progress :-)
> > it just works, and it doesn't break the summarise case. What potential problems do you see?
>
> The values being restored might come from a different vector (e.g. with `vec_assign()`), might come from the same vector but with a different order, etc. In general you can't assume the restored vector contains the same data as the proxied one. The point of proxying is to manipulate the memory. So this approach will only work in the fringe case of restoring exactly the same data.
>
> This is why I suggested including id columns in the proxy: the first column would be `rep(list(id_env), size)`, and the second would be `seq_len(size)` to identify the values. The combination would give you a robust way of detecting manipulations that invalidate the correlation. Performance-wise, this might be too onerous though.
What if

```diff
 vec_proxy.errors <- function(x, ...) {
   data <- drop_errors.errors(x)
   errors <- attr(x, "errors")
   # Simplifies coercion methods
   errors <- as.double(errors)
   # The `errors` are a vectorised attribute, which requires a data
   # frame proxy
-  vctrs::new_data_frame(
+  df <- vctrs::new_data_frame(
     list(data = data, errors = errors),
     n = length(data)
   )
+  attr(df, "id") <- attr(x, "id")
+  df
 }
```
Is there any guarantee that this attribute would survive to see `vec_restore`?
This is UB from the point of view of vctrs. But even if it weren't, it would also incorrectly propagate the `id`, even though there is no guarantee it will restore to the same data. If it is not the same data, the cached correlations will be invalid.
A good use case to test:

```r
dplyr::slice_sample(iris_errors, n = nrow(iris_errors))
```

The data will restore with the same size, but it will be different. Propagating the `id` is incorrect in that case.
Yeap, you're right. And here neither {dplyr} nor {vctrs} calls any {errors} method.
So basically there's missing information at `vec_restore.errors`, because `to` could be the final object (e.g., for `dplyr::mutate(iris_errors, other = Sepal.Length + Sepal.Width)`: because `Ops.errors` has been called, the `id` is already what it should be, and covariances have been propagated), or the original object (e.g., your `dplyr::slice_sample` example).

In the first case, `.errors` operations happen before `vec_proxy`, and then nothing (?) happens between `vec_proxy` and `vec_restore`. In the second case, no `.errors` methods are called, and then reordering happens between `vec_proxy` and `vec_restore` without using `[.errors`. Without knowing which situation we are in at `vec_restore`, it's basically impossible to decide whether a new `id` is required and whether covariances need to be adjusted. Ideally, `[.errors` would be called, and no further operations would be required (as in the first case).
The situation is similar to `[`:

```r
x <- set_errors(1:3, 1:3)
y <- set_errors(3:1, 3:1)
z <- x + y

correl(x, z)
#> [1] 0.3162278 0.7071068 0.9486833
correl(x[], z)
#> NULL
correl(x[seq_along(x)], z)
#> NULL
```
Yes, as I said, this particular use case is not currently covered in `[.errors`. What I meant is that it could be easily covered in `[.errors`, because all the information needed is there: 1) we know the operation (we are in fact inside the subsetting operator), and 2) we know the indices. If `[.errors` is not called, then in `vec_restore` we could know the indices (if we put them in the data frame, as you suggested), but we don't know what kind of operation was performed.
With the combination of a vector identifier and an element identifier in the proxy, you'd have all the information in `vec_restore()` as well. If all the vector identifiers are identical, it means the data was reordered or subsetted. With ALTREP, non-transformative operations on these proxies would be efficient.
By the way your tracing routine is fantastic, thanks for that! It helped my workflow twice already.
:D It was a quick test and for sure it can be done better. But if it's useful, I could make a package out of it, or send a PR to... {pryr} may be a good place?
Maybe lobstr? What do you think @hadley and @gaborcsardi?
Just did a few improvements:
```r
# Same logic as rlang backtrace formatting
call_add_namespace <- function(call, fn) {
  if (!is.call(call) || !is.symbol(call[[1]])) {
    return(call)
  }

  sym <- call[[1]]
  nm <- as.character(sym)
  if (nm %in% c("::", ":::")) {
    return(call)
  }

  env <- environment(fn)
  top <- topenv(env)

  if (identical(env, globalenv())) {
    prefix <- "global"
    op <- "::"
  } else if (isNamespace(top)) {
    prefix <- rlang::ns_env_name(top)
    if (rlang:::ns_exports_has(top, nm)) {
      op <- "::"
    } else {
      op <- ":::"
    }
  } else {
    return(call)
  }

  namespaced_sym <- call(op, as.symbol(prefix), sym)
  call[[1]] <- namespaced_sym
  call
}

trace_pkgs <- function(pkgs, max_level = Inf, ..., regexp = NULL) {
  # Avoids namespace loading issues
  lapply(pkgs, requireNamespace, quietly = TRUE)

  trace_level <- 0

  # Create a thunk because `trace()` sloppily transforms functions into calls
  tracer <- rlang::call2(function() {
    trace_level <<- trace_level + 1
    if (trace_level > max_level) {
      return()
    }

    # Work around sys.foo() weirdness
    get_fn <- rlang::call2(function(fn = sys.function(sys.parent())) fn)
    fn <- eval(get_fn, env = parent.frame())

    try(silent = TRUE, {
      call <- evalq(base::match.call(), envir = parent.frame())
      call <- call_add_namespace(call, fn)

      indent <- paste0(rep(" ", (trace_level - 1) * 2), collapse = "")
      line <- paste0(indent, rlang::as_label(call))
      cat(line, "\n")
    })
  })
  exit <- rlang::call2(function() {
    trace_level <<- trace_level - 1
  })

  if (length(regexp) == 1) {
    regexp <- rep(regexp, length(pkgs))
  }

  for (i in seq_along(pkgs)) {
    pkg <- pkgs[[i]]
    ns <- asNamespace(pkg)

    ns_fns <- names(Filter(is.function, as.list(ns)))
    if (!is.null(regexp)) {
      ns_fns <- ns_fns[grepl(regexp[[i]], ns_fns)]
    }

    suppressMessages(trace(
      ns_fns,
      tracer = tracer,
      exit = exit,
      print = FALSE,
      where = ns
    ))

    message(sprintf(
      "Tracing %d functions in %s.",
      length(ns_fns),
      pkg
    ))
  }
}
```
This ensures the calls are namespaced, and allows setting a maximum nesting level and supplying a regexp to match functions:
```r
trace_pkgs(c("vctrs", "tibble"), max_level = 2, regexp = "^(vec|new)_")
#> Tracing 459 functions in vctrs.
#> Tracing 2 functions in tibble.
library(dplyr)
invisible(mtcars %>% transmute(ts(cyl)))
#> vctrs::new_list_of(x = list(seq_len(nrow(.data))), ptype = integer())
#> vctrs::vec_is_list(x = x)
#> vctrs::vec_size(x = ptype)
#> vctrs::new_data_frame(x = list(.rows = rows), n = 1L)
#> vctrs::new_list_of(x = list(seq_len(nrow(.data))), ptype = integer())
#> vctrs::vec_is_list(x = x)
#> vctrs::vec_size(x = ptype)
#> vctrs::new_data_frame(x = list(.rows = rows), n = 1L)
#> vctrs::new_list_of(x = list(seq_len(nrow(.data))), ptype = integer())
#> vctrs::vec_is_list(x = x)
#> vctrs::vec_size(x = ptype)
#> vctrs::new_data_frame(x = list(.rows = rows), n = 1L)
#> vctrs::vec_unchop(x = chunks, indices = rows)
#> vctrs::vec_proxy(x = x)
#> vctrs::vec_proxy(x = x)
#> vctrs::vec_default_cast(...)
#> vctrs::vec_proxy(x = x)
#> vctrs::vec_slice(x = i, i = existing)
#> vctrs::vec_slice(x = value, i = existing)
#> vctrs:::vec_dim_n(x = x)
#> vctrs::vec_recycle_common(!!!cols, .size = nrow(data))
#> vctrs::vec_data(x = x)
#> vctrs::vec_assert(x = x)
#> vctrs::vec_proxy(x = x)
#> vctrs:::vec_set_attributes(x = x, attrib = list(names = names(x)))
#> vctrs::new_data_frame(x = out, n = nrow(x))
#> vctrs::new_data_frame(x = out, n = nrow(data))
#> vctrs::new_list_of(x = list(seq_len(nrow(.data))), ptype = integer())
#> vctrs::vec_is_list(x = x)
#> vctrs::vec_size(x = ptype)
#> vctrs::new_data_frame(x = list(.rows = rows), n = 1L)
#> vctrs::vec_as_location(i = loc, n = ncol(.data), names = names(.data))
```
Great work! I'm gonna use this a lot.
Another potential improvement: optionally, the `exit` expression could call `returnValue()` to print some info about the returned value.
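For instance, a rough sketch (hypothetical, reusing the `trace_level` counter from `trace_pkgs()` above):

```r
# Hypothetical exit handler that also prints the traced function's result.
exit <- rlang::call2(function() {
  trace_level <<- trace_level - 1

  # returnValue() only yields a value on a regular return; the sentinel
  # default lets us skip non-local exits (errors, restarts).
  sentinel <- quote(.__trace_no_return__.)
  value <- returnValue(default = sentinel)
  if (!identical(value, sentinel)) {
    indent <- paste0(rep(" ", trace_level * 2), collapse = "")
    cat(indent, "-> ", rlang::as_label(value), "\n", sep = "")
  }
})
```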
To me, rlang feels like the right place for it, because it's a bit of a peer to `last_trace()`.
Part of r-lib/vctrs#1130
I have moved the test helpers from `testthat.R` to a helper file, so that it works with both R CMD check and `devtools::test()`. Let me know if you prefer the original location for these helpers and I'll revert.

This is an interesting class to work with, and hopefully we'll be able to simplify the coercion methods in the future, once we know how to handle higher-order types in a more generic way.