r-quantities / errors

Uncertainty Propagation for R Vectors
https://r-quantities.github.io/errors
Other
49 stars 8 forks source link

Add vctrs methods #38

Closed lionel- closed 4 years ago

lionel- commented 4 years ago

Part of r-lib/vctrs#1130

I have moved the test helpers from testthat.R to a helper file, this way it works with both R CMD check and devtools::test(). Let me know if you prefer the original location for these helpers and I'll revert.

This is an interesting class to work with and hopefully we'll be able to simplify the coercion methods in the future, once we know how to handle higher order types in a more generic way.

codecov[bot] commented 4 years ago

Codecov Report

Merging #38 into master will increase coverage by 0.31%. The diff coverage is 96.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #38      +/-   ##
==========================================
+ Coverage   93.99%   94.30%   +0.31%     
==========================================
  Files           8       10       +2     
  Lines         333      369      +36     
==========================================
+ Hits          313      348      +35     
- Misses         20       21       +1     
Impacted Files Coverage Δ
R/misc.R 82.92% <ø> (-0.95%) :arrow_down:
R/tidyverse.R 95.12% <95.12%> (ø)
R/init.R 100.00% <100.00%> (ø)
R/ops.R 100.00% <100.00%> (ø)
R/utils.R 100.00% <100.00%> (ø)
R/errors.R 95.45% <0.00%> (+0.45%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update e996029...ecb3261. Read the comment docs.

Enchufa2 commented 4 years ago

Oh, and add yourself as a contributor, please!

lionel- commented 4 years ago

hmm I don't think we can use the length to figure this out. For instance if you reverse a vector it will have the same length but the correlations will be wrong if they are not recomputed properly. Maybe one thing that could help is to include some kind of identity vector in the proxy that assigns a unique identifier to the values. This seems like a hard problem though.

lionel- commented 4 years ago

Oh, and add yourself as a contributor, please!

Do you mean in DESCRIPTION? or in NEWS?

Enchufa2 commented 4 years ago

Oh, and add yourself as a contributor, please!

Do you mean in DESCRIPTION? or in NEWS?

DESCRIPTION.

hmm I don't think we can use the length to figure this out. For instance if you reverse a vector it will have the same length but the correlations will be wrong if they are not recomputed properly. Maybe one thing that could help is to include some kind of identity vector in the proxy that assigns a unique identifier to the values. This seems like a hard problem though.

Yeap, it was a simplification to explain the idea behind the identifier, but probably looking at the length is not a good indicator. In fact, reversing a vector currently doesn't hold the correlations either, because rev.default uses [ behind the scenes and I didn't consider these subsetting use cases.

I think I need a trace of calls to really understand how {dplyr} and {vctrs} interact. Do you know a better way of doing this than calling base::trace for every single function inside both packages and then running e.g. a mutate?

lionel- commented 4 years ago

dplyr uses vctrs operations like vec_match(), vec_c(), vec_unchop(), vec_slice(), etc. I have started to document the internal dependencies of each operation in special "Dependencies" sections, see https://vctrs.r-lib.org/reference/vec_count.html#dependencies for an example. By following the deps links recursively, you can get an overview of how the operations work. There is also https://vctrs.r-lib.org/reference/reference-faq-compatibility.html that might be useful.

lionel- commented 4 years ago

Here you go @Enchufa2

Would you like to do the {units} package? I'd be happy to send a PR as well for units (and quantities)!

Enchufa2 commented 4 years ago

One more question. I see:

library(errors)
library(dplyr, warn.conflicts=FALSE)

iris.q <- head(iris)
for (i in 1:4)
  errors(iris.q[,i]) <- iris.q[,i] * 0.05

tracer <- quote({
  call <- try(match.call(), silent=TRUE)
  .trace.ind <<- .trace.ind + 1
  if (!inherits(call, "try-error")) {
    indent <- paste0(rep(" ", .trace.ind), collapse="")
    line <- paste0(indent, as.character(call))
    cat(line, "\n")
  }
})

exit <- quote({ .trace.ind <<- .trace.ind - 1 })

pkgs <- c("dplyr", "vctrs", "errors")
for (pkg in pkgs) {
  where <- asNamespace(pkg)
  funs <- ls(where)
  funs <- funs[sapply(funs, function(x) is.function(get(x, where)))]
  trace(funs, tracer=tracer, exit=exit, print=FALSE, where=where)
  #untrace(funs, where=where)
}
#> [snip]

.trace.ind <- 0
res <- iris.q %>%
  mutate(another = Sepal.Length + Sepal.Width)
#>  mutate.data.frame  .  Sepal.Length + Sepal.Width 
#>    group_rows    data 
#>     group_data     .data 
#>      group_data.data.frame      .data 
#>       new_list_of       list(seq_len(nrow(.data)))       integer() 
#>        vec_is_list        x 
#>        vec_size        ptype 
#>         validate_names         .data 
#>          names_all_or_nothing          nms 
#>         vec_set_attributes         .data         attrib 
#>       new_data_frame       list(.rows = rows)       1 
#>    local_mask    self    frame 
#>     context_local     mask     x     frame 
#>      context_poke      name      value 
#>    group_keys    data 
#>     group_keys.data.frame     data 
#>      group_data      .tbl 
#>       group_data.data.frame       .tbl 
#>        new_list_of        list(seq_len(nrow(.data)))        integer() 
#>         vec_is_list         x 
#>         vec_size         ptype 
#>          validate_names          .data 
#>           names_all_or_nothing           nms 
#>          vec_set_attributes          .data          attrib 
#>        new_data_frame        list(.rows = rows)        1 
#>    group_vars    data 
#>     group_vars.data.frame     data 
#>      group_data      x 
#>       group_data.data.frame       x 
#>        new_list_of        list(seq_len(nrow(.data)))        integer() 
#>         vec_is_list         x 
#>         vec_size         ptype 
#>          validate_names          .data 
#>           names_all_or_nothing           nms 
#>          vec_set_attributes          .data          attrib 
#>        new_data_frame        list(.rows = rows)        1 
#>    is_grouped_df    data 
#>    map    seq_len(ncol(data))    function(.x) expr(promise_fn(!!.x)) 
#>    map_lgl    private$resolved    is.null 
#>    map_lgl    private$resolved    is.null 
#>    cond2int    !inherits(e1, "errors")    !inherits(e2, "errors") 
#>    propagate    unclass(NextMethod())    e1    e2    deriv[[1]]    deriv[[2]] 
#>     errors     x 
#>      errors.errors      x 
#>     errors     y 
#>      errors.errors      y 
#>     covar     x     y 
#>      covar.errors      x      y 
#>     set_errors     xx     sqrt(var) 
#>      set_errors.numeric      xx      sqrt(var) 
#>       errors<-       `*tmp*`       c(0.309273341883842, 0.287271648444465, 0.28429737951659, 0.277353564967173, 0.308058436014987, 0.333054049667618) 
#>        errors<-.numeric        `*tmp*`        c(0.309273341883842, 0.287271648444465, 0.28429737951659, 0.277353564967173, 0.308058436014987, 0.333054049667618) 
#>         new_id 
#>         errors<-.errors         x         value 
#>     ids     idx 
#>     ids     idy 
#>    drop_errors.errors    x 
#>    vctrs::new_data_frame    list(data = data, errors = errors)    length(data) 
#>    drop_errors.errors    x 
#>    vctrs::new_data_frame    list(data = data, errors = errors)    length(data) 
#>    vec_unchop    chunks    rows 
#>     vec_proxy     x 
#>     vec_proxy     x 
#>     drop_errors.errors     x 
#>     vctrs::new_data_frame     list(data = data, errors = errors)     length(data) 
#>     drop_errors.errors     x 
#>     vctrs::new_data_frame     list(data = data, errors = errors)     length(data) 
#>     drop_errors.errors     x 
#>     vctrs::new_data_frame     list(data = data, errors = errors)     length(data) 
#>     drop_errors.errors     x 
#>     vctrs::new_data_frame     list(data = data, errors = errors)     length(data) 
#>     set_errors     x$data     x$errors 
#>      set_errors.numeric      x$data      x$errors 
#>       errors<-       `*tmp*`       numeric(0) 
#>        errors<-.numeric        `*tmp*`        numeric(0) 
#>         new_id 
#>         errors<-.errors         x         value 
#>     drop_errors.errors     x 
#>     vctrs::new_data_frame     list(data = data, errors = errors)     length(data) 
#>     vec_ptype_finalise.default     x 
#>     drop_errors.errors     x 
#>     vctrs::new_data_frame     list(data = data, errors = errors)     length(data) 
#>     drop_errors.errors     x 
#>     vctrs::new_data_frame     list(data = data, errors = errors)     length(data) 
#>     drop_errors.errors     x 
#>     vctrs::new_data_frame     list(data = data, errors = errors)     length(data) 
#>     drop_errors.errors     x 
#>     vctrs::new_data_frame     list(data = data, errors = errors)     length(data) 
#>     drop_errors.errors     x 
#>     vctrs::new_data_frame     list(data = data, errors = errors)     length(data) 
#>     set_errors     x$data     x$errors 
#>      set_errors.numeric      x$data      x$errors 
#>       errors<-       `*tmp*`       numeric(0) 
#>        errors<-.numeric        `*tmp*`        numeric(0) 
#>         new_id 
#>         errors<-.errors         x         value 
#>      drop_errors.errors      x 
#>      drop_errors.errors      to 
#>      set_errors      out      errors(x) 
#>       set_errors.numeric       out       errors(x) 
#>        errors        x 
#>         errors.errors         x 
#>        errors<-        `*tmp*`        c(0.309273341883842, 0.287271648444465, 0.28429737951659, 0.277353564967173, 0.308058436014987, 0.333054049667618) 
#>         errors<-.numeric         `*tmp*`         c(0.309273341883842, 0.287271648444465, 0.28429737951659, 0.277353564967173, 0.308058436014987, 0.333054049667618) 
#>          new_id 
#>          errors<-.errors          x          value 
#>     drop_errors.errors     x 
#>     vctrs::new_data_frame     list(data = data, errors = errors)     length(data) 
#>     vec_proxy     x 
#>     drop_errors.errors     x 
#>     vctrs::new_data_frame     list(data = data, errors = errors)     length(data) 
#>     drop_errors.errors     x 
#>     vctrs::new_data_frame     list(data = data, errors = errors)     length(data) 
#>     set_errors     x$data     x$errors 
#>      set_errors.numeric      x$data      x$errors 
#>       errors<-       `*tmp*`       c(0.309273341883842, 0.287271648444465, 0.28429737951659, 0.277353564967173, 0.308058436014987, 0.333054049667618) 
#>        errors<-.numeric        `*tmp*`        c(0.309273341883842, 0.287271648444465, 0.28429737951659, 0.277353564967173, 0.308058436014987, 0.333054049667618) 
#>         new_id 
#>         errors<-.errors         x         value 
#>    map_lgl    new_columns    inherits    rlang_zap 
#>   context_poke   mask   NULL 
#>   dplyr_col_modify   .data   cols 
#>    dplyr_col_modify.data.frame    .data    cols 
#>     vec_recycle_common     !!!cols     nrow(data) 
#>      drop_errors.errors      x 
#>      vctrs::new_data_frame      list(data = data, errors = errors)      length(data) 
#>     dplyr_vec_data     data 
#>      vec_data      x 
#>       vec_assert       x 
#>        vec_is_vector        x 
#>       vec_proxy       x 
#>       has_dim       x 
#>       vec_set_attributes       x       list(names = names(x)) 
#>       unset_s4       x 
#>      new_data_frame      out      nrow(x) 
#>     new_data_frame     out     nrow(data) 
#>     dplyr_reconstruct     out     data 
#>      dplyr_new_data_frame      data 
#>        drop_errors.errors        x 
#>        vctrs::new_data_frame        list(data = data, errors = errors)        length(data) 
#>      dplyr_reconstruct_dispatch      data      template 
#>       dplyr_reconstruct.data.frame       data       template

Is there any way to carry the "id" attribute around when new_data_frame is created so that it can be recovered by vec_restore? I think that this would be enough to keep correlations for simple cases.

lionel- commented 4 years ago

IIUC I think you'd want to use the id of to in vec_restore(), but I think that would yield unpredictable semantics.

Enchufa2 commented 4 years ago

IIUC I think you'd want to use the id of to in vec_restore(), but I think that would yield unpredictable semantics.

Yes! I'll merge this PR and I'll play a little bit before introducing that change, but it just works, and it doesn't break the summarise case. What potential problems do you see?

In theory, the object returned by vec_proxy should contain all the information, right? It seems that, with the current implementation, one must choose "simple" attributes or attributes that depend on the data, and with the same length of the data. This is a mixed situation in which one attribute depends on the data, but another one doesn't. Moreover, there could be attributes that depend on the data, but with a different length, so a data frame would be no longer appropriate.

Would you like to do the {units} package? I'd be happy to send a PR as well for units (and quantities)!

I you have cycles for this, yes, I would appreciate that help a lot. Because it's clear that I still don't have a full grasp of this mechanism.

lionel- commented 4 years ago

This is a mixed situation in which one attribute depends on the data, but another one doesn't

If the attribute doesn't depend on the data, then in theory it's part of the type, and should be contained in the to argument of vec_restore().

it just works, and it doesn't break the summarise case. What potential problems do you see?

The values being restored might come from a different vector (e.g. with vec_assign()), might come from the same vector but with a different order, etc. In general you can't assume the restored vector contains the same data as the proxied one. The point of proxying is to manipulate the memory. So this approach will only work in the fringe case of restoring exactly the same data.

This is why I suggested to include id columns in the proxy, the first column would be a rep(list(id_env), size), and the second would be seq_len(size) to identify the values. The combination would give you a robust way of detecting manipulations that invalidate the correlation. Performance-wise, this might be too onerous though.

I you have cycles for this, yes, I would appreciate that help a lot. Because it's clear that I still don't have a full grasp of this mechanism.

Sounds good. And no worries, we're still learning as well, this is research in progress :-)

Enchufa2 commented 4 years ago

it just works, and it doesn't break the summarise case. What potential problems do you see?

The values being restored might come from a different vector (e.g. with vec_assign()), might come from the same vector but with a different order, etc. In general you can't assume the restored vector contains the same data as the proxied one. The point of proxying is to manipulate the memory. So this approach will only work in the fringe case of restoring exactly the same data.

This is why I suggested to include id columns in the proxy, the first column would be a rep(list(id_env), size), and the second would be seq_len(size) to identify the values. The combination would give you a robust way of detecting manipulations that invalidate the correlation. Performance-wise, this might be too onerous though.

What if

 vec_proxy.errors <- function(x, ...) {
   data <- drop_errors.errors(x)
   errors <- attr(x, "errors")

   # Simplifies coercion methods
   errors <- as.double(errors)

   # The `errors` are a vectorised attribute, which requires a data
   # frame proxy
-  vctrs::new_data_frame(
+  df <- vctrs::new_data_frame(
     list(data = data, errors = errors),
     n = length(data)
   )
+  attr(df, "id") <- attr(x, "id")
+  df
 }

Is there any guarantee that this attribute would survive to see vec_restore?

lionel- commented 4 years ago

This is UB from the point of view of vctrs. But even if not, it would also incorrectly propagate the id, even though there is no guarantee it will restore to the same data. If not the same data, the cached correlations will be invalid.

lionel- commented 4 years ago

A good use case to test:

dplyr::slice_sample(iris_errors, n = nrow(iris_errors))

The data will restore with the same size, but it will be different. Propagating the id is incorrect in that case.

Enchufa2 commented 4 years ago

Yeap, you're right. And here nor {dplyr} nor {vctrs} are calling any {errors} method.

So basically there's missing information at vec_restore.errors, because to could be the final object (e.g., for dplyr::mutate(iris_errors, other = Sepal.Length + Sepal.Width), because Ops.errors has been called, the id is what it should be already, and covariances have been propagated), or the original object (e.g., your dplyr::slice_sample example).

In the first case, .errors operations happen before vec_proxy and then nothing (?) happens between vec_proxy and vec_restore. In the second case, no .errors methods are called, and then reordering happens between vec_proxy and vec_restore without using [.errors. Without knowing in which situation we are at vec_restore, it's basically impossible to decide whether a new id is required and whether covariances need to be adjusted. Ideally, [.errors should be called and no further operations would be required (as in the first case).

lionel- commented 4 years ago

The situation is similar to [:

x <- set_errors(1:3, 1:3)
y <- set_errors(3:1, 3:1)
z <- x + y

correl(x, z)
#> [1] 0.3162278 0.7071068 0.9486833

correl(x[], z)
#> NULL

correl(x[seq_along(x)], z)
#> NULL
Enchufa2 commented 4 years ago

Yes, as I said, this particular use case is not currently covered in [.errors. What I meant is that it could be easily covered in [.errors, because there's all the information needed: 1) we know the operation (we are in fact inside the subsetting operator) and 2) we know the indices. If [.errors is not called, then in vec_restore we could know the indices (if we put them in the data frame, as you suggested), but we don't know what kind of operation was performed.

lionel- commented 4 years ago

With the combination of a vector identifier and an element identifier in the proxy you'd have all the information as well in vec_restore(). If all the vector identifiers are identical, it means the data was reordered or subsetted. With altrep, non-transformative operations on these proxies would be efficient.

lionel- commented 4 years ago

By the way your tracing routine is fantastic, thanks for that! It helped my workflow twice already.

Enchufa2 commented 4 years ago

:D It was a quick test and for sure it can be done better. But if it's useful, I could make a package out of it, or send a PR to... {pryr} may be a good place?

lionel- commented 4 years ago

Maybe lobstr? What do you think @hadley and @gaborcsardi?

Just did a few improvements:

# Same logic as rlang backtrace formatting
call_add_namespace <- function(call, fn) {
  if (!is.call(call) || !is.symbol(call[[1]])) {
    return(call)
  }

  sym <- call[[1]]
  nm <- as.character(sym)

  if (nm %in% c("::", ":::")) {
    return(call)
  }

  env <- environment(fn)
  top <- topenv(env)

  if (identical(env, globalenv())) {
    prefix <- "global"
    op <- "::"
  } else if (isNamespace(top)) {
    prefix <- rlang::ns_env_name(top)
    if (rlang:::ns_exports_has(top, nm)) {
      op <- "::"
    } else {
      op <- ":::"
    }
  } else {
    return(call)
  }

  namespaced_sym <- call(op, as.symbol(prefix), sym)
  call[[1]] <- namespaced_sym
  call
}

trace_pkgs <- function(pkgs, max_level = Inf, ..., regexp = NULL) {
  # Avoids namespace loading issues
  lapply(pkgs, requireNamespace, quietly = TRUE)

  trace_level <- 0

  # Create a thunk because `trace()` sloppily transforms functions into calls
  tracer <- rlang::call2(function() {
    trace_level <<- trace_level + 1

    if (trace_level > max_level) {
      return()
    }

    # Work around sys.foo() weirdness
    get_fn <- rlang::call2(function(fn = sys.function(sys.parent())) fn)
    fn <- eval(get_fn, env = parent.frame())

    try(silent = TRUE, {
      call <- evalq(base::match.call(), envir = parent.frame())
      call <- call_add_namespace(call, fn)

      indent <- paste0(rep(" ", (trace_level - 1) * 2), collapse = "")
      line <- paste0(indent, rlang::as_label(call))
      cat(line, "\n")
    })
  })

  exit <- rlang::call2(function() {
    trace_level <<- trace_level - 1
  })

  if (length(regexp) == 1) {
    regexp <- rep(regexp, length(pkgs))
  }

  for (i in seq_along(pkgs)) {
    pkg <- pkgs[[i]]
    ns <- asNamespace(pkg)
    ns_fns <- names(Filter(is.function, as.list(ns)))

    if (!is.null(regexp)) {
      ns_fns <- ns_fns[grepl(regexp[[i]], ns_fns)]
    }

    suppressMessages(trace(
      ns_fns,
      tracer = tracer,
      exit = exit,
      print = FALSE,
      where = ns
    ))

    message(sprintf(
      "Tracing %d functions in %s.",
      length(ns_fns),
      pkg
    ))
  }
}

This ensures the calls are namespaced, allows setting a maximum nesting level, and supplying a regexp to match functions:

trace_pkgs(c("vctrs", "tibble"), max_level = 2, regexp = "^(vec|new)_")
#> Tracing 459 functions in vctrs.
#> Tracing 2 functions in tibble.

library(dplyr)
invisible(mtcars %>% transmute(ts(cyl)))
#> vctrs::new_list_of(x = list(seq_len(nrow(.data))), ptype = integer())
#>   vctrs::vec_is_list(x = x)
#>   vctrs::vec_size(x = ptype)
#> vctrs::new_data_frame(x = list(.rows = rows), n = 1L)
#> vctrs::new_list_of(x = list(seq_len(nrow(.data))), ptype = integer())
#>   vctrs::vec_is_list(x = x)
#>   vctrs::vec_size(x = ptype)
#> vctrs::new_data_frame(x = list(.rows = rows), n = 1L)
#> vctrs::new_list_of(x = list(seq_len(nrow(.data))), ptype = integer())
#>   vctrs::vec_is_list(x = x)
#>   vctrs::vec_size(x = ptype)
#> vctrs::new_data_frame(x = list(.rows = rows), n = 1L)
#> vctrs::vec_unchop(x = chunks, indices = rows)
#>   vctrs::vec_proxy(x = x)
#>   vctrs::vec_proxy(x = x)
#>   vctrs::vec_default_cast(...)
#>   vctrs::vec_proxy(x = x)
#>   vctrs::vec_slice(x = i, i = existing)
#>   vctrs::vec_slice(x = value, i = existing)
#>   vctrs:::vec_dim_n(x = x)
#> vctrs::vec_recycle_common(!!!cols, .size = nrow(data))
#> vctrs::vec_data(x = x)
#>   vctrs::vec_assert(x = x)
#>   vctrs::vec_proxy(x = x)
#>   vctrs:::vec_set_attributes(x = x, attrib = list(names = names(x)))
#> vctrs::new_data_frame(x = out, n = nrow(x))
#> vctrs::new_data_frame(x = out, n = nrow(data))
#> vctrs::new_list_of(x = list(seq_len(nrow(.data))), ptype = integer())
#>   vctrs::vec_is_list(x = x)
#>   vctrs::vec_size(x = ptype)
#> vctrs::new_data_frame(x = list(.rows = rows), n = 1L)
#> vctrs::vec_as_location(i = loc, n = ncol(.data), names = names(.data))
Enchufa2 commented 4 years ago

Great work! I'm gonna use this a lot.

Enchufa2 commented 4 years ago

Another potential improvement: optionally, the exit expression could call returnValue() to print some info about the returned value.

hadley commented 4 years ago

To me, it feels like rlang is the right place for it, because it feels like a bit of a peer to last_trace().