ropensci / taxa

taxonomic classes for R
https://docs.ropensci.org/taxa
Other
48 stars 12 forks source link

Consider allowing the data set to be specified when using NSE #154

Open zachary-foster opened 6 years ago

zachary-foster commented 6 years ago

This would be a solution to the problem described in #153. Say we have:

<Taxmap>
  17 taxa: b. Mammalia, c. Plantae, d. Felidae, e. Notoryctidae, f. Hominidae ... o. typhlops, p. sapiens, q. lycopersicum, r. tuberosum
  17 edges: NA->b, NA->c, b->d, b->e, b->f, c->g, d->h, d->i, e->j, f->k, g->l, h->m, i->n, j->o, k->p, l->q, l->r
  5 data sets:
    info:
      # A tibble: 6 x 4
        taxon_id name  n_legs dangerous
        <chr>    <chr>  <dbl> <lgl>    
      1 m        tiger   4.00 T        
      2 n        cat     4.00 F        
      3 o        mole    4.00 F        
      # ... with 3 more rows
    phylopic_ids: a named vector of 'character' with 6 items
       m. e148eabb-f138-43c6-b1e4-5cda2180485a, n. 12899ba0-9923-4feb-a7f9-758c3c7d5e13 ... r. 63604565-0406-460b-8cb8-1abe954b3f3a
    foods: a list of 6 items named by taxa:
       m, n, o, p, q, r
    abund:
      # A tibble: 8 x 5
        taxon_id code  sample_id count taxon_index
        <chr>    <fct> <fct>     <dbl>       <int>
      1 m        T     A          1.00           1
      2 n        C     A          2.00           2
      3 o        M     B          5.00           3
      # ... with 5 more rows
    abund_2:
      # A tibble: 8 x 5
        taxon_id code  sample_id count taxon_index
        <chr>    <fct> <fct>     <dbl>       <int>
      1 m        T     A          1.00           1
      2 n        C     A          2.00           2
      3 o        M     B          5.00           3
      # ... with 5 more rows
  1 functions:
    reaction

Then, in:

filter_obs(obj, "abund", count > 1)

count is ambiguous and would generate a warning (#153) and recommend that the user could specify the dataset like so:

filter_obs(obj, "abund", abund$count > 1)

For a single operation, this would be equivalent to:

filter_obs(obj, "abund",obj$data$abund$count > 1)

However, when a series of filters is piped together with %>%, obj$data$abund$count might be "outdated" if the taxmap in the pipe was changed due to a previous filtering step.The abund$count would use the changed version.

Seem reasonable @sckott?

sckott commented 6 years ago

Will this solution scale ? e.g, if a user has 1000 tables in the object. I guess each table does have to have a unique name, correct?

zachary-foster commented 6 years ago

Will this solution scale ? e.g, if a user has 1000 tables in the object.

I don't think it will be a problem computationally, but it would certainly be confusing.

I guess each table does have to have a unique name, correct?

Ahh yea, I did not think about that. Good point. Yes, each table would need a unique name. If you had many unnamed tables, then NSE would not be a good way to access that data anyway however. The user can always use the data list directly in that case.

sckott commented 6 years ago

right, so should we enforce unique table names then?

zachary-foster commented 6 years ago

I dont think so, just have NSE only work on named tables and warn the user if they specify something ambiguous and say what is being used. If ambiguous column names are used then I just added the warning:

> x$filter_obs("abund", code == "T")
Warning message:
Ambiguous names used in non-standard evaluation:
   code
These could refer to values in different datasets with the same name. The following values will be used:
   data$abund$code

We could do something similar for multiple datasets with the same name.

Unnamed datasets perhaps could be referenced by [[3]]$code, although that is a little cryptic and arbitrary and harder to implement.

I think of NSE as a convenience (it should not be the only way to do any operation), so it does not need to work in all cases as long as it fails with a good warning/error.

sckott commented 6 years ago

Okay, sounds good