ropensci / taxa

taxonomic classes for R
https://docs.ropensci.org/taxa

`filter_obs`: consider allowing multiple datasets at once #179

Closed zachary-foster closed 5 years ago

zachary-foster commented 5 years ago

If I have an object like:

<Taxmap>
  629 taxa: aac. Bacteria ... azs. Nitrospinaceae
  629 edges: NA->aac, aac->aad, aac->aae ... anm->azr, ann->azs
  3 data sets:
    tax_data:
      # A tibble: 3,070,243 x 3
        taxon_id class                      input                    
        <chr>    <chr>                      <chr>                    
      1 ano      "=Root;rootrank;Bacteria;… "\tLineage=Root;rootrank…
      2 ano      "=Root;rootrank;Bacteria;… "\tLineage=Root;rootrank…
      3 ano      "=Root;rootrank;Bacteria;… "\tLineage=Root;rootrank…
      # ... with 3.07e+06 more rows
    class_data:
      # A tibble: 16,684,640 x 4
        taxon_id input_index taxon_name           rdp_rank
        <chr>          <int> <chr>                <chr>   
      1 aac                1 Bacteria             d       
      2 aad                1 "\"Actinobacteria\"" p       
      3 aca                1 Actinobacteria       c       
      # ... with 1.668e+07 more rows
    sequence:
      3070243 DNA sequences in binary format stored in a list.

      Mean sequence length: 1042.096 
         Shortest sequence: 400 
          Longest sequence: 2922 

      Labels:
      ano
      ano
      ano
      ano
      ano
      ano
      ...

      More than 10 million nucleotides: not printing base composition
  0 functions:

where multiple datasets have the same length and the same taxon IDs, it would be nice to be able to filter several of them at once:

filter_obs(rdp, c("tax_data", "sequence"), vapply(sequence, length, numeric(1)) >= min_seq_length, drop_taxa = TRUE)

instead of:

long_enough <- vapply(rdp$data$sequence, length, numeric(1)) >= min_seq_length
rdp <- filter_obs(rdp, "sequence", long_enough, drop_taxa = TRUE)
rdp <- filter_obs(rdp, "tax_data", long_enough, drop_taxa = TRUE)
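
To make the idea concrete, here is a minimal sketch of applying one logical mask to several same-length datasets. It is not taxa's implementation and does not use a `Taxmap` object. `filter_datasets` is a hypothetical helper, and `toy` is a made-up plain list standing in for `rdp$data`. It ignores taxon dropping (`drop_taxa`) entirely and only shows the "same mask, many datasets" part, including the length check that would presumably be needed.

```r
# Hypothetical helper (not part of taxa): apply one logical mask to several
# same-length datasets stored in a named list.
filter_datasets <- function(data, dataset_names, keep) {
  stopifnot(is.logical(keep), all(dataset_names %in% names(data)))
  for (nm in dataset_names) {
    x <- data[[nm]]
    # Tables are filtered by row, vectors and lists by element
    n <- if (is.data.frame(x)) nrow(x) else length(x)
    if (n != length(keep)) {
      stop("dataset '", nm, "' has ", n, " observations but the mask has ",
           length(keep))
    }
    data[[nm]] <- if (is.data.frame(x)) x[keep, , drop = FALSE] else x[keep]
  }
  data
}

# Toy stand-in for rdp$data (invented values, for illustration only)
toy <- list(
  tax_data = data.frame(taxon_id = c("ano", "ano", "azs"),
                        stringsAsFactors = FALSE),
  sequence = list("ACGT", "AC", "ACGTAC")
)
min_seq_length <- 4

# Compute the mask once, then filter both datasets in a single call
long_enough <- vapply(toy$sequence, nchar, numeric(1)) >= min_seq_length
toy <- filter_datasets(toy, c("tax_data", "sequence"), long_enough)
```

Computing the mask once and checking every dataset's length against it before subsetting keeps the datasets aligned. Repeated single-dataset calls only stay aligned if the caller reuses the same mask each time.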