ropensci / taxadb

:package: Taxonomic Database
https://docs.ropensci.org/taxadb

Clean joins #33

Closed cboettig closed 5 years ago

cboettig commented 5 years ago

Primarily this adds a somewhat generic helper utility, mutate_db(), which can add columns to joins.

Currently it doesn't have full NSE semantics, and it isn't fully pipe-able, in the sense that you cannot first apply other lazily evaluated dplyr database commands (such as filter), since that is not compatible with paging through results with dbFetch. There is probably some trick to force those to evaluate on disk.
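For readers unfamiliar with the paging pattern mentioned above, here is a minimal sketch of the general idea using only DBI calls. It is not the mutate_db() implementation from this PR: the helper name mutate_chunked(), the use of two separate in-memory SQLite connections (so the open result set is never disturbed by writes), and the use of tolower() as a stand-in for clean_names() are all assumptions made for illustration.

```r
library(DBI)

# Hypothetical helper (not taxadb's mutate_db()): reads `table` from `src`
# in chunks of `n` rows, adds `new_column` = r_fn(col) to each chunk, and
# appends the chunk to a table of the same name in `dest`.
mutate_chunked <- function(src, dest, table, r_fn, col, new_column, n = 5000L) {
  res <- dbSendQuery(src, paste("SELECT * FROM", dbQuoteIdentifier(src, table)))
  on.exit(dbClearResult(res), add = TRUE)
  while (!dbHasCompleted(res)) {
    chunk <- dbFetch(res, n = n)        # pull at most n rows into memory
    if (nrow(chunk) == 0) break
    chunk[[new_column]] <- r_fn(chunk[[col]])
    dbWriteTable(dest, table, chunk, append = TRUE)
  }
  invisible(dest)
}

# Toy data; assumes the RSQLite package is installed.
src  <- dbConnect(RSQLite::SQLite(), ":memory:")
dest <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(src, "itis", data.frame(
  scientificName = c("Homo sapiens", "Chamaeleo CALYPTRATUS", "Furcifer pardalis"),
  stringsAsFactors = FALSE
))

# tolower() stands in for a real name-cleaning function.
mutate_chunked(src, dest, "itis", tolower, "scientificName", "input", n = 2L)
out <- dbReadTable(dest, "itis")

dbDisconnect(src)
dbDisconnect(dest)
```

Because the loop reads from a raw SQL result set, there is no obvious place to apply a lazy dplyr filter first, which is the limitation described above.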

In any event, we might want to run this instead in the data-raw dirs, but this default cleaning may be too aggressive.

Here's the unit test illustrating this at work in a real example, in which we create a new column, input, in the itis data, which is the result of clean_names(scientificName).

  library(dplyr)
  library(taxadb)
  td_create(c("itis", "ncbi"))

  chameleons <- taxa_tbl("ncbi") %>%
    filter(family == "Chamaeleonidae",
           taxonomicStatus != "accepted") %>%
    select(species = scientificName) %>%
    collect() %>%
    mutate(input = clean_names(species),
           sort = seq_along(species))

  ## Input table with clean names
  ## Let's get some matches, amazing how bad this is.  Need wikidata synonyms
  taxa <- taxa_tbl("itis") %>%
    mutate_db(clean_names, "scientificName", "input") %>%
    right_join(chameleons, copy = TRUE, by = "input") %>%
    arrange(sort)  %>%
    collect()

  ## lots of duplicate matches; keep one per input name (the largest acceptedNameUsageID) for now:
  matched <- taxa %>% select(acceptedNameUsageID, sort) %>% distinct() %>%
    group_by(sort) %>% top_n(1, acceptedNameUsageID)
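To make the last step concrete, here is a small sketch of the same distinct() / group_by() / top_n() pattern on made-up data. The identifiers below are invented for illustration and are not real ITIS records. Note that top_n(1, acceptedNameUsageID) keeps the largest value within each group, which for character IDs means the last one in string order rather than the first row encountered.

```r
library(dplyr)

# Made-up join result: input row 1 matched two different accepted IDs
# (one of them twice), input row 2 matched exactly one.
taxa <- tibble(
  sort = c(1L, 1L, 1L, 2L),
  acceptedNameUsageID = c("ITIS:10", "ITIS:10", "ITIS:25", "ITIS:7")
)

matched <- taxa %>%
  select(acceptedNameUsageID, sort) %>%
  distinct() %>%                        # drop exact duplicate matches
  group_by(sort) %>%
  top_n(1, acceptedNameUsageID) %>%     # keep the largest ID per input row
  ungroup()
```

Here each value of sort ends up with a single row: "ITIS:25" for the first input and "ITIS:7" for the second.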