Clean joins - Githubissues

Primarily, this adds a somewhat generic helper utility, mutate_db(), which can add a derived column to a database table so that the table can be joined on it.

Currently it doesn't have full NSE semantics, and it isn't fully pipe-able, in the sense that you cannot first apply other lazy-eval dplyr database commands (e.g. filter), since that is not compatible with paging through the results with dbFetch. There is probably some trick to force those to evaluate on disk.
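To make the paging constraint concrete, here is a minimal sketch of the general pattern: stream a table in chunks with dbFetch, apply an R function to one column, and append each chunk to a new table. This is not the mutate_db() implementation from this PR. The helper name mutate_in_chunks(), its arguments, the stand-in cleaner simple_clean(), and the use of two in-memory RSQLite connections are all assumptions made for illustration.

```r
library(DBI)

# Hypothetical helper (NOT the PR's mutate_db()): stream `table` from `src`
# in chunks of `n` rows, apply `r_fn` to column `col`, and append each chunk
# (with the new column added) to `new_table` on a separate connection `dest`.
mutate_in_chunks <- function(src, dest, table, new_table,
                             r_fn, col, new_column, n = 2L) {
  query <- paste("SELECT * FROM", dbQuoteIdentifier(src, table))
  res <- dbSendQuery(src, query)
  on.exit(dbClearResult(res), add = TRUE)
  while (!dbHasCompleted(res)) {
    chunk <- dbFetch(res, n = n)          # page through the result set
    if (nrow(chunk) == 0) break
    chunk[[new_column]] <- r_fn(chunk[[col]])
    dbWriteTable(dest, new_table, chunk, append = TRUE)
  }
  invisible(dest)
}

# Toy data standing in for the itis table (assumed values, for illustration)
src  <- dbConnect(RSQLite::SQLite(), ":memory:")
dest <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(src, "itis", data.frame(
  scientificName = c("Chamaeleo  Calyptratus", "Furcifer pardalis ",
                     "Brookesia minima"),
  stringsAsFactors = FALSE
))

# Stand-in cleaner (an assumption; the real clean_names() may do more)
simple_clean <- function(x) tolower(trimws(gsub("\\s+", " ", x)))

mutate_in_chunks(src, dest, "itis", "itis_clean",
                 simple_clean, "scientificName", "input")
result <- dbReadTable(dest, "itis_clean")

dbDisconnect(src)
dbDisconnect(dest)
```

Because the helper issues its own SELECT against a named table, a lazy query built up with filter() has no table name to page over, which is the limitation described above. The sketch writes to a second connection so that the open result set on the first one is not disturbed while rows are still being fetched.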

In any event, we might want to run this instead in the data-raw dirs, but this default cleaning may be too aggressive.

Here's the unit test illustrating this at work in a real example, in which we create a new column, input, in the itis data, which is the result of clean_names(scientificName).

  library(dplyr)
  library(taxadb)
  td_create(c("itis", "ncbi"))

  chameleons <- taxa_tbl("ncbi") %>%
    filter(family == "Chamaeleonidae",
           taxonomicStatus != "accepted") %>%
    select(species = scientificName) %>%
    collect() %>%
    mutate(input = clean_names(species),
           sort = seq_along(species))  # seq_along() is safe when there are zero rows

  ## Input table with clean names
  ## Let's get some matches, amazing how bad this is.  Need wikidata synonyms
  taxa <- taxa_tbl("itis") %>%
    mutate_db(clean_names, "scientificName", "input") %>%
    right_join(chameleons, copy = TRUE, by = "input") %>%
    arrange(sort)  %>%
    collect()

  ## lots of duplicate matches; for now keep one per input row
  ## (top_n() keeps the largest acceptedNameUsageID within each sort group):
  matched <- taxa %>%
    select(acceptedNameUsageID, sort) %>%
    distinct() %>%
    group_by(sort) %>%
    top_n(1, acceptedNameUsageID)

ropensci / taxadb

Clean joins #33