Primarily this adds a somewhat generic helper utility, `mutate_db()`, which can add columns to a database table for use in joins.
Currently it doesn't have full NSE semantics, and isn't fully pipe-able, in the sense that you cannot first apply other lazy-eval `dplyr` database commands (e.g. `filter`), since that is not compatible with paging through results with `dbFetch`. There is probably some trick to force those to evaluate on disk.
In any event, we might instead want to run this in the `data-raw` dirs, but this default cleaning may be too aggressive.
Here's the unit test illustrating this at work in a real example, in which we create a new column, `input`, in the `itis` data, which is the result of `clean_names(scientificName)`.
```r
library(dplyr)
library(taxadb)

td_create(c("itis", "ncbi"))

chameleons <- taxa_tbl("ncbi") %>%
  filter(family == "Chamaeleonidae",
         taxonomicStatus != "accepted") %>%
  select(species = scientificName) %>%
  collect() %>%
  mutate(input = clean_names(species),
         sort = seq_along(species))
## Input table with clean names

## Let's get some matches, amazing how bad this is. Need wikidata synonyms
taxa <- taxa_tbl("itis") %>%
  mutate_db(clean_names, "scientificName", "input") %>%
  right_join(chameleons, copy = TRUE, by = "input") %>%
  arrange(sort) %>%
  collect()

## lots of duplicate matches, pick the first one for now:
matched <- taxa %>%
  select(acceptedNameUsageID, sort) %>%
  distinct() %>%
  group_by(sort) %>%
  top_n(1, acceptedNameUsageID)
```
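For reference, the `dbFetch` paging idea mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual `mutate_db()` implementation: `mutate_chunked()` is a hypothetical helper, RSQLite in-memory databases stand in for the real backend, `tolower` stands in for `clean_names`, and the result is written to a second connection so the open result set on the source is left undisturbed.

```r
library(DBI)

# Hypothetical helper (NOT taxadb's mutate_db): stream table `tbl` from `src`
# in chunks of `n` rows, add `new_col` = fn(<col>) to each chunk, and append
# the chunk to `out_tbl` on the `dest` connection.
mutate_chunked <- function(src, dest, tbl, out_tbl, fn, col, new_col, n = 2L) {
  res <- dbSendQuery(src, paste("SELECT * FROM", dbQuoteIdentifier(src, tbl)))
  on.exit(dbClearResult(res))
  while (!dbHasCompleted(res)) {
    chunk <- dbFetch(res, n = n)          # page through n rows at a time
    if (nrow(chunk) == 0) break
    chunk[[new_col]] <- fn(chunk[[col]])  # apply the R function in memory
    dbWriteTable(dest, out_tbl, chunk, append = TRUE)
  }
  invisible(dest)
}

# Assumption: RSQLite stands in for the real database backend.
src  <- dbConnect(RSQLite::SQLite(), ":memory:")
dest <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(src, "itis", data.frame(
  scientificName = c("Chamaeleo Calyptratus", "Furcifer pardalis", "Brookesia Minima"),
  stringsAsFactors = FALSE
))

# `tolower` is a stand-in for clean_names().
mutate_chunked(src, dest, "itis", "itis_clean", tolower, "scientificName", "input")
result <- dbReadTable(dest, "itis_clean")

dbDisconnect(src)
dbDisconnect(dest)
```

Because the query sent with `dbSendQuery` is a plain `SELECT *` on a named table, a lazy `filter()` applied upstream has nothing to attach to, which is the pipe-ability limitation described above.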