load_taxonomic_resources is slow

fontikar commented 7 months ago

extract_genus needs to be sped up.

One option is to preprocess the parquet files before saving. (Not ideal, because the stable and current resources would then differ.)

Relevant SO post: https://stackoverflow.com/questions/70945318/r-large-data-table-why-is-extracting-a-word-with-regex-faster-than-stringrword
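
For illustration, the regex approach that the linked post describes could look roughly like the sketch below. It uses only base R; `first_word` is a hypothetical helper name and is not part of APCalign, and the example names are made up.

```r
# Hypothetical helper (not from APCalign): return the first
# whitespace-delimited word of each element using a single regex.
first_word <- function(x) {
  # Skip leading whitespace, capture the first run of non-space
  # characters, and discard the rest of the string.
  sub("^\\s*(\\S+).*$", "\\1", x, perl = TRUE)
}

# Made-up example names
example_names <- c("Acacia aneura", "x Glossadenia tutelata", "Banksia")
first_word(example_names)
```

Because `sub` is vectorised and does a single pass per string, it avoids the per-call overhead of stringr::word; whether it is actually faster on the APC data would need to be measured.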

dfalster commented 7 months ago

@wcornwell identified that most of the time is being spent in extract_genus. The current code is:

extract_genus <- function(taxon_name) {
  genus <- 
    ifelse(
      stringr::word(taxon_name, 1) %>% stringr::str_to_lower() == "x",
      paste(stringr::word(taxon_name, 1) %>% stringr::str_to_lower(), stringr::word(taxon_name, 2) %>% stringr::str_to_sentence()),
      stringr::word(taxon_name, 1) %>% stringr::str_to_sentence()
    )
  genus
}

The current APC has 110000 names. The proposed revision below cuts the run time from 4.4 s down to 0.13 s. So, the time to run devtools::load_all() drops from 23.1 s to 6.9 s. Further efficiencies are probably possible.

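extract_genus <- function(taxon_name) {

  # Take the first and second space-separated words of each name
  word1 <- stringr::str_split_i(taxon_name, " ", 1)
  word2 <- stringr::str_split_i(taxon_name, " ", 2)

  genus <- stringr::str_to_sentence(word1)

  # Deal with names that begin with x,
  # e.g. "x Taurodium x toveyanum" or "x Glossadenia tutelata"
  # which() drops NA so missing names do not break the assignment
  i <- which(stringr::str_to_lower(word1) == "x")
  genus[i] <- paste("x", stringr::str_to_sentence(word2[i]))

  genus
}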
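
One way to check that a split-based rewrite returns the same result as the original is to run both on a handful of names and compare. This is only a sketch: the function names `extract_genus_old` and `extract_genus_new` and the test names are made up for illustration, and it assumes stringr >= 1.5.0 is installed (needed for str_split_i).

```r
library(stringr)

# Original logic from the thread, written without pipes
extract_genus_old <- function(taxon_name) {
  first <- str_to_lower(word(taxon_name, 1))
  ifelse(
    first == "x",
    paste(first, str_to_sentence(word(taxon_name, 2))),
    str_to_sentence(word(taxon_name, 1))
  )
}

# Split-based variant (assumption: stringr >= 1.5.0 for str_split_i)
extract_genus_new <- function(taxon_name) {
  word1 <- str_split_i(taxon_name, " ", 1)
  word2 <- str_split_i(taxon_name, " ", 2)
  genus <- str_to_sentence(word1)
  # Names starting with "x" keep a lower-case "x" plus the capitalised genus
  i <- which(str_to_lower(word1) == "x")
  genus[i] <- paste("x", str_to_sentence(word2[i]))
  genus
}

# Made-up test names covering the plain, lower-case and hybrid cases
test_names <- c(
  "Acacia aneura",
  "x Glossadenia tutelata",
  "banksia serrata",
  "X Taurodium x toveyanum"
)

# Stops with an error if the two implementations disagree
stopifnot(identical(extract_genus_old(test_names), extract_genus_new(test_names)))

# Rough timing on a larger vector; numbers will vary by machine
big <- rep(test_names, 2500)
system.time(extract_genus_old(big))
system.time(extract_genus_new(big))
```

The comparison only covers names with at least two words; single-word or missing names would need their own test cases.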

dfalster commented 7 months ago

@wcornwell - Can you paste example code for profiling? I'm just running

system.time({devtools::load_all()})

I've used profvis before, but don't have the code handy.

wcornwell commented 7 months ago

same syntax basically, but with better output.

profvis::profvis({
  resources <- load_taxonomic_resources()
})

It takes a little while to get used to the interactive browser view.

wcornwell commented 7 months ago

seems done?

traitecoevo / APCalign

load_taxonomic_resources is slow #187