traitecoevo / austraits.build

Source for AusTraits

Speed: make build and tests run faster #515

Closed dfalster closed 2 years ago

dfalster commented 3 years ago

While addressing #512, I noticed that some of the tests were very slow.

An initial commit identified the function format_sites as being very slow.

Suggest running some profiling to identify the slowest parts, then seeing whether we can optimise those.

dfalster commented 3 years ago

The above commit improves format_sites, leading to a faster build. A before-and-after comparison shows it reduces elapsed build time for ANBG_2019 (the source with the most site data) from about 18.4 s to about 12.8 s:

Before

> system.time(remake::make("ANBG_2019"))
[  READ ]                                |  # loading sources
<  MAKE > ANBG_2019
[ BUILD ] ANBG_2019                      |  ANBG_2019 <- load_study("data/ANBG_2019/data.csv", ANBG_2019_config)
   user  system elapsed 
 18.061   0.365  18.446 

After

> system.time(remake::make("ANBG_2019"))
[  LOAD ] 
[  READ ]                                |  # loading sources
<  MAKE > ANBG_2019
[ BUILD ] ANBG_2019                      |  ANBG_2019 <- load_study("data/ANBG_2019/data.csv", ANBG_2019_config)
[  READ ]                                |  # loading packages
   user  system elapsed 
 12.399   0.406  12.826 
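
To put a number on the improvement, here is a minimal sketch that takes the two elapsed times reported by system.time() above and works out the saving. The variable names are just for illustration; only the two timings come from the runs above.

```r
# Elapsed times (seconds) copied from the two system.time() outputs above
before <- 18.446
after  <- 12.826

saved     <- before - after   # seconds saved per build of ANBG_2019
reduction <- saved / before   # fraction of the original build time removed
speedup   <- before / after   # how many times faster the build is now

cat(sprintf("Saved %.2f s (%.1f%% less time, %.2fx speedup)\n",
            saved, 100 * reduction, speedup))
```

That works out to a saving of about 5.6 s, or roughly 30% of the original build time, for this one source on a single run.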

dfalster commented 3 years ago

Some info here on how to profile R code:

http://adv-r.had.co.nz/Profiling.html
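
As a starting point, here is a rough sketch of profiling with base R's Rprof() and summaryRprof(). Note that slow_step() is a made-up placeholder workload, not code from this repo; in practice it would be replaced by the real build call, e.g. remake::make("ANBG_2019").

```r
# Hypothetical stand-in for a slow build step (NOT austraits.build code).
# It keeps the CPU busy for about `seconds` so the profiler collects samples.
slow_step <- function(seconds = 1) {
  start <- proc.time()[["elapsed"]]
  out <- character(0)
  while (proc.time()[["elapsed"]] - start < seconds) {
    # Growing a vector inside a loop is a classic source of slowness
    out <- c(out, paste("site", length(out) + 1))
  }
  out
}

prof_file <- tempfile(fileext = ".out")

Rprof(prof_file)        # start the sampling profiler, writing to prof_file
result <- slow_step()   # replace with the real build call when profiling
Rprof(NULL)             # stop profiling

# Summarise the samples; by.total ranks functions by time spent in them
# including everything they call
prof <- summaryRprof(prof_file)
head(prof$by.total)
```

The functions near the top of by.total (and by.self, for time spent in the function itself) are the ones worth looking at first.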

ehwenk commented 2 years ago

Complex studies that are good to run tests on:

Catford_2014 - complex custom R code
Maslin_2012 - complex custom R code (old style), with some flowering times not properly read in; long format
ANBG_2019 - long format; example where some rows of data are not "traits" and therefore yield an "unsupported trait" error; lots of substitutions
Cheal_2017 - currently lots of excluded observations; also lots of substitutions
White_2020 - complete taxonomic changes
Bloomfield_2018 - quite a few excluded numeric values; includes sites, date
Duan_2015 - complex contexts
Westoby_2014 - big collection of numeric traits; but no "issues"; includes sites

dfalster commented 2 years ago

Hi Lizzy, thanks. Can we avoid the really big datasets for this? And also datasets that are very stable (unlikely to be tweaked further)?

So the following seem good: Catford_2014, Duan_2015, Maslin_2012, Westoby_2014. Any others you want to add?

Actually, it might be good to include one large dataset, but not several.
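
For reference, a sketch of how the four studies listed above could be timed one by one. time_builds() is a hypothetical helper, not something in austraits.build; it assumes each study is a remake target, as in the remake::make("ANBG_2019") call earlier in this thread.

```r
# Studies proposed above as a test set
test_studies <- c("Catford_2014", "Duan_2015", "Maslin_2012", "Westoby_2014")

# Hypothetical helper (not part of austraits.build).
# `build` defaults to remake::make, but any function taking a study name works.
time_builds <- function(studies, build = function(s) remake::make(s)) {
  elapsed <- vapply(
    studies,
    function(s) system.time(build(s))[["elapsed"]],
    numeric(1)
  )
  # Slowest first, so the best candidates for optimisation are at the top
  sort(elapsed, decreasing = TRUE)
}

# Run from the repository root, with remake installed:
# time_builds(test_studies)
```

The result is a named vector of elapsed seconds per study, which should make it easy to see whether a change to the build actually helps.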

ehwenk commented 2 years ago

Some others that have specific types of errors/formats, but aren't terribly large:

Baker_2019 - 1 excluded taxon, 1 taxonomic update, substitutions
Tomlinson_2019 - value out of range