The above commit improves `format_sites`, leading to a faster build. A comparison before and after shows it reduces the build time for ANBG_2019 (the source with the most site data):
Before:

```
> system.time(remake::make("ANBG_2019"))
[ READ ] | # loading sources
< MAKE > ANBG_2019
[ BUILD ] ANBG_2019 | ANBG_2019 <- load_study("data/ANBG_2019/data.csv", ANBG_2019_config)
   user  system elapsed
 18.061   0.365  18.446
```
After:

```
> system.time(remake::make("ANBG_2019"))
[ LOAD ]
[ READ ] | # loading sources
< MAKE > ANBG_2019
[ BUILD ] ANBG_2019 | ANBG_2019 <- load_study("data/ANBG_2019/data.csv", ANBG_2019_config)
[ READ ] | # loading packages
   user  system elapsed
 12.399   0.406  12.826
```
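Not from the thread: a single `system.time()` run can be noisy, so here is a minimal sketch that averages several timed rebuilds, assuming `remake::delete()` is available to clear the cached target between runs.

```r
# Average several rebuilds so a before/after comparison isn't dominated
# by one noisy run; remake caches built targets, so delete first.
times <- replicate(5, {
  remake::delete("ANBG_2019")
  system.time(remake::make("ANBG_2019"))[["elapsed"]]
})
mean(times)
```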
Some info here to help with profiling. Complex studies that are good to run tests on (a rough timing sketch follows the list):

- Catford_2014 - complex custom R code
- Maslin_2012 - complex custom R code (old style), with some flowering times not properly read in; long format
- ANBG_2019 - long format; an example where some rows of data are not "traits" and therefore yield an "unsupported trait" error; lots of substitutions
- Cheal_2017 - currently lots of excluded observations; also lots of substitutions
- White_2020 - complete taxonomic changes
- Bloomfield_2018 - quite a few excluded numeric values; includes sites, date
- Duan_2015 - complex contexts
- Westoby_2014 - big collection of numeric traits, but no "issues"; includes sites
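As mentioned above, a rough sketch (hypothetical helper code, not part of the repo) for timing a clean rebuild of each candidate study, again assuming `remake::delete()` forces a rebuild:

```r
# Time a clean rebuild of each candidate study listed above,
# then rank from slowest to fastest.
studies <- c("Catford_2014", "Maslin_2012", "ANBG_2019", "Cheal_2017",
             "White_2020", "Bloomfield_2018", "Duan_2015", "Westoby_2014")

timings <- sapply(studies, function(s) {
  remake::delete(s)  # force a rebuild; remake skips up-to-date targets
  system.time(remake::make(s))[["elapsed"]]
})

sort(timings, decreasing = TRUE)
```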
Hi Lizzy, thanks. Can we avoid the really big datasets for this? And also datasets that are very stable (unlikely to be tweaked further)?
So the following seem good: Catford_2014, Duan_2015, Maslin_2012, Westoby_2014. Any others you want to add?
Actually, it might be good to include one large dataset, but not multiple.
Some others that have specific types of errors/formats, but aren't terribly large:

- Baker_2019 - 1 excluded taxon, 1 taxonomic update, substitutions
- Tomlinson_2019 - value out of range
In addressing #512, I noted that some of the tests were very slow. An initial commit identified the function `format_sites` as being very slow. I suggest running some profiling to identify the slowest parts, then seeing if we can optimise those.
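A minimal sketch of such profiling using base R's `Rprof()`, with ANBG_2019 as an example target:

```r
# Profile one study build, then list the functions that spend the
# most time in their own code (this is how format_sites stood out).
Rprof("build_profile.out", interval = 0.01)
remake::make("ANBG_2019")
Rprof(NULL)
head(summaryRprof("build_profile.out")$by.self, 10)
```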