haganjam / InvTraitR

Pipeline to assign biomass-length allometry equations to a taxonomic name based on taxonomic hierarchy and geographic (or environmental) proximity

something is painfully slow #20

Open black-snow opened 1 year ago

black-snow commented 1 year ago

Reading the rds files (?) is painfully slow and makes the tests take minutes to complete. Maybe we can speed things up.

I'll have to take a look at what's actually inside. Maybe we can go with, e.g., sqlite instead if it's just tabular data.

black-snow commented 1 year ago

@haganjam can you give me a rough description of what's inside the rds files? I'm not sure yet that they are the reason things are so slow - I'd have to do some profiling. Maybe it's some heavy computation instead.

Is it all just tables? If so, can't we just use CSV instead?
If computation is the bottleneck I'd have to do some digging to see where exactly and what to tweak.
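
A rough way to tell the two apart, as a minimal sketch: time the read and the computation separately with base R's system.time(). The data frame, the temp file and slow_step() below are made-up stand-ins, not the package's actual files or functions.

```r
# Hypothetical stand-in for one of the tabular .rds files.
tbl <- data.frame(id = seq_len(1e5), value = rnorm(1e5))
rds_path <- tempfile(fileext = ".rds")
saveRDS(tbl, rds_path)

# Hypothetical stand-in for the downstream computation.
slow_step <- function(x) {
  Sys.sleep(0.2)  # pretend this is expensive work
  nrow(x)
}

# Time reading and computing separately ("elapsed" is wall-clock seconds).
read_time <- system.time(dat <- readRDS(rds_path))[["elapsed"]]
compute_time <- system.time(n <- slow_step(dat))[["elapsed"]]

timings <- c(read = read_time, compute = compute_time)
print(timings)
```

If the read time turns out to be negligible, the file format is probably not worth changing. For a per-function breakdown, Rprof() would be the next step.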

MEMO: if rds is slow, maybe https://www.fstpackage.org/ can help

haganjam commented 1 year ago

The rds files actually hold a range of different object types. For example, freshwater_ecoregion_map.rds is a spatial polygon object in R. The ..._higher_taxon_matrices.rds files are effectively igraph objects, so igraph needs to be loaded when they are read in.

I used to store the ..._higher_taxon_matrices.rds as sparse matrices and then convert them to igraph objects, but that ended up taking up a lot of space and it was quite slow.

The equation_database and taxon_databases are just tables so they could be stored as .csv files.
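
For the purely tabular ones, a one-off conversion could look roughly like the sketch below. The table contents and column names are invented for illustration and are not the real equation_database schema.

```r
# Hypothetical stand-in for a tabular database such as equation_database.
equation_tbl <- data.frame(
  equation_id = 1:3,
  taxon = c("Daphnia", "Gammarus", "Chironomus"),
  stringsAsFactors = FALSE
)
rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")
saveRDS(equation_tbl, rds_path)

# One-off conversion: read the .rds and write it back out as .csv.
write.csv(readRDS(rds_path), csv_path, row.names = FALSE)

# Reading it back; column types are re-guessed from the text.
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)
```

One thing to keep in mind is that .csv does not store column types, so factors, dates and the like would need to be restored after reading.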

Intuitively, I doubt that the .rds extension is the problem. I think it's more likely that the tests take minutes to complete because of the bdc functions that clean and harmonise the taxon names. This generally takes quite a bit of time. In the tests, if we could only run those functions once or twice, I think it would become a lot faster.
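
One way to "only run those functions once" is to cache results per name, so repeated lookups in the tests are free. This is a minimal base R sketch; slow_harmonise() is a hypothetical stand-in for the bdc calls, not the actual cleaning code.

```r
# Hypothetical stand-in for a slow name-harmonisation call.
call_count <- 0
slow_harmonise <- function(name) {
  call_count <<- call_count + 1  # track how often the slow path runs
  Sys.sleep(0.1)                 # pretend this is expensive work
  toupper(name)
}

# Wrap a one-argument function so each distinct input is computed only once.
cache_by_name <- function(fn) {
  cache <- new.env(parent = emptyenv())
  function(name) {
    if (!exists(name, envir = cache, inherits = FALSE)) {
      assign(name, fn(name), envir = cache)
    }
    get(name, envir = cache, inherits = FALSE)
  }
}

cached_harmonise <- cache_by_name(slow_harmonise)

first <- cached_harmonise("Daphnia magna")   # slow: calls slow_harmonise
second <- cached_harmonise("Daphnia magna")  # fast: served from the cache
```

The memoise package offers the same idea off the shelf, if an extra dependency is acceptable.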

black-snow commented 1 year ago

Alright, I'll invest a minute in profiling when I get to it.

black-snow commented 1 year ago

bdc::bdc_query_names_taxadb seems to be slow - 17-23s on my machine for one test case.

Apparently missing entries (i.e. an exhaustive search) are the most expensive case, for example:

bdc::bdc_query_names_taxadb(sci_name=c('Triops granitica'), rank_name="Animalia", rank="kingdom")

/edit: let's see if there's an easy "fix" with https://github.com/brunobrr/bdc/issues/252 - otherwise I either have to dig into bdc or refactor, so that we can inject a test double.

black-snow commented 1 year ago

No quick fix. Let's mock it, except perhaps for one explicit test that uses the actual dependency.
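
A sketch of what injecting a test double could look like, assuming the lookup is passed in as an argument. clean_names_sketch() and fake_query() are hypothetical names, and the scientificName column is an assumption about the shape of the result, not a confirmed part of the bdc output.

```r
# Hypothetical wrapper: the lookup is passed in as an argument so tests can
# swap it out. The default is only evaluated if no query_fn is supplied,
# so bdc is not touched when a double is injected.
clean_names_sketch <- function(names, query_fn = bdc::bdc_query_names_taxadb) {
  result <- query_fn(sci_name = names, rank_name = "Animalia", rank = "kingdom")
  result$scientificName
}

# Test double: returns a canned table instantly instead of searching.
# The column name scientificName is an assumption about the output shape.
fake_query <- function(sci_name, rank_name, rank) {
  data.frame(scientificName = sci_name, stringsAsFactors = FALSE)
}

cleaned <- clean_names_sketch(c("Triops granitica"), query_fn = fake_query)
print(cleaned)
```

The one explicit test against the real dependency would then simply call the wrapper without query_fn.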

black-snow commented 1 year ago

Working on it in https://github.com/haganjam/FW_invert_biomass_allometry/tree/test_doubles
Tests for clean_taxon_names are now pretty much instantaneous.