parallelize agb calculations

cpiponiot commented 4 years ago

@ValentineHerr I think you already mentioned this, but now that the equation table has grown to 550 equations, the get_biomass function can run into memory problems on very large datasets (more than about 20,000 observations on my computer). I've tried to optimize the function as much as possible, but we need to keep a lot of information until the final calculation (all the weights, etc.), so the only way I see around this issue right now is to parallelize the agb calculation. This can be done by the user outside the function, or within the function itself. Implementing it within the function will take a little more time, and it isn't a top priority if we want to have something ready by September, but I wanted to check with you, @gonzalezeb @teixeirak @ValentineHerr, whether you think it would be useful at all. Otherwise we can just add an example of how to parallelize the calculation to the function description file.

teixeirak commented 4 years ago

I think that the example in the function description file is sufficient; more important to get this done than to perfect it. :-)

ValentineHerr commented 4 years ago

Telling the user that they might need to parallelize if they have a large data set is fine with me. I ended up splitting my data into 10 chunks and running the function separately on each of them.
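A minimal sketch of that chunking approach, for illustration only. `biomass_fn` is a hypothetical stand-in for `get_biomass()` (a toy formula, not an allodb equation), and the column names `dbh` and `genus` are assumptions; in real use the stand-in would be replaced by the actual call with its real arguments.

```r
# Hypothetical stand-in for get_biomass(); toy formula, not an allodb equation.
biomass_fn <- function(dbh, genus) {
  0.1 * dbh^2.5
}

# Toy data with assumed column names.
set.seed(1)
dat <- data.frame(
  dbh = runif(1000, 1, 100),
  genus = sample(c("Acer", "Quercus"), 1000, replace = TRUE)
)

# Assign each row to one of 10 contiguous chunks.
n_chunks <- 10
chunk_id <- cut(seq_len(nrow(dat)), breaks = n_chunks, labels = FALSE)

# Run the function on one chunk at a time, so only one chunk's
# intermediate objects need to be held in memory at once.
agb <- numeric(nrow(dat))
for (k in seq_len(n_chunks)) {
  rows <- which(chunk_id == k)
  agb[rows] <- biomass_fn(dat$dbh[rows], dat$genus[rows])
}
```

Because each result is written back to its original row positions, `agb` stays aligned with `dat` regardless of how the chunks are defined.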

gonzalezeb commented 4 years ago

I'm not sure, but maybe one of the memory problems is related to the large raster layer (koppenRaster) that is part of the get_biomass function. Why do we need it if we can get Köppen zones using the R package?

cpiponiot commented 4 years ago

I can try using the R package instead to see if it helps, but I think the main problem is the weight matrix (500 equations * N observations).
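As a rough back-of-the-envelope check, assuming the weight matrix is stored as a dense double-precision matrix (8 bytes per entry, which is how R stores numeric matrices), the size of a single copy can be estimated as below. The function's actual peak memory use is not measured here and would be higher if several intermediate objects of this shape exist at once.

```r
n_eq  <- 500     # equations, as quoted above
n_obs <- 20000   # observations, the size at which problems were reported

# One dense numeric matrix: 8 bytes per double-precision entry.
bytes <- n_eq * n_obs * 8
bytes / 1e6      # size of one copy, in megabytes (80)
```

The size grows linearly with N, so processing the data in chunks shrinks each matrix proportionally.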

gonzalezeb commented 4 years ago

It is much faster now. Last night I couldn't run test 3 from here, and now it only took 1-2 minutes.

ValentineHerr commented 4 years ago

I think there may be something broken now?

I get this issue:

Error in equation_id %in% equations_ids : object 'equation_id' not found

ValentineHerr commented 4 years ago

Never mind, I restarted my session and now it works...

cpiponiot commented 4 years ago

I added an example of this in the get_biomass() description file
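For readers of this thread, a sketch of what user-side parallelization could look like with the base `parallel` package. This is not the example from the description file: `biomass_fn` is a hypothetical stand-in for `get_biomass()`, and the data, column names, chunk count, and worker count are all assumptions.

```r
library(parallel)

# Hypothetical stand-in for get_biomass(); toy formula, not an allodb equation.
biomass_fn <- function(dbh, genus) 0.1 * dbh^2.5

# Toy data with assumed column names.
set.seed(1)
dat <- data.frame(
  dbh = runif(1000, 1, 100),
  genus = sample(c("Acer", "Quercus"), 1000, replace = TRUE)
)

# Split the rows into 4 contiguous chunks (a list of data frames).
chunks <- split(dat, cut(seq_len(nrow(dat)), breaks = 4, labels = FALSE))

# Start 2 worker processes and apply the function to each chunk.
# The function is passed as an extra argument so the workers receive it.
cl <- makeCluster(2)
res <- parLapply(
  cl, chunks,
  function(d, biomass) biomass(d$dbh, d$genus),
  biomass = biomass_fn
)
stopCluster(cl)

# Chunks are contiguous and returned in order, so this matches the row order of dat.
agb <- unlist(res, use.names = FALSE)
```

Each worker only holds the intermediate objects for the chunk it is processing, but all active workers use memory at the same time, so the number of workers and the chunk size both affect peak memory.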

ropensci / allodb

parallelize agb calculations #111