project-gen3sis / R-package

Repository of the R-packageGen3sis
https://project-gen3sis.github.io/R-package/
GNU General Public License v3.0
29 stars 9 forks source link

potential improvement of distance matrix calculation #60

Open ohagen opened 2 years ago

ohagen commented 2 years ago

Joe at Uni Bristol looked into ways of improving distance storage and calculations. I past here some text. Problem: I have sets of global climate data at a reasonably high spatial resolution (720x360), but when I first tried out gen3sis with these landscape grids, I quickly ran into vector size allocation problems. I've been running my simulations at a lower resolution (180x90) with great results. Possible general solution: Ways in which the simulation capacity could be boosted and I found a potential solution in the tools I was developing. Description: In short, it relies on representing the gridded landscape as a graph of local connections which needs much less storage space than a full sparse distance matrix for a landscape grid of a given size. Any pairwise distances that are needed in the other internal functions can then be found very quickly using the cppRouting R package, maintaining the efficiency of the simulation loop at run time. The graph method also has a few other potential benefits, which are what led me onto the problem of working with gridded landscapes in the first place I am still figuring out a few issues with my implementation (in particular, I am struggling with maintaining the optimality of the method in the tbdscan C++ functions), but if this would potentially be of interest to you, I would be very happy to send over some of the code I have put together. I have tried to make the edits as compatible with the current structure of gen3sis as possible, but they are still non-trivial so I thought it might be best to contact you via email, rather than submitting any edits via Github or similar.

ohagen commented 2 years ago

Joe, thank you very much!!!! Question, would this involve changing the landscape distance object? Are there further dependencies besides cppRouting?

benj919 commented 2 years ago

I agree that the current implementation of the distance matrices is the part of the simulation with the worst scaling behaviour.

Part of this idea has already been implemented. The distances can be saved as a sparse pair-wise matrix (local_distances). Currently we can either save the local distances or the full all-to-all distances. We can calculate limited distance matrices during a simulation run only (for the moment). This will be updated to allow pre-computing those partial distance matrices "soonish".

Calculating pair-wise distances on demand will have tremendous impacts on performance and/or require a quite elaborate caching scheme. The distance matrices are used for different things, obviously in the dispersal, but also for the clustering in both the evolution and speciation functions.

Switching over to such a scheme would completely replace the tdbscan function calls and "should" work with the cpp implementation for the distances by setting all inhabited cells as starting cells for the distance calculations.

It should be "straight forward" to change those implementations over to a graph based approach, but the computational overhead is going to be tremendously expensive as those calculations will have to either be cached or repeated multiple times for each species in the simulation.

If you (Joe) are interested in exploring this in detail I suggest we do a direct call to discuss this upfront?

ohagen commented 2 years ago

Tks Benji for jumping into this. We should have a meeting soon. I will let you know when.