prioritizr / benchmark

Benchmark the performance of exact algorithm solvers for conservation planning
GNU General Public License v3.0

README adequate? #3

Closed: jeffreyhanson closed this issue 3 years ago

jeffreyhanson commented 3 years ago

I'm still working on making sure that the benchmarking code actually works. In the meantime, @ricschuster, could you please take a look at the README and see if it makes sense? Is it missing anything? Is anything incorrect?

jeffreyhanson commented 3 years ago

Also, at the moment I assume we want to run the benchmarks with IBM CPLEX? As such, I've included the cplexAPI R package in the repository (so packrat installation will fail if IBM CPLEX isn't installed).
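
As a rough illustration (a hypothetical guard, not something in the repository), the setup could check for a CPLEX binary before restoring the packrat library, so the failure is explicit rather than a compile error partway through:

```r
# Hypothetical guard, assuming the CPLEX binary is on the PATH; installations
# in non-standard locations would need a different check.
if (!nzchar(Sys.which("cplex"))) {
  stop("IBM CPLEX does not appear to be installed; cplexAPI will fail to build.")
}
packrat::restore()
```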

ricschuster commented 3 years ago

Looks great! I only made a few small changes in 221b04f.

Using CPLEX in the benchmarks sounds like a good idea.

jeffreyhanson commented 3 years ago

Just to clarify, I've modified the benchmarks from your original code to resample (technically "aggregate") the planning units to varying resolutions (depending on the parameters). I thought this might provide more realistic datasets for the benchmarks? For example, a dataset containing a spatially contiguous set of planning units (e.g. no holes) would have spatial auto-correlation in costs and species distributions that might be partially obscured if we just subsampled planning units for the analysis? What do you think?
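
For illustration, a minimal sketch of that aggregation step (the file name and factors are assumptions, not the actual benchmark code):

```r
# Sketch only: aggregate a 100 m x 100 m cost raster to coarser resolutions,
# summing costs so each coarse cell holds the total of the fine cells inside it.
library(raster)

cost_100m <- raster("cost.tif")      # assumed input file
factors   <- c(10, 50, 100)          # i.e. 1 km, 5 km, and 10 km cells
cost_coarse <- lapply(factors, function(f) {
  aggregate(cost_100m, fact = f, fun = sum, na.rm = TRUE)
})
```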

I'm hoping this is computationally feasible, but I've only been debugging the code with a small subset of the data. To improve performance, I've written some wrapper code for the gdalUtils package, which can process large raster datasets much faster than the raster package (on my computer at least, probably because it only has 8 GB RAM). Additionally, since any coarser-scale planning unit (e.g. 5 km x 5 km) will be assumed to perfectly contain a set of 100 m x 100 m planning units, the total cost and amount of each species in each coarse-scale planning unit can be calculated by summing the 100 m x 100 m planning units inside it. This means that we can calculate the cost and species data aspatially using indices and tabular data wrangling (e.g. dplyr), which is much faster than trying to do this in a spatial context.
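
A minimal sketch of that aspatial bookkeeping, using synthetic data and illustrative names (the actual benchmark code may differ):

```r
# Map each fine cell to the coarse planning unit that contains it, then sum
# values per coarse cell with dplyr instead of spatial operations.
library(raster)
library(dplyr)

fine   <- raster(nrows = 100, ncols = 100, vals = runif(10000))  # synthetic costs
coarse <- aggregate(fine, fact = 10)                             # coarse template

# index of the coarse cell that contains each fine cell
idx <- cellFromXY(coarse, xyFromCell(fine, seq_len(ncell(fine))))

# tabular (aspatial) summation of fine-cell values per coarse planning unit
coarse_totals <- tibble(coarse_cell = idx, cost = values(fine)) %>%
  group_by(coarse_cell) %>%
  summarize(total_cost = sum(cost, na.rm = TRUE), .groups = "drop")
```

The same grouping would extend to the species amount columns.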

ricschuster commented 3 years ago

Oh, I see. Sorry for the oversight. I think the aggregation approach makes a lot of sense.

Happy to give it a try re: computational feasibility. I bet you are right with regard to speed, and if not, we can just adjust as needed.

Exciting to see this progress.

ricschuster commented 3 years ago

Also means we can ditch 221b04f

jeffreyhanson commented 3 years ago

Ok great - thanks! I do like the fact that the subsampling approach lets us specify the total number of planning units (rather than their size), which might be more useful to prioritizr users? I guess another option could be to subsample a spatially contiguous set of planning units (e.g. start from the north-east corner of the study area and work our way down)? But complexities in the relationship between the spatial distribution of costs and species across different parts of the study area might introduce biases in our attempt to find a relationship between "number of planning units" vs. "run time"?

jeffreyhanson commented 3 years ago

Ok - I'll undo it. Also, for future reference, could you please edit the Rmd file?

ricschuster commented 3 years ago

> Ok great - thanks! I do like the fact that the subsampling approach lets us specify the total number of planning units (rather than their size), which might be more useful to prioritizr users? I guess another option could be to subsample a spatially contiguous set of planning units (e.g. start from the north-east corner of the study area and work our way down)? But complexities in the relationship between the spatial distribution of costs and species across different parts of the study area might introduce biases in our attempt to find a relationship between "number of planning units" vs. "run time"?

We could play around with the planning unit (pu) size to get some reasonable pu numbers. The pu sizes would be a bit strange, but the pu numbers would potentially make more sense to users?

I think if we do that, aggregating should work well?

ricschuster commented 3 years ago

> Ok - I'll undo it. Also, for future reference, could you please edit the Rmd file?

Sorry about that. I just edited the README in GitHub and didn't think about the Rmd file.

jeffreyhanson commented 3 years ago

> We could play around with the planning unit (pu) size to get some reasonable pu numbers. The pu sizes would be a bit strange, but the pu numbers would potentially make more sense to users?
>
> I think if we do that, aggregating should work well?

That's a brilliant idea!! Although we might not be able to precisely get a specific number of planning units (e.g. we might only get 1030 instead of 1000 planning units), I reckon we could get close enough so that it's not too much of an issue?

ricschuster commented 3 years ago

Agreed. Close enough would be good enough I think. It would be mostly about ballpark/order of magnitude, at least in my opinion.

jeffreyhanson commented 3 years ago

Yeah, I'm trying to think of some way to automatically calculate this, but it's complicated by the fact that not every grid cell in the raster is actually a planning unit. I think a ballpark estimate will work great!
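
One possible ballpark approach (an illustrative sketch, not the repository's code) is to scale the target planning unit count by the fraction of non-NA cells when choosing an aggregation factor:

```r
# Rough estimate of an aggregation factor for a target number of planning
# units, correcting for raster cells that are NA (i.e. not planning units).
library(raster)

estimate_factor <- function(x, n_target) {
  p_valid <- 1 - (freq(x, value = NA) / ncell(x))  # proportion of cells that are planning units
  n_coarse <- n_target / p_valid                   # coarse cells needed so ~n_target are non-NA
  max(1L, round(sqrt(ncell(x) / n_coarse)))        # side length of the aggregation factor
}

# e.g. aiming for roughly 1000 planning units; the realised count will only be
# approximate (say 1030 rather than exactly 1000)
# f <- estimate_factor(cost_100m, 1000)
```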

jeffreyhanson commented 3 years ago

Since you've looked at the README and given it the OK, I'll close this issue now. Please feel free to reopen it if you can think of anything to improve? Once we've got some "real" runs, we could perhaps add a graph to the README or something.