Raster CRS and resolution problems?

oharac commented 5 years ago

More an issue for @Melsteroni - @bshalpern can probably skip for now...

For this project, I have been assuming that the resolution I used for the spp/biodiversity risk project would be a fine native resolution at which to calculate overlap of biodiversity risk and cumulative impacts. The 1 km Mollweide CRS we use for cumulative impacts is obviously great at the habitat/impact scale, but most native stressor layers are not at that fine resolution. The species ranges are polygons so we can rasterize them at whatever resolution we like, but 1 km resolution implies a precision to those ranges that is not really defensible. Note that for calculating impacts of stressors on habitats, and therefore CHI, the finer resolution is defensible and important, but probably way overkill for this project. So 10 km x 10 km seems fair, and far more analytically tractable than the native 1 km Mollweide.

I've got some scripts that reproject stressors to the 10 km Gall-Peters from the 1 km Mollweide, that start by reprojecting the GP cell IDs to Mollweide, and then using those cell IDs to aggregate the Mollweide values (mean for now, but other aggregations are easy) to populate the GP cells. I noticed a weird effect on the right hand side:

which results in the notch on the side of the reprojected stressor layers:

Digging deeper, I think this is an artifact from the projection of Gall-Peters to Mollweide (the cell ID raster). WGS84 to Mollweide seems to work fine. But GP to WGS84 to Moll is two reprojections, including an intermediate non-equal area, so I think a lot of info would be lost.

One solution might be: aggregate the ~1 km Mollweide CHI maps by some factor (5x might be adequately tractable, so about 4.7 km on a side; 8 or 10x would be easier to analyze but less accurate reprojection; 11x would be closest to resolution of the BD risk original project) then rerun all the biodiversity maps in that larger Mollweide resolution, and do the calculation that way.

@Melsteroni thoughts on all of this?

BTW: I've got some thoughts on using matrices to do cool calculations, haven't benchmarked them against just laying rasters on other rasters, but I think it might be faster. However, not sure how well those techniques will work with Mollweide because of all that dead space in the corners.

oharac commented 5 years ago

OK, I redid the whole biodiversity risk project with updated species ranges and all in Mollweide projection at a ~100km^2^ resolution (aggregated the native CHI data by a factor of 11x). There were some annoying problems along the way - including poorly constructed IUCN range maps that when reprojected to Mollweide had major issues. I fixed those by using smoothr::densify to add vertices to the polygons, so the reprojection had something to work with.

Anyway, looks like the methods are working more or less, currently just using faked stressor weights for each species on each stressor, but will start pulling some real weights from IUCN threat analyses.

Would be good to meet soon to show what I'm doing and get some feedback. @bshalpern do you have some availability in the next couple of weeks?

Melsteroni commented 5 years ago

Thanks for the update! I am super excited to hear more! I am free almost anytime next week (9-13). I'll be gone for most of 17th and all of the 18th...but otherwise around that week.

I agree that a coarser resolution is better than 1km for this project (both in terms of being a more honest depiction of accuracy and analytical speed). That is very strange about the projections. All reprojections are approximations and can cause weirdness...but that seems particularly strange.

With your method: did the deadspace at the corners of the mollweide project cause analytical weirdness?

oharac commented 5 years ago

re: reprojection, I didn't figure out why the problem was occurring... it's as if the Gall-Peters rasters don't go all the way to +180 (as an equal area projection, the units aren't degrees, so tough to tell) - but visually it looks fine.

Anyway. For the Mollweide, I'm reading in the rasters, turning into value vectors for either dataframe or matrix operations, and keeping only those cells that match the ocean area raster - this drops all the NA cells (both corners and land), making the matrices and dataframes 40% smaller and easier to work with. Then to recreate the rasters after analysis, I can just stick the resulting values back into the the ocean area raster, in the same order, and voilà! instant map.

The main calc I'm trying to do with this benchmarking is CHI = diag(X(WY)^T), where X is a matrix of species presence (rows = species IDs, cols = cell IDs, value = 1 or 0, i.e. maps for all species), W is a matrix of species sensitivity to stressors (rows = species IDs, cols = stressors, value = sensitivity), and Y is matrix of stressor level (rows = stressor, cols = cell IDs, value = level of that stressor in that cell, i.e. all stressor maps). The WY effectively results in global maps of the sensitivity-weighted sum of stressors for each species (i.e. a map of potential cumulative impacts on each species instead of habitat, potential because does not regard species presence, just stressor levels multiplied by species sensitivity). Pre-multiplying by X sums the potential cumulative impacts only for cells where the species is present. The diagonal is the cumulative impacts for each species on that species' own range; the off-diagonals would indicate the stressor effects (weighted for one species) on other species present, which I don't think makes any useful sense.

The result CHI is a vector of the sum of species-specific CHI across each species' entire range. Not sure this makes sense as a metric, as it loses all spatial information and (all else equal) will result in higher impacts on larger-ranged species. But summarizes nicely all the inputs into a single number for quick comparison.

Trying to benchmark the matrix operations against the dataframe operations, with a sample of 30 species and a rough pass at crosswalking IUCN threat levels as sensitivity to CHI stressors... since we have 7000+ maps (as of version 2019-2) these timing differences will be useful to understand.

the species map matrix is slower to create (3.4 s for matrix vs. .26 s for dataframe) because the spp maps are saved as dataframe-ready .csv files, about a 15x difference, which will scale with the number of species we run.
the stressor map matrix is slower to create (26 s for matrix vs. 13 s for dataframe). This doesn't need to scale - this is all the stressors already.
for calculations, the matrix method is much faster, even with the data.table::merge() - this will also scale with the number of species run.
- 5.3 s for matrix vs. 8.7 s for dataframe
- and impressively, 6.3 seconds for matrix vs. 306 seconds for dataframe, a 50x improvement.
the results are identical except for one species out of 30, which is really weird - I don't yet know why the discrepancy or which (if either) is "correct"...

So basically the savings on matrix operations in the last step make me want to go that route. I think I have a solid method nearly ready to go once I figure out the weird discrepancy!

oharac commented 5 years ago

discrepancy solved - I had filtered the maps for the matrix to exclude "extinct" and "possibly extinct" regions, but didn't do the same for dataframe. Filtering the dataframe the same way, I get the same results. Yay!

bshalpern commented 5 years ago

not clear if you want/need me to weigh in on any of this; if so I would probably need to meet up to talk through it all in more detail to make sure I fully understand things.

oharac commented 5 years ago

Thanks @bshalpern - this is largely me documenting some of the technical issues along the way, throwing a few technical questions toward @Melsteroni, and incidentally letting you and @Melsteroni know where I'm at. I do have some higher-level questions that need to be hammered out, so I'll set up a meeting next week where you and Mel and I can chat about some of the big picture ideas.

oharac / bd_chi

Raster CRS and resolution problems? #3