It takes forever to eraseoverlap for global dataset.

sahmoli commented 2 years ago

Could the code (especially the eraseoverlap part) be parrelelled in R?

jeffreyhanson commented 2 years ago

Hi @sahmoli,

Thanks for reaching out!

Yeah, I don't recommend trying to use erase overlaps with the global datset unless you really need it (and don't mind waiting ages for it to complete). For example, if you plan on rasterizing the protected area data after cleaning it, then you don't need to worry about erasing the overlapping parts of protected areas. Additionally, if you want a vector layer showing protected area boundaries without any overlaps -- and don't need the attribute data for each protected area individually -- then you can (1) clean that data without the erase overlaps step, and (2) dissolve all the protected areas togeather (removing overlaps) using wdpa_dissolve(). To achieve this, you can tell the wdpa_clean() function to skip the erase overlaps step by setting erase_overlaps = FALSE.

Also, regarding your question about paralellizing the code, I don't think it could be easily parallelized because the prcoedure for erasing overlaps is sequential. This is because the algorithm works by iteratively removing overlapping bits---so the part of a protected area that gets erased depends on what protected areas have previously been processed. For reference, the relevant code is here: https://github.com/prioritizr/wdpar/blob/master/R/st_erase_overlaps.R. If you can come up with a way of parallelizing this, I'd be happy to review a PR?

Does that help? Let me know if you have any further questions?

sahmoli commented 2 years ago

Thank you so much for the quick reply. I am doing a global analysis using wdpa. I need to calculate the cropland area within each PA, and analyze them with different attributes of PA. So that means I need the vector layer with attributes.

I runned the code for global analysis several times. It works well in the begining, but it usually has some error after two or three weeks. It only occupied 11% of CPU and 16% of RAM, so I wonder whether there is a way to accelerate it.

You are right, the procedure seems to be sequential. I have not figure out how to do it in parrelell.

Do you have any other suggestions to get a global PA after erase overlap?

jeffreyhanson commented 2 years ago

No worries!

Oh I see - do you know beforehand which PA attributes you want to analyze? E.g., if you want to examine cropland area within different IUCN categories, then you could (1) run the cleaning process with erase_overlaps = FALSE, (2) split the single WDPA dataset/object into seperate datasets/objects based on the different IUCN categories, (3) dissolve each of the IUCN category datasets seperately, (4) combine (e.g. rbind) the seperate datasets/object into a single dataset/object (ensuring that data are sorted according to how you want to deal with overlaps), and (5) then use the st_erase_overlaps() function to remove overlapping areas. This would be faster/robuster because the erase overlaps function would only iterate through one multi-polygon per IUCN category, instead of a multi-polygon for each and every single protected area.

sahmoli commented 2 years ago

Wow, super smart! Yes, I want to analyze the IUCN categories.

I will try this now, and let you know how it goes.

Thank you so much

jeffreyhanson commented 2 years ago

Excellent - great to hear we might have a solution!

Yeah, please let me know if this doesn't work and we can brainstorm some more on how to get this working?

I think I've answered all your qestions for now so, if you don't mind, could we please close this issue? I use open issues to keep track of which things on GitHub I need to respond to or focus on. Please feel free to re-open this issue, or open a new one, if you have any further questions?

sahmoli commented 2 years ago

Sounds great.

jeffreyhanson commented 2 years ago

Awesome - thanks!

jeffreyhanson commented 2 years ago

Yeah, that's what I would expect. If we dissolve an sf object, then this is spatially merging all the the data togeather (effectively the same as sf::st_union() by with extra stuff to help avoid geometry issues). So, we uf subset the global PA data to extract IUCN category "Ia" protected areas, and then dissolve that subset, we should get a new sf object with a single row that has the spatial boundaries for all category `"Ia" protected areas (with no overlaps between them). If I understand correctly, this isn't an issue though, because you're only interested in IUCN categories and not other attributes? Or maybe I'm misunderstanding something?

sahmoli commented 2 years ago

Yeah, you are right. I will let it you know how it goes. You could close it now. I guess the whole process needs hours.

jeffreyhanson commented 2 years ago

Ah ok - sounds good - thanks!

sahmoli commented 2 years ago

I realized that this method might have potential problems. In one of the following analyses, I need the variables of the status_yr and GIS_area of the PA, which are not available following the current method. If I want to keep multiple attributes of each PA after erase overlap of global PA dataset., are there any other solutions?,

jeffreyhanson commented 2 years ago

Hmm, if you want to keep track of the (i) year each PA was established and (ii) IUCN category of each PA, you could split the full dataset into multiple subsets based on different combinations of STATUS_YR and IUCN_CAT and then apply the procedure we talked about previously to dissolve and then recombine the data. Since each combination of STATUS_YR/IUCN_CAT is processed seperately, it might be possible to speed this up using parallel processing.

I can't think of a way to retain the GIS_AREA of each protected area though. Since you want to account for the overlapping bits of protected areas, maybe it would be better to manually calculate the total area of each protected area after the removing the overlaps (e.g. using sf::st_area())? Otherwise, if you use the GIS_AREA column, then that will not match the updated spatial extent of each PA after removing overlaps.

Also, since you're looking at overlap with agricultural areas, may I ask what data/resolution you're using for the agricultural areas? E.g., if the agricultural data are in raster format, then maybe there could be some way to rasterize the WDPA data at some stage in the analysis to reduce computational burden?

sahmoli commented 2 years ago

You are right, I do not need the GIS_AREA which could be calculated later. I plan to rasterize the layer after erase overlap.

After I filter the global PA dataset with GIS_AREA > 1km2 and STATUS_YR between a range, there are only 85836 obs left. It already runs for a whole day and it seems that it will finish in 2 days.

I will try the STATUS_YR/IUCN_CAT combination in parrellel in another PC, to see how long it will take. I will let you know, how it goes.

Thank you so much.

jeffreyhanson commented 2 years ago

No worries! Oh I see. Yeah, if you plan on rasterizing the WDPA data, then I don't think you need to worry about erasing overlaps? This is because the rasterization process (e.g. if using ArcGIS, raster::rasterize(), terra::rasterize(), fasterize::fasterize(), or gdalUtils::gdal_rasterize()) will assign values to the output raster based on whether or not the geometries overlap with raster cell? I guess that might not work if you need to rasterize a specific field in the WDPA layer, or compute percent coverage of each raster cell though?

sahmoli commented 2 years ago

Yes, I want to rasterize the STATUS_YR, so I need to erase overlap before rasterize. right?

jeffreyhanson commented 2 years ago

Hmm, just to check, do you want (i) a single-layer raster where each grid cell contains a value with STATUS_YR value, or (ii) a multi-layer raster (one layer for each STATUS_YR) indicating whether a PA was established within each grid cell in that year (i.e. grid cell values contain 0s and 1s)?

sahmoli commented 2 years ago

I just sent you an email ( @.*** ), could we have a voice chat on this?

On Wed, Mar 23, 2022 at 9:41 AM Jeff Hanson @.***> wrote:

Hmm, just to check, do you want (i) a single-layer raster where each grid cell contains a value with STATUS_YR value, or (ii) a multi-layer raster (one layer for each STATUS_YR) indicating whether a PA as established within the grid cell in that year (i.e. grid cell values contain 0s and 1s)?

— Reply to this email directly, view it on GitHub https://github.com/prioritizr/wdpar/issues/43#issuecomment-1075820686, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAXIIQ4E2JELAMBEX7EDRKTVBJZGNANCNFSM5RGJFVNQ . You are receiving this because you were mentioned.Message ID: @.***>

sahmoli commented 2 years ago

I think I want a single-layer raster where each grid cell contains a value with STATUS_YR value

jeffreyhanson commented 2 years ago

Sorry - I've been stuck in a meeting for the last while. Yeah, I can chat tomorrow - I'll respond to your email.

jeffreyhanson commented 2 years ago

Ah - yeah if you want a single-layer raster with grid cells containing STATUS_YR - then that makes things a lot easier, because we don't neccesarily have to worry overlapping protected areas. To clarify, in cases where you have multiple protected areas overlapping with a single grid cell, how should this be handled? E.g., would you want the minimum value (corresponding to earliest established protected area in the grid cell) or the maximum value (corresponding to the most recently established protected area in the grid cell)?

jeffreyhanson commented 2 years ago

I would suggest to first try using fasterize to rasterize the STATUS_YR values (https://www.rdocumentation.org/packages/fasterize/versions/1.0.3/topics/fasterize). If that doesn't work, then maybe try gdal_rasterize (https://www.rdocumentation.org/packages/gdalUtils/versions/2.0.3.2/topics/gdal_rasterize) (I find this works well for very large datasets).

jeffreyhanson commented 2 years ago

I think we resolved this issue, so I'll close this now - but please feel free to reopen it if you have any further questions.

prioritizr / wdpar

It takes forever to eraseoverlap for global dataset. #43