lynker-spatial / hfsubsetCLI

Hydrofabric Subsetter
https://www.lynker-spatial.com/hydrofabric/hfsubset/__docs__/
GNU General Public License v3.0

catchment weights #28

Closed JordanLaserGit closed 8 months ago

JordanLaserGit commented 10 months ago

In order to create nextgen forcing files, weights (indices) are calculated from the hydrofabric and national water model files. Currently, this is done by rasterizing each catchment, which is a time-consuming process. Would subsetting from a conus_weight.json file be a better solution? Would hfsubset be a good place to do this? These weights might change with each hydrofabric release, so it may be nice to have the weights computed automatically along with the geopackages.
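For a sense of how these weights get used downstream, here's a minimal sketch (file names and the variable/dimension names are hypothetical; the `[[rows], [cols]]` layout matches the weight files shown later in this thread):

```python
import json

import xarray as xr

# hypothetical file names; entries assumed to look like
# {"cat-18": [[row, row, ...], [col, col, ...]], ...}
weights = json.load(open("01_weights.json"))
ds = xr.open_dataset("nwm_forcing.nc")

rows, cols = weights["cat-18"]
# pull the grid cells this catchment touches and average them
cells = ds["RAINRATE"].isel(time=0).values[rows, cols]
print(cells.mean())
```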

program-- commented 10 months ago

@JordanLaserGit I can't say for certain whether this would fit into hfsubset or some other utility, but this is a good place to start that discussion since it sounds like a reasonable need. I'm in the process of doing a small rewrite of the hfsubset service to an actual API to address concurrency issues (which is why there's been a lack of updates).

I haven't had the chance to generate forcing files, only use them. Would you be able to give me an example of the weights files?

JordanLaserGit commented 10 months ago

https://ngenresourcesdev.s3.us-east-2.amazonaws.com/01_weights.json
https://ngenresourcesdev.s3.us-east-2.amazonaws.com/nextgen_01.gpkg

@program-- Here's a weight file I generated a few months ago, along with the geopackage it came from. v20 hydrofabric, I believe. This isn't an immediate need; I just figured these weight files should be standardized and live somewhere the community can access readily. If you're curious to see the implementation in the forcing generation, you can find that here: https://github.com/CIROH-UA/ngen-datastream/tree/main/forcingprocessor

program-- commented 10 months ago

@JordanLaserGit Is this file generating the weights? If so, I think it's possible to integrate something similar on the service-side (or maybe client-side? since that'll let the user pick what forcing to use?).

For now, let's keep this issue open in case of further discussion, and I'll note the requirement as an addition to the hfsubset API service. 😄

JordanLaserGit commented 10 months ago

@program-- Yup, that'll create the weights. Assuming we do something like run that script once on CONUS and then subset the resulting file, the speed of that script probably isn't too big of an issue. Though if we want that script to run upon user request, we may want to optimize it. CONUS took something like 18 hours (with no parallelization whatsoever).

Sounds good, thanks for taking a look at this!

JordanLaserGit commented 10 months ago

Ask and ye shall receive! @JoshCu just sped things up. https://github.com/CIROH-UA/ngen-datastream/pull/36

JoshCu commented 10 months ago

In theory a further 2.5x speedup is possible by using geometry_windows to rasterize only a subset of the whole region. But I'm having some interesting issues with transforms: depending on whether I open the source NetCDF with rasterio or with xarray, the y axis is flipped and there's some rounding error by the looks of it? Entirely possible I'm doing something weird though, I don't have much experience with this. code to generate incorrect weightings quickly

rasterio using netcdf4 (I think)

```
| 1000.00, 0.00,-2303999.25|
| 0.00,-1000.00, 1919999.62|
| 0.00, 0.00, 1.00|
```

xarray (definitely using netcdf4)

```
| 1000.00, 0.00,-2303999.25|
| 0.00, 1000.00,-1920000.38|
| 0.00, 0.00, 1.00|
```

```
Starting at 2024-01-11 13:58:41.329457
Finished at 2024-01-11 13:58:51.810312
Total time: 0:00:10.480928
```
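For anyone reproducing this comparison, a sketch of where the two transforms come from (file name is hypothetical; NWM grids store ascending y coordinates, which is the flip):

```python
import rasterio
import xarray as xr
from affine import Affine  # ships as a rasterio dependency

path = "forcing.nc"  # hypothetical NWM forcing file

# rasterio reports GDAL's north-up transform: negative dy, origin at the
# top-left corner
with rasterio.open(f"netcdf:{path}:RAINRATE") as src:
    print(src.transform)

# the y coordinate stored in the NetCDF ascends, so a transform built from
# the coordinate arrays has positive dy and a bottom-left origin; the
# coordinates are cell centers, hence the half-cell shift to the corner
ds = xr.open_dataset(path)
x, y = ds["x"].values, ds["y"].values
dx, dy = x[1] - x[0], y[1] - y[0]
print(Affine(dx, 0.0, x[0] - dx / 2, 0.0, dy, y[0] - dy / 2))
```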

current weights

```json
"cat-18": [[2421, 2422, 2422, 2423, 2423, 2424, 2424, 2424, 2424, 2424, 2425, 2425, 2425, 2425, 2425, 2426, 2426, 2426, 2426, 2426, 2426, 2427, 2427, 2427, 2427, 2427, 2427, 2427, 2427, 2428, 2428, 2428, 2428, 2428, 2428, 2428, 2428, 2428, 2429, 2429, 2429, 2429, 2429, 2429, 2430, 2430, 2431, 2431], [4323, 4323, 4324, 4323, 4324, 4322, 4323, 4324, 4325, 4326, 4322, 4323, 4324, 4325, 4326, 4322, 4323, 4324, 4327, 4328, 4329, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4329, 4324, 4325, 4326, 4327, 4328, 4329, 4327, 4328, 4327, 4328]]}
```

weights with rasterio transform (y-axis flip has been corrected for)

```json
"cat-18": [[2432, 2433, 2433, 2434, 2434, 2435, 2435, 2435, 2435, 2435, 2436, 2436, 2436, 2436, 2436, 2437, 2437, 2437, 2437, 2437, 2437, 2438, 2438, 2438, 2438, 2438, 2438, 2438, 2438, 2439, 2439, 2439, 2439, 2439, 2439, 2439, 2439, 2439, 2440, 2440, 2440, 2440, 2440, 2440, 2441, 2441, 2442, 2442], [4323, 4323, 4324, 4323, 4324, 4322, 4323, 4324, 4325, 4326, 4322, 4323, 4324, 4325, 4326, 4322, 4323, 4324, 4327, 4328, 4329, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4329, 4324, 4325, 4326, 4327, 4328, 4329, 4327, 4328, 4327, 4328]]}
```

The column indices are all shifted by 11, but otherwise the result is the same.

If those rasterio weightings are correct, then it only takes ~10 minutes to create CONUS weightings.

program-- commented 10 months ago

@JoshCu For my own purposes, and to better integrate with hfsubset's subsetting code, I rewrote the weights generator in R... though if I throw it on the client-side, I guess I'll have to rewrite it again in Go.

R version

```r
#!/usr/bin/env Rscript
gpkg_path <- "..."
grid_path <- "..."
output    <- "..."

# microbenchmark::microbenchmark({
grd <- terra::rast(paste0(
  "netcdf:", grid_path, ":RAINRATE"
))

gpkg <- terra::vect(gpkg_path, "divides") |>
  terra::project(terra::crs(grd))

weight_rast <- terra::rasterize(
  x          = gpkg,
  y          = grd,
  field      = "divide_id",
  background = NA_character_,
  touches    = TRUE
) |>
  terra::flip(direction = "vertical")

weights <- terra::as.data.frame(weight_rast, cells = TRUE) |>
  dplyr::mutate(
    row = terra::rowFromCell(weight_rast, cell),
    col = terra::colFromCell(weight_rast, cell)
  ) |>
  dplyr::select(-cell) |>
  dplyr::arrange(divide_id, row, col)

split(weights, weights$divide_id) |>
  lapply(FUN = function(part) {
    setNames(as.list(part[, c("row", "col")]), c())
  }) |>
  jsonlite::write_json(output)
# }, times = 5L)
```

Which seems to be a decent speed-up, if I'm interpreting your results correctly:

```
Unit: seconds
      min       lq    mean   median       uq      max neval
 4.356602 4.376032 4.75269 4.406561 4.420441 6.203815     5
```

for VPU 01. However, I'm experiencing something similar to your issue, where my coordinates are (seemingly) slightly off. rasterio uses GDAL under the hood, just like terra in my case (hence why I needed to flip my raster too), and I know that xarray (particularly rioxarray) and GDAL perform operations differently, so that might be the root of the issue. Whether the weights are correct or not... well I guess that's up to @JordanLaserGit or someone else haha

I've attached my example weights output from R here: weights-r.json

JoshCu commented 9 months ago

@program-- I'm just running bash `time` and Python prints like a caveman, but I'm pretty sure your implementation is way faster. Especially if that's single-process; R's pretty alien to me so I can't tell. Am I right in assuming that https://lynker-spatial.s3.amazonaws.com/v20/gpkg/nextgen_01.gpkg is VPU 01?

VPU 01 on 56 cores: ~12.5s. I had no idea you could have folding sections in md, I'll be overusing them from now on.

```bash
Starting at 2024-01-11 22:12:28.170569
Time spent opening files 0:00:00.344885
Total processing time: 0:00:08.741575
Total writing time: 0:00:00.318865
Total time: 0:00:11.566039

real    0m12.570s
user    6m55.572s
sys     0m12.635s
```

We do seem to have different weightings outputs. I'm not sure what the best way to diff two large JSON files is, but they're different sizes and grep spot checks show different results.
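One low-tech way to diff them (a sketch; the filenames are placeholders) is to load both files and compare per-catchment:

```python
import json

# hypothetical file names for the two outputs being compared
a = json.load(open("weights_xarray.json"))
b = json.load(open("weights_rasterio.json"))

only_a = a.keys() - b.keys()
only_b = b.keys() - a.keys()
mismatched = [k for k in a.keys() & b.keys() if a[k] != b[k]]
print(f"{len(only_a)} cats only in A, {len(only_b)} only in B, "
      f"{len(mismatched)} with different indices")
```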

xarray transform

```json
"cat-18": [[2421, 2422, 2422, 2423, 2423, 2424, 2424, 2424, 2424, 2424, 2425, 2425, 2425, 2425, 2425, 2426, 2426, 2426, 2426, 2426, 2426, 2427, 2427, 2427, 2427, 2427, 2427, 2427, 2427, 2428, 2428, 2428, 2428, 2428, 2428, 2428, 2428, 2428, 2429, 2429, 2429, 2429, 2429, 2429, 2430, 2430, 2431, 2431], [4323, 4323, 4324, 4323, 4324, 4322, 4323, 4324, 4325, 4326, 4322, 4323, 4324, 4325, 4326, 4322, 4323, 4324, 4327, 4328, 4329, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4329, 4324, 4325, 4326, 4327, 4328, 4329, 4327, 4328, 4327, 4328]]
```

rasterio transform

```json
"cat-18": [[2432, 2433, 2433, 2434, 2434, 2435, 2435, 2435, 2435, 2435, 2436, 2436, 2436, 2436, 2436, 2437, 2437, 2437, 2437, 2437, 2437, 2438, 2438, 2438, 2438, 2438, 2438, 2438, 2438, 2439, 2439, 2439, 2439, 2439, 2439, 2439, 2439, 2439, 2440, 2440, 2440, 2440, 2440, 2440, 2441, 2441, 2442, 2442], [4323, 4323, 4324, 4323, 4324, 4322, 4323, 4324, 4325, 4326, 4322, 4323, 4324, 4325, 4326, 4322, 4323, 4324, 4327, 4328, 4329, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4329, 4324, 4325, 4326, 4327, 4328, 4329, 4327, 4328, 4327, 4328]]
```

terra transform

```json
"cat-18":[[2424,2425,2425,2425,2426,2426,2426,2426,2426,2427,2427,2427,2427,2427,2428,2428,2428,2428,2428,2428,2428,2429,2429,2429,2429,2429,2430,2430,2430,2431,2432],[4324,4324,4325,4326,4323,4324,4325,4326,4327,4324,4325,4328,4329,4330,4323,4324,4325,4326,4327,4328,4329,4326,4327,4328,4329,4330,4328,4329,4330,4329,4329]]
```

It would be easier to see if I could figure out how to wrap lines in code blocks. I seem to have double the number of coordinates for the catchment. Maybe the all-touched implementation is different too?

RTI have a weightings generator we could compare to, but I think I'll leave that to someone whose understanding extends beyond my insightful "the numbers are different" comments.
weights from nextgen_01.gpkg -> weights_01.json

hellkite500 commented 9 months ago

@JordanLaserGit @JoshCu https://github.com/isciences/exactextract just recently got python bindings integrated (officially)... Check it out for weight generation, it's super fast and accurate. (And likely how the R code mentioned above is so quick 😉)
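A minimal sketch of those bindings (untested; the paths are placeholders, and the `"cell_id"`/`"coverage"` operations and `include_cols` argument follow the exactextract Python docs):

```python
import geopandas as gpd
from exactextract import exact_extract

# divides layer from the hydrofabric (assumed already in the grid's CRS)
divides = gpd.read_file("nextgen_01.gpkg", layer="divides")

# per divide: the intersecting grid cells and the fraction of each cell
# covered -- the raw material for a weights file
weights = exact_extract(
    "netcdf:forcing.nc:RAINRATE",  # hypothetical NWM forcing grid
    divides,
    ["cell_id", "coverage"],
    include_cols=["divide_id"],
)
```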

program-- commented 9 months ago

> @program-- I'm just running bash `time` and Python prints like a caveman, but I'm pretty sure your implementation is way faster. Especially if that's single-process; R's pretty alien to me so I can't tell. Am I right in assuming that https://lynker-spatial.s3.amazonaws.com/v20/gpkg/nextgen_01.gpkg is VPU 01?
>
> ...

Yeah, VPU 01 is that link you sent (err, I might have used v20.1, but they should be effectively the same for this...). The size difference is most likely from the formatting between the JSON files; minifying sometimes has a drastic effect, and jsonlite does it by default.

It's interesting to see that all three generators have different results 😅 maybe @mikejohnson51 can weigh in (haha) to verify whether my R implementation is correct, since he's better with gridded data than I am (if you have time, Mike!).

> @JordanLaserGit @JoshCu https://github.com/isciences/exactextract just recently got python bindings integrated (officially)... Check it out for weight generation, it's super fast and accurate. (And likely how the R code mentioned above is so quick 😉)

How dare you, I did it by hand 😛 But that would be useful; I actually didn't know you could generate weights with exactextract, haha, so I'll have to check that out. I've only used it a bit.

mikejohnson51 commented 9 months ago

Hi all!

Yes, we used to deliver the catchment weights for a few forcing products with each release as a forcing_weights.parquet file. The code to do this is still around. We did this with a slightly modified version of exactextract (used in zonal). If you want to use Python, check out the exactextract bindings that Dan Baston is working on through NASA (as @hellkite500 mentioned) or gdptools by Rich McDonald (@rmcd-mscb). In R (as @program-- did), terra can work, but zonal and exactextractr will be faster and more accurate.

The reason it is more accurate is that it takes into account partial coverages (e.g. a polygon covers 1/2 a cell).
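A toy illustration of the difference, with made-up numbers:

```python
import numpy as np

# three grid cells intersecting one catchment (hypothetical values)
values = np.array([2.0, 4.0, 6.0])     # grid values (e.g. RAINRATE)
coverage = np.array([1.0, 0.5, 0.25])  # fraction of each cell inside the polygon

print(values.mean())                               # 4.0  -- all-touched weighting
print((values * coverage).sum() / coverage.sum())  # ~3.14 -- coverage-fraction weighting
```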

We stopped sharing those as the forcing workstream became more mature and all groups (forcing, model engine, etc.) wanted to do it themselves.

Overall, I am not opposed to reintroducing the weights so that they would be accessible with hfsubset (as a layer) if it is of high value and we can determine one (or two) primary datasets. Is AORC the main dataset of interest, @JordanLaserGit?

JordanLaserGit commented 9 months ago

The short term goal is to get daily ngen outputs via ngen-datastream. In this, a forcings engine takes in the weights as an input. I've already generated the weights for v2.1, and we will re-use the same file until the next hydrofabric version comes out. So a CONUS weights file that could be subsetted would be the ideal dataset. Not a pressing need though, as ngen-datastream is still under development.

The primary motivation here is standardizing the catchment weights. We can all agree on using the same tool (like exactextract), but it seems like the most robust way would be to generate the file once and have it live somewhere everyone can reference. Otherwise we may be setting people up for the problem where comparing two research projects over the "same" catchment yields different results just because weight generation wasn't consistent.

mikejohnson51 commented 9 months ago

Makes perfect sense to me. Remember, though (and I'm sure you've got this), a weights file will only be valid for one grid (same grid size and CRS).
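One way to make that constraint self-enforcing (a sketch; it assumes the weights file embeds the grid metadata it was built against, which the files in this thread don't do yet):

```python
import json

import rasterio

with rasterio.open("netcdf:forcing.nc:RAINRATE") as src:  # hypothetical path
    grid = {
        "crs": src.crs.to_string(),
        "transform": list(src.transform)[:6],
        "shape": [src.height, src.width],
    }

meta = json.load(open("weights.json"))  # hypothetical weights file with a "grid" key
assert meta.get("grid") == grid, "weights were generated against a different grid"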

We can add this to the releases. Would you mind adding an issue to the Hydrofabric repo to track this with a pointer to a sample forcing file?

JoshCu commented 9 months ago

@hellkite500 exactextract looks great! I didn't realise quite how low-resolution the NetCDFs are compared to the water basin geometries until I overlaid them just now. Do you know if there's a recommended method to parallelise it? I think it's caching the % of each pixel covered by the geometries, so I've been chunking by geometry rather than by time to prevent recalculating the polygons. I'm just using Python's multiprocessing pool as it's what I'm familiar with, but that's probably not the best way of doing it?
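For what it's worth, a sketch of the chunk-by-geometry approach (untested; the paths, 56-core count, and `exact_extract` call are assumptions based on its docs):

```python
from multiprocessing import Pool

import geopandas as gpd
import rasterio
from exactextract import exact_extract

GPKG = "nextgen_01.gpkg"            # hypothetical paths
GRID = "netcdf:forcing.nc:RAINRATE"

def process_chunk(chunk: gpd.GeoDataFrame):
    # each worker opens the raster itself; open dataset handles don't pickle
    with rasterio.open(GRID) as src:
        return exact_extract(src, chunk, ["mean"], include_cols=["divide_id"])

if __name__ == "__main__":
    divides = gpd.read_file(GPKG, layer="divides")
    n = 56  # one chunk per core
    chunks = [divides.iloc[i::n] for i in range(n)]
    with Pool(n) as pool:
        results = pool.map(process_chunk, chunks)
```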

mikejohnson51 commented 9 months ago

See here for a CONUS weights parquet file: s3://lynker-spatial/v20.1/forcing_weights.parquet