@JordanLaserGit I can't say for certain whether this would fit into `hfsubset` or some other utility, but this is a good place to start that discussion since it sounds like a reasonable need. I'm in the process of doing a small rewrite of the hfsubset service to an actual API to address concurrency issues (which is why there's been a lack of updates).
I haven't had the chance to generate forcing files, only use them. Would you be able to give me an example of the weights files?
https://ngenresourcesdev.s3.us-east-2.amazonaws.com/01_weights.json https://ngenresourcesdev.s3.us-east-2.amazonaws.com/nextgen_01.gpkg
@program-- Here's a weight file I generated a few months ago, along with the geopackage it came from (v20 hydrofabric, I believe). This isn't an immediate need; I just figured these weight files should be standardized and live somewhere the community can access readily. If you're curious to see the implementation in the forcing generation, you can find it here: https://github.com/CIROH-UA/ngen-datastream/tree/main/forcingprocessor
@JordanLaserGit Is this file generating the weights? If so, I think it's possible to integrate something similar on the service-side (or maybe client-side? since that'll let the user pick what forcing to use?).
For now, let's keep this issue open in case of further discussion, and I'll note the requirement as an addition to the hfsubset API service. 😄
@program-- Yup, that'll create the weights. Assuming we do something like run that script once on CONUS and then subset the resulting file, the speed of that script probably isn't too big of an issue. Though if we want that script to run upon user request, we may want to optimize it. CONUS took something like 18 hours (with no parallelization whatsoever).
Sounds good, thanks for taking a look at this!
Ask and ye shall receive! @JoshCu just sped things up. https://github.com/CIROH-UA/ngen-datastream/pull/36
In theory a further 2.5x speedup is possible by using geometry windows to rasterize only a subset of the whole region, but I'm having some interesting issues with transforms. Depending on whether I open the source netCDF with rasterio or with xarray, the y axis is flipped and there's some rounding error by the looks of it? Entirely possible I'm doing something weird though, I don't have much experience with this. Code to generate incorrect weightings quickly:
| 1000.00, 0.00,-2303999.25|
| 0.00,-1000.00, 1919999.62|
| 0.00, 0.00, 1.00|
| 1000.00, 0.00,-2303999.25|
| 0.00, 1000.00,-1920000.38|
| 0.00, 0.00, 1.00|
Starting at 2024-01-11 13:58:41.329457
Finished at 2024-01-11 13:58:51.810312
Total time: 0:00:10.480928
"cat-18": [[2421, 2422, 2422, 2423, 2423, 2424, 2424, 2424, 2424, 2424, 2425, 2425, 2425, 2425, 2425, 2426, 2426, 2426, 2426, 2426, 2426, 2427, 2427, 2427, 2427, 2427, 2427, 2427, 2427, 2428, 2428, 2428, 2428, 2428, 2428, 2428, 2428, 2428, 2429, 2429, 2429, 2429, 2429, 2429, 2430, 2430, 2431, 2431], [4323, 4323, 4324, 4323, 4324, 4322, 4323, 4324, 4325, 4326, 4322, 4323, 4324, 4325, 4326, 4322, 4323, 4324, 4327, 4328, 4329, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4329, 4324, 4325, 4326, 4327, 4328, 4329, 4327, 4328, 4327, 4328]]}
"cat-18": [[2432, 2433, 2433, 2434, 2434, 2435, 2435, 2435, 2435, 2435, 2436, 2436, 2436, 2436, 2436, 2437, 2437, 2437, 2437, 2437, 2437, 2438, 2438, 2438, 2438, 2438, 2438, 2438, 2438, 2439, 2439, 2439, 2439, 2439, 2439, 2439, 2439, 2439, 2440, 2440, 2440, 2440, 2440, 2440, 2441, 2441, 2442, 2442], [4323, 4323, 4324, 4323, 4324, 4322, 4323, 4324, 4325, 4326, 4322, 4323, 4324, 4325, 4326, 4322, 4323, 4324, 4327, 4328, 4329, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4329, 4324, 4325, 4326, 4327, 4328, 4329, 4327, 4328, 4327, 4328]]}
The column indices are all shifted by 11, but otherwise the result is the same.
If those rasterio weightings are correct, then it only takes ~10 minutes to create CONUS weightings.
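(A hedged sketch of how one might pin down where the flip and offset come from: open the same forcing grid both ways and compare the affine transforms directly. The file path, the variable name `RAINRATE`, and the assumption that the file exposes x/y coordinates are placeholders, not taken from the scripts linked above.)

```python
# Sketch: compare the transform rasterio/GDAL reports with the one rioxarray derives
# from the netCDF coordinates. "forcing.nc" and "RAINRATE" are placeholder names.
import rasterio
import rioxarray  # noqa: F401  (registers the .rio accessor on xarray objects)
import xarray as xr

path = "forcing.nc"

# GDAL's NetCDF subdataset syntax: NETCDF:"<file>":<variable>
with rasterio.open(f'NETCDF:"{path}":RAINRATE') as src:
    print("rasterio transform:\n", src.transform)  # north-up: negative y pixel size

da = xr.open_dataset(path)["RAINRATE"]
print("rioxarray transform:\n", da.rio.transform())

# If the y coordinate in the file is stored south-to-north, the rioxarray transform has a
# positive y pixel size and a different y origin, so row indices count from the opposite
# edge; flipping the array (or the row indices) reconciles the two. A sub-cell difference
# in the offsets usually comes down to cell-center vs cell-edge coordinates or rounding.
```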
@JoshCu For my own purposes, and to better integrate into hfsubset's subsetting code, I rewrote the weights generator in R... though if I throw it on the client-side, I guess I'll have to rewrite it again in Go. It seems to have a decent speedup, if I'm interpreting your results correctly:
Unit: seconds
min lq mean median uq max neval
4.356602 4.376032 4.75269 4.406561 4.420441 6.203815 5
for VPU 01. However, I'm experiencing something similar to your issue, where my coordinates are (seemingly) slightly off. `rasterio` uses GDAL under the hood, just like `terra` in my case (hence why I needed to flip my raster too), and I know that `xarray` (particularly `rioxarray`) and GDAL perform operations differently, so that might be the root of the issue. Whether the weights are correct or not... well, I guess that's up to @JordanLaserGit or someone else haha
I've attached my example weights output from R here: weights-r.json
@program-- I'm just running bash `time` and Python prints like a caveman, but I'm pretty sure your implementation is way faster, especially if that's single-process; R's pretty alien to me so I can't tell. Assuming that this https://lynker-spatial.s3.amazonaws.com/v20/gpkg/nextgen_01.gpkg is VPU 01?
We do seem to have different weightings outputs. I'm not sure what the best way to diff two large JSON files is, but they're different sizes and grep spot checks show different results:
"cat-18": [[2421, 2422, 2422, 2423, 2423, 2424, 2424, 2424, 2424, 2424, 2425, 2425, 2425, 2425, 2425, 2426, 2426, 2426, 2426, 2426, 2426, 2427, 2427, 2427, 2427, 2427, 2427, 2427, 2427, 2428, 2428, 2428, 2428, 2428, 2428, 2428, 2428, 2428, 2429, 2429, 2429, 2429, 2429, 2429, 2430, 2430, 2431, 2431], [4323, 4323, 4324, 4323, 4324, 4322, 4323, 4324, 4325, 4326, 4322, 4323, 4324, 4325, 4326, 4322, 4323, 4324, 4327, 4328, 4329, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4329, 4324, 4325, 4326, 4327, 4328, 4329, 4327, 4328, 4327, 4328]]
"cat-18": [[2432, 2433, 2433, 2434, 2434, 2435, 2435, 2435, 2435, 2435, 2436, 2436, 2436, 2436, 2436, 2437, 2437, 2437, 2437, 2437, 2437, 2438, 2438, 2438, 2438, 2438, 2438, 2438, 2438, 2439, 2439, 2439, 2439, 2439, 2439, 2439, 2439, 2439, 2440, 2440, 2440, 2440, 2440, 2440, 2441, 2441, 2442, 2442], [4323, 4323, 4324, 4323, 4324, 4322, 4323, 4324, 4325, 4326, 4322, 4323, 4324, 4325, 4326, 4322, 4323, 4324, 4327, 4328, 4329, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4329, 4324, 4325, 4326, 4327, 4328, 4329, 4327, 4328, 4327, 4328]]
"cat-18":[[2424,2425,2425,2425,2426,2426,2426,2426,2426,2427,2427,2427,2427,2427,2428,2428,2428,2428,2428,2428,2428,2429,2429,2429,2429,2429,2430,2430,2430,2431,2432],[4324,4324,4325,4326,4323,4324,4325,4326,4327,4324,4325,4328,4329,4330,4323,4324,4325,4326,4327,4328,4329,4326,4327,4328,4329,4330,4328,4329,4330,4329,4329]]
It would be easier to see if I could figure out how to wrap lines in code blocks. I seem to have double the number of coordinates for the catchments; maybe the `all_touched` implementation is different too?
RTI have a weightings generator we could compare to, but I think I'll leave that to someone whose understanding extends beyond my insightful "the numbers are different" comments.
weights from nextgen_01.gpkg -> weights_01.json
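(On the question of diffing two large weights JSON files: a minimal sketch that treats each catchment's parallel index lists as a set of (row, col) pairs and reports where the two files disagree. The file names are placeholders.)

```python
# Sketch: compare two weights files of the form {"cat-18": [[rows...], [cols...]], ...}.
# "weights_a.json" / "weights_b.json" are placeholder names.
import json

def load_pairs(path):
    with open(path) as f:
        weights = json.load(f)
    # Zip the parallel row/col lists into a set of (row, col) cell pairs per catchment.
    return {cat: set(zip(rows, cols)) for cat, (rows, cols) in weights.items()}

a = load_pairs("weights_a.json")
b = load_pairs("weights_b.json")

print("catchments only in A:", len(set(a) - set(b)))
print("catchments only in B:", len(set(b) - set(a)))

for cat in sorted(set(a) & set(b)):
    if a[cat] != b[cat]:
        print(f"{cat}: {len(a[cat])} vs {len(b[cat])} cells, "
              f"{len(a[cat] - b[cat])} only in A, {len(b[cat] - a[cat])} only in B")
```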
@JordanLaserGit @JoshCu https://github.com/isciences/exactextract just recently got Python bindings integrated (officially)... check it out for weight generation; it's super fast and accurate. (And likely how the R code mentioned above is so quick 😉)
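(A hedged sketch of what that could look like, under a few assumptions: the `divides` layer and `divide_id` column follow the hydrofabric schema, the `cell_id`/`coverage` operations and the `include_cols` argument behave as described in the exactextract docs, and the forcing grid carries a CRS. Check the current exactextract API before leaning on this.)

```python
# Sketch: per-catchment cell ids and coverage fractions via exactextract's Python bindings.
import json
import geopandas as gpd
import rasterio
from exactextract import exact_extract

divides = gpd.read_file("nextgen_01.gpkg", layer="divides")  # layer name is an assumption
grid = rasterio.open('NETCDF:"forcing.nc":RAINRATE')         # placeholder forcing grid

# Reproject the polygons to the grid's CRS so the cell indices line up.
divides = divides.to_crs(grid.crs)

feats = exact_extract(grid, divides, ["cell_id", "coverage"], include_cols=["divide_id"])

weights = {}
for feat in feats:  # assumes the default GeoJSON-style output: one feature per polygon
    props = feat["properties"]
    weights[props["divide_id"]] = {
        "cell_id": [int(c) for c in props["cell_id"]],      # flat 0-based indices into the grid
        "coverage": [float(v) for v in props["coverage"]],  # fraction of each cell covered
    }

with open("weights_01_exactextract.json", "w") as f:
    json.dump(weights, f)
```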
> @program-- I'm just running bash `time` and python prints like a caveman but I'm pretty sure your implementation is way faster. Especially if that's single process, R's pretty alien to me so I can't tell. Assuming that this https://lynker-spatial.s3.amazonaws.com/v20/gpkg/nextgen_01.gpkg is VPU 01?...
Yeah, VPU 01 is that link you sent (err, I might have used v20.1, but they should be effectively the same for this...). The size difference is most likely from the formatting between the JSON files; minifying has a drastic effect sometimes, and `jsonlite` does it by default.
It's interesting to see that all three generators have different results 😅 maybe @mikejohnson51 can weigh in (haha) to verify whether my R implementation is correct, since he's better with gridded data than I am. (Only if you have time, Mike!)
> @JordanLaserGit @JoshCu https://github.com/isciences/exactextract just recently got python bindings integrated (officially)... Check it out for weight generation, it's super fast and accurate. (And likely how the R code mentioned above is so quick 😉)
How dare you, I did it by hand 😛 but that would be useful, I didn't know you could generate weights with exactextract actually haha, so I'll have to check that out. I've only used it a bit.
Hi all!
Yes, we used to deliver the catchment weights for a few forcing products with each release as a `forcing_weights.parquet` file. The code to do this is still around. We did this with a slightly modified version of `exactextract` (used in `zonal`). If you want to use Python, check out the `exactextract` bindings that Dan Baston is working on through NASA (as @hellkite500 mentioned) or `gdptools` by Rich McDonald (@rmcd-mscb). In R (as @program-- did), `terra` can work, but `zonal` and `exactextractr` will be faster and more accurate.
The reason it is more accurate is that it takes partial coverage into account (e.g., a polygon covering half a cell).
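(To make the partial-coverage point concrete: downstream, the weights are essentially used as a coverage-weighted mean per catchment. A minimal illustration with placeholder numbers:)

```python
import numpy as np

def coverage_weighted_mean(values, coverage):
    """Mean of grid values for one catchment, weighted by how much of each cell it covers.

    A binary (all_touched-style) weighting is the special case where coverage is all ones,
    which over-weights cells the polygon only partially covers.
    """
    values = np.asarray(values, dtype=float)
    coverage = np.asarray(coverage, dtype=float)
    return float(np.sum(values * coverage) / np.sum(coverage))

# A catchment covering one full cell (value 2.0) and half of another (value 4.0):
print(coverage_weighted_mean([2.0, 4.0], [1.0, 0.5]))  # ~2.67, vs 3.0 for the unweighted mean
```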
We stopped sharing those as the forcing workstream became more mature and all groups (forcing, model engine, etc.) wanted to do it themselves.
Overall, I am not opposed to reintroducing the weights so that they would be accessible with `hfsubset` (as a layer), if it is of high value and we can determine one (or two) primary datasets. Is AORC the main dataset of interest, @JordanLaserGit?
The short-term goal is to get daily ngen outputs via ngen-datastream. In this, a forcings engine takes in the weights as an input. I've already generated the weights for v2.1, and we will re-use the same file until the next hydrofabric version comes out. So a CONUS weights file that could be subsetted would be the ideal dataset. Not a pressing need though, as ngen-datastream is still under development.
The primary motivation here is standardizing the catchment weights. We could all agree on using the same tool (like `exactextract`), but it seems like the most secure way would be to generate the file once and have it live somewhere everyone can reference. Otherwise we may be setting people up for a situation where comparing two research projects over the "same" catchment yields different results just because weight generation wasn't consistent.
Makes perfect sense to me. Remember though (and I'm sure you've got this) that a weights file will only be valid for one grid (same grid size and CRS).
We can add this to the releases. Would you mind adding an issue to the Hydrofabric repo to track this, with a pointer to a sample forcing file?
@hellkite500 exactextract looks great! I didn't realise quite how low-resolution the netCDFs are compared to the water basin geometries until I overlaid them just now. Do you know if there's a recommended method to parallelise it? I think it's caching the % of each pixel covered by the geometries, so I've been chunking by geometry rather than by time to prevent recalculating the polygons. I'm just using Python's multiprocessing pool as it's what I'm familiar with, but that's probably not the best way of doing it?
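(For reference, the chunk-by-geometry approach described here might look roughly like the sketch below; `compute_weights_for_chunk` is a hypothetical stand-in for whatever per-polygon weight routine is being used, and the layer/column names are assumptions.)

```python
# Sketch: split the catchment polygons into chunks and farm them out to a process pool,
# so each worker only computes coverage for its own polygons (never re-rasterizing others).
from multiprocessing import Pool
import geopandas as gpd

def compute_weights_for_chunk(chunk):
    # Placeholder: call your weight generator (rasterio- or exactextract-based) on this
    # slice of polygons and return {divide_id: weights}.
    return {}

def chunked(df, n_chunks):
    # Split the GeoDataFrame into roughly equal slices, one per worker task.
    step = max(1, len(df) // n_chunks)
    return [df.iloc[i:i + step] for i in range(0, len(df), step)]

if __name__ == "__main__":
    divides = gpd.read_file("nextgen_01.gpkg", layer="divides")
    with Pool(processes=8) as pool:
        partials = pool.map(compute_weights_for_chunk, chunked(divides, 32))
    weights = {}
    for part in partials:
        weights.update(part)
```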
See here for a conus weights parquet file: s3://lynker-spatial/v20.1/forcing_weights.parquet
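(A quick way to pull that file, assuming a recent pandas with `s3fs` installed; the anonymous-access option may or may not be needed depending on the bucket policy.)

```python
import pandas as pd

# Read the CONUS forcing weights straight from S3 (requires the s3fs package).
weights = pd.read_parquet(
    "s3://lynker-spatial/v20.1/forcing_weights.parquet",
    storage_options={"anon": True},  # assumption: the bucket allows anonymous reads
)
print(weights.head())
```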
In order to create NextGen forcing files, weights (indices) are calculated from the hydrofabric and National Water Model files. Currently, this is done by rasterizing each catchment, which is a time-consuming process. Would subsetting from a conus_weight.json file be a better solution? Would hfsubset be a good place to do this? These weights might change with each hydrofabric release, so it would be nice to have the weights computed automatically along with the geopackages.
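(For illustration, subsetting a precomputed CONUS weights JSON down to the catchments in a subset geopackage could be as simple as the sketch below; the `divides` layer and `divide_id` column names are assumptions about the hydrofabric schema, and the file names are placeholders.)

```python
import json
import geopandas as gpd

# Catchment ids present in the subset geopackage.
subset_ids = set(gpd.read_file("subset.gpkg", layer="divides")["divide_id"])

# Keep only those catchments from the precomputed CONUS weights file.
with open("conus_weights.json") as f:
    conus = json.load(f)

subset = {cat: w for cat, w in conus.items() if cat in subset_ids}

with open("subset_weights.json", "w") as f:
    json.dump(subset, f)
```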