kratzert / Caravan

A global community dataset for large-sample hydrology
BSD 3-Clause "New" or "Revised" License

Bug affecting a handful of attributes #22

Closed: kratzert closed this issue 1 year ago

kratzert commented 1 year ago

Hi all,

While working on the code for an upcoming change, I found a bug that affects a handful of attributes. These attributes are derived from the most downstream HydroATLAS polygon that significantly intersects with the basin polygon. This is related to #15.

There is a part in the code where we traverse the HydroATLAS polygons along the NEXT_DOWN attribute until we reach a polygon that no longer overlaps significantly with the basin polygon. This way, we can identify the most downstream HydroATLAS polygon and use it to derive a couple of attributes that are defined for the pour point rather than for the polygon, e.g. the total upstream reservoir volume or the river area.
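The traversal described above can be sketched roughly as follows. This is a minimal illustration, not the actual Caravan code: the function name `find_most_downstream`, the plain-dict representation of the `NEXT_DOWN` links, and the toy polygon IDs are all assumptions. It also assumes that `NEXT_DOWN == 0` marks a polygon without a downstream neighbor.

```python
def find_most_downstream(next_down, significant_ids, start_id):
    """Walk along NEXT_DOWN links while the next polygon still overlaps significantly.

    next_down: dict mapping a polygon ID to the ID of its downstream polygon
               (0 is assumed to mean "no downstream polygon").
    significant_ids: set of polygon IDs that passed the overlap filtering.
    start_id: any polygon ID from significant_ids (hypothetical entry point).
    """
    current = start_id
    visited = {current}
    while True:
        candidate = next_down.get(current, 0)
        # Stop at the outlet, at a polygon that was filtered out, or on a cycle.
        if candidate == 0 or candidate not in significant_ids or candidate in visited:
            return current
        visited.add(candidate)
        current = candidate


# Toy network (made-up IDs): 1 -> 2 -> 3 -> 4 -> outlet
toy_next_down = {1: 2, 2: 3, 3: 4, 4: 0}

# Polygons 1 to 3 overlap significantly with the basin, polygon 4 does not.
most_downstream = find_most_downstream(toy_next_down, {1, 2, 3}, start_id=1)

# If polygon 2 is missing from the filtered set, the walk stops too early.
stopped_early = find_most_downstream(toy_next_down, {1, 3}, start_id=1)
```

In this toy network, `most_downstream` is polygon 3, while `stopped_early` is polygon 1: a single polygon missing from the chain is enough for the walk to end on the wrong polygon.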

The problem is that in a different part of the code we remove intersecting polygons if the intersecting area is not larger than a predefined threshold (5 sqkm), without considering the size of the HydroATLAS polygon itself. However, there are polygons in HydroATLAS level 12 that are themselves smaller than 5 sqkm, so even with a 100% intersection, these polygons would be removed. If these small polygons are removed, our algorithm for detecting the most downstream polygon might fail and identify the wrong polygon. Instead, we should also consider the overlap with the basin polygon (as also suggested by @jonschwenk in #15) when filtering out intersecting polygons. If we only look at the percentage of overlap, though, we might run into different problems for small basins that, for example, intersect with less than half of a single HydroATLAS polygon. Therefore, we will apply both filters together.
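One way to read the combined filtering is sketched below. The 5 sqkm area threshold comes from the description above; the fractional threshold of 0.5, the function names, and the choice to keep a polygon when it passes either criterion are assumptions made for illustration, not necessarily what the fix implements.

```python
MIN_INTERSECTION_SQKM = 5.0   # absolute area threshold mentioned in this issue
MIN_OVERLAP_FRACTION = 0.5    # assumed value, not taken from the Caravan code


def keep_polygon_old(intersection_sqkm):
    """Area-only filtering: drops every polygon smaller than the threshold."""
    return intersection_sqkm > MIN_INTERSECTION_SQKM


def keep_polygon_combined(intersection_sqkm, polygon_sqkm):
    """Keep a polygon if it passes the area criterion or the overlap criterion."""
    overlap_fraction = intersection_sqkm / polygon_sqkm
    return (
        intersection_sqkm > MIN_INTERSECTION_SQKM
        or overlap_fraction > MIN_OVERLAP_FRACTION
    )


# A 3 sqkm HydroATLAS polygon that lies fully inside the basin.
small_polygon_old = keep_polygon_old(3.0)
small_polygon_new = keep_polygon_combined(3.0, 3.0)

# A small basin covering 8 sqkm of a 100 sqkm polygon (8% overlap).
small_basin_new = keep_polygon_combined(8.0, 100.0)

# A sliver of 1 sqkm in a 100 sqkm polygon fails both criteria.
sliver_new = keep_polygon_combined(1.0, 100.0)
```

Under these assumptions, the fully covered 3 sqkm polygon is dropped by the area-only check but kept by the combined check, the small-basin case is kept because of its absolute area, and the sliver is still removed.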

I am currently working on adapting the code accordingly, then I will update the dataset with new attribute files and also reach out to the authors of the 3 extensions.

The affected attributes are:

pour_point_properties = ['dis_m3_pmn', # natural discharge annual min
                         'dis_m3_pmx', # natural discharge annual max
                         'dis_m3_pyr', # natural discharge annual mean
                         'lkv_mc_usu', # Lake volume
                         'rev_mc_usu', # Reservoir volume
                         'ria_ha_usu', # River area
                         'riv_tc_usu', # River volume
                         'pop_ct_usu', # Population count in upstream area
                         'dor_pc_pva', # Degree of regulation in upstream area
                        ]

kratzert commented 1 year ago

The fix has been submitted. Only the attributes listed above were affected, and only for a handful of basins. However, we recommend updating the dataset as soon as the new version has finished uploading.

Here are a few scatter plots of the old vs. new values for three different attributes:

Only the third attribute should be (and is) affected, and only for basins with certain edge cases. Below you will find the scatter plots grouped by sub-dataset.

Screenshot from 2023-05-16 21-31-14

Screenshot from 2023-05-16 21-31-29

Screenshot from 2023-05-16 21-31-47

Screenshot from 2023-05-16 21-32-10

Screenshot from 2023-05-16 21-32-29

Screenshot from 2023-05-16 21-32-46

Screenshot from 2023-05-16 21-33-13

As you can see, most basins are not affected, but for some there is a difference. In these cases, the new attribute always accounts for a larger area, so for ria_ha_usu the new value should be (and is) strictly larger for the affected basins.
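To check a local copy in the same spirit as the scatter plots, the old and new attribute values can be compared per basin. The sketch below is only an illustration: the helper name `changed_basins`, the basin IDs, and the numbers are made up and do not come from the dataset. In practice, the two dictionaries would be filled from the old and new attribute files.

```python
def changed_basins(old, new, tol=1e-9):
    """Return {basin_id: (old_value, new_value)} for basins whose value changed."""
    return {
        basin_id: (old[basin_id], new[basin_id])
        for basin_id in old
        if basin_id in new and abs(new[basin_id] - old[basin_id]) > tol
    }


# Placeholder ria_ha_usu values, purely illustrative.
old_ria = {"basin_a": 40.0, "basin_b": 10.0, "basin_c": 7.5}
new_ria = {"basin_a": 40.0, "basin_b": 12.5, "basin_c": 7.5}

diff = changed_basins(old_ria, new_ria)

# For ria_ha_usu, every changed basin is expected to have a larger new value.
all_larger = all(new_value > old_value for old_value, new_value in diff.values())
```

With these placeholder values, only `basin_b` shows up in `diff`, and `all_larger` is `True`.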