kratzert / Caravan

A global community dataset for large-sample hydrology
BSD 3-Clause "New" or "Revised" License

HydroAtlas attributes aggregation issues #15

Closed jonschwenk closed 1 year ago

jonschwenk commented 1 year ago

I was comparing the output of the Caravan Part 1 code (HydroATLAS attribute aggregation) against my own version where I process everything locally without GEE. I have found some discrepancies and will explain them here.

Non-accumulated discrepancies

Accumulated discrepancies

The largest discrepancies by far were for accumulated/pour point attributes. These are tricky to get right since many of the accumulated variables have large step-changes as we move downstream, so tributaries/junctions become problematic. If I understand correctly, the Caravan approach takes the HydroBasin with the largest up_area that meets some overlap threshold area as the "downstream basin." It's a reasonable approach but I think it's important that users know the possible limitations. Here is an illustration for the reservoir volume attribute (rev_mc_usu):

image

The small polygons are HydroATLAS basins; the red polygon is a watershed boundary delineated from MERIT-Hydro. HydroATLAS polygons are colored by their rev_mc_usu value, which is 0 for most polygons and large for a handful. The target watershed should have a rev_mc_usu value of 0, but the Caravan code (with the default MIN_OVERLAP_THRESHOLD=5) returns 99068. The figure above shows why that is--the MERIT basin delineation overlaps one of the HydroATLAS basins that isn't part of the intended watershed just enough (i.e. more than 5km^2) that it is sampled as the downstream basin.
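For concreteness, here is a minimal local (pandas) sketch of that selection rule as I understand it -- this is not the actual Caravan/GEE code, and the DataFrame and column names are illustrative:

    import pandas as pd

    MIN_OVERLAP_THRESHOLD = 5  # km^2, the Caravan default

    def pick_downstream_basin(ha_polys):
        # ha_polys: one row per intersecting HydroATLAS polygon, with columns
        # 'hybas_id', 'up_area' (km^2) and 'coverage' (intersection area with
        # the watershed, km^2). Among polygons whose intersection exceeds the
        # threshold, the one with the largest upstream area is treated as the
        # "downstream basin".
        candidates = ha_polys[ha_polys['coverage'] > MIN_OVERLAP_THRESHOLD]
        return candidates.loc[candidates['up_area'].idxmax(), 'hybas_id']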

The problem is that MIN_OVERLAP_THRESHOLD would thus need to be tuned for each basin to avoid this mistake for accumulated attributes. That's not really feasible, since a user would have to compare each of their basins with the HydroATLAS basins to determine whether there are unintended overlaps and then set the threshold. I don't have a good solution to propose for the Caravan/GEE approach, but if it's helpful, I have a simple solution that works pretty well for my local (non-GEE) processing:

  1. Compute the fraction of each HydroATLAS polygon that is within the watershed polygon.
  2. Select the HydroATLAS polygon with the largest value from 1) as the first possible basin. This ensures that we are starting with a basin that is definitely within the watershed polygon.
  3. "Walk" downstream using the next_down attribute of the possible_basin
  4. For each "step" that is taken (i.e. each new candidate basin), check that its value from 1) is above some threshold (I suggest 0.75 based on my testing). This ensures that the possible_basin is indeed part of the target watershed polygon.
  5. Quit when the condition for 4) is not met. The last possible_basin is then the downstream-most basin from which to sample the accumulated attribute.
  6. (Some handling of cases where no basins meet 2) is required.)

Here's my code that does this:

    import numpy as np

    # Step 2: start from the HydroATLAS basin with the largest fractional coverage
    # by the watershed polygon.
    this_idx = np.argmax(df['frac_basin_cover'].values)
    possible_basin = df['hybas_id'].values[this_idx]
    # Steps 3-5: walk downstream via next_down until we either leave the set of
    # intersecting basins or the coverage fraction drops below the threshold.
    while 1:
        this_basin = df['next_down'].values[this_idx]
        this_idx = np.where(df['hybas_id'].values==this_basin)[0]
        if len(this_idx) == 0:
            break
        else:
            this_idx = this_idx[0]

        if df['frac_basin_cover'].values[this_idx] < 0.75:
            break
        else:
            possible_basin = this_basin

I think something like this could be implemented in GEE. I have tested this for ~1000 watersheds across the Arctic against the Caravan implementation. Here's a snapshot of the largest discrepancies:

image

v_mean is my method (VotE), c_mean is Caravan. For these mean values, I have averaged the variable's value across all ~1000 watersheds. The errors are percentages: (v_mean - c_mean) / v_mean. While I can't say that all the attributes' discrepancies are due to the above issue, the four that I have looked at in detail are.

Using percent errors as I have isn't a great method, since it scales with the range of the particular variable (e.g. in the above example, the range is from 0 to 99068, so the percent error is huge), but it highlights the variables that disagree between the two methods. To check this, here is a histogram of discrepancies for the rev_mc_usu attribute between VotE and Caravan:

image

The overwhelming majority of watersheds agree perfectly, but a non-negligible fraction have very large discrepancies. I manually checked five of these large discrepancies, and each one was due to the above issue.

Anyway, I was hoping to contribute Arctic watersheds to the Caravan collection, but I would rather use my local aggregation methods since they seem to handle finding the downstream HydroATLAS basin more reliably. I think users of Caravan datasets should be aware of this possible issue in accumulated HydroATLAS attributes--while the issue is infrequent, it can make an enormous difference in the returned attribute value!

kratzert commented 1 year ago

Hey @jonschwenk

thanks for opening this issue. I was (to some degree) aware of this, and the whole attributes-from-HydroATLAS part comes with some uncertainty in general. I am currently on a business trip and am partially on holiday next week, but I will have a close look into this once I return to normal work days.

That being said: Caravan is not yet officially published, so we are able to change our code and recompute the attributes without any problems. If we find a method (e.g. your proposal) that gives reliably better estimates, then I would definitely be in favor of changing the method of finding the "most downstream polygon". So in this regard, you already contributed to Caravan :smile:

Regarding the Arctic, that would certainly be a valuable data contribution, although one should be careful with HydroATLAS in these regions, as the quality of HydroSHEDS (and hence HydroATLAS) degrades a lot above 60° northern latitude (because of the underlying DEM they use, see the official docs for details).

kratzert commented 1 year ago

I had a quick look into this and although your algorithm works for finding a potential "most downstream polygon", the very first random basin I tested showed that this approach is not free of mistakes. See the example below, where white polygons are from HydroATLAS, the rose polygon is the catchment polygon, and the lines are from HydroRIVERS, so we can better identify the outlet. Additionally, I gave three HydroATLAS polygons different outline colors to make it easier to explain.

Screenshot from 2022-11-03 11-45-02

If we follow your approach, starting from the intersecting polygon with the largest overlap and then walking down the river following the NEXT_DOWN attribute, the resulting "most downstream" HydroATLAS polygon would be the one with the red outline. However, since the intersection between the catchment area and this polygon is quite small (smaller than 5 sqkm, and also well below your 75% intersection threshold, which, like the 5 sqkm, is another somewhat arbitrarily chosen value), this polygon would not be considered. So the previous polygon would be picked as the most-downstream polygon, which in this case is the one with the blue outline. This, however, means that we would ignore what happens in the polygon with the green outline (which currently happens as well). If we picked the polygon with the red outline instead, we could have the problem that there is a reservoir in the non-intersecting part, which would then be considered part of the actual catchment polygon (I know this happened in a few places, which is why we have this algorithm in the first place).

Since you seem to work with the underlying datasets yourself, you will know that this is a non-trivial problem to solve. Probably the best option would be not to rely on HydroATLAS in the first place, but instead to derive the attributes ourselves from raster data. That, however, was not the idea of Caravan. HydroATLAS v2 (probably released next year) could be another solution (or at least make the problem smaller).

I need to think more about this and probably discuss it with some of my colleagues. I'll keep you updated.

jonschwenk commented 1 year ago

I love that the first example you looked at was a case where my proposed algorithm would fail 😅 -- I did not come across that case in the ~10 I looked at, but it is a good counterexample. I think there's a relatively easy fix for this as well--I'll drop it here and you tell me if you can imagine any reasonable failure cases for it:

image

and here is the corresponding Python code:

import numpy as np

# df['coverage'] is the overlapping area (in square km) of the HA basin with the watershed
df['frac_basin_cover'] = df['coverage'] / df['sub_area'] # 'sub_area' is provided by HA

MIN_OVERLAP_FRAC = 0.5 # minimum fractional overlap of a HA basin with the provided watershed to consider the HA polygon part of the watershed

# Find ds_basin - the first basin (when walking downstream) that doesn't
# meet the MIN_OVERLAP_FRAC threshold.
ds_idx = np.argmax(df['frac_basin_cover'].values)
ds_basin = df['hybas_id'].values[ds_idx]
while 1:
    ds_basin = df['next_down'].values[ds_idx]
    ds_idx = np.where(df['hybas_id'].values==ds_basin)[0]
    if len(ds_idx) == 0:
        break
    else:
        ds_idx = ds_idx[0]

    if df['frac_basin_cover'].values[ds_idx] < MIN_OVERLAP_FRAC:
        ds_basin = df['hybas_id'].values[ds_idx]
        break

# Find all HA basins that drain to ds_basin and are within the watershed polygon
check_set = set([ds_basin])
us_set = set()
while check_set:
    this_basin = check_set.pop()
    this_us = set(df['hybas_id'].values[np.logical_and(df['next_down']==this_basin, df['frac_basin_cover'] > MIN_OVERLAP_FRAC)].tolist())
    check_set.update(this_us)
    us_set.update(this_us)
# Note that us_set does not include ds_basin (which is what we want)

This would be harder to implement in GEE but not impossible (I hope). Maybe the processing could be done locally, but that might fail for really large watersheds.

Anyway, I agree that the use of HydroATLAS is purely convenience and we should move away from it if possible. We already have identified and/or uploaded many of the underlying HydroATLAS datasets ourselves to GEE as part of our rabpro package. It's easy to envision putting all of these into a Caravan script to sample with a single GEE call, but there are additional datasets we'd have to upload. (Some of them like Kummu's GDP I already have in my personal GEE assets, as well as arctic-specific ones like permafrost probability, volumetric ice content, glacier coverage, etc.). One of the benefits of this approach is that you could tie versions of Caravan code to versions of data, so as datasets get updated with newer versions, the Caravan code could also be updated easily and re-run to fetch newest-version catchment attributes. The GSW dataset is a good example of one that gets updated fairly frequently and would be useful to sample for Caravan datasets.

It seems like that's the right way to go about it, but it would definitely be messier and require more work to assemble the full set of attribute rasters. For CAMELS-Arctic, we are already doing this for the additional arctic-specific datasets that aren't in HydroATLAS (as well as re-sampling some of the HA attributes that have known poor quality above 60N as you mentioned). If I have time, I will try to assemble a list of the datasets already on GEE (or user-contributed) that overlap with HA--it could be that there are only a few that are missing. Also, we could potentially drop some of the HA attributes that we don't feel are that valuable (or for which proxies are computed from the ERA5-Land like aridity index or PET etc.).

One other thing to consider is that even if we directly sampled rasters ourselves, the problem with accumulated variables still exists (and is potentially more complicated). I have worked on that problem a bit, and I don't think there's any general method that works perfectly for all accumulated datasets (implying one method per dataset). Taking a high percentile rather than the max alleviates most of the problems, but it's not perfect.
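To illustrate what I mean by the percentile approach, here is a purely hypothetical sketch (the function, variable names, and choice of percentile are placeholders, not production code):

    import numpy as np

    def sample_accumulated_attribute(cell_values, percentile=99):
        # cell_values: accumulated-attribute raster values sampled near the outlet.
        # Taking a high percentile instead of the max reduces the chance of picking
        # up a cell that belongs to a neighboring mainstem rather than the target basin.
        return np.nanpercentile(cell_values, percentile)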

kratzert commented 1 year ago

Thanks again for your thoughts on this. The only thing I would correct in your suggestion is that in step 3, we do not want to take the maximum of the pour point properties from green and blue but rather their sum. All of the pour point properties are either volumes (e.g. reservoir volume) or areas (e.g. lake area) and we want the total area/volume in the catchment, not only from the side arm with the max values, right?

I also found another termination condition that we need to add, which is to check whether NEXT_DOWN == 0 (the HYBAS_ID of the ocean). Otherwise, for coastal catchments, the check for "all polygons that drain into the most downstream polygon" (here, 0) would potentially return neighboring/unrelated catchments.

Given the format of the data in the Caravan Jupyter notebook, I think this function covers all cases and computes the pour_point_properties as we discussed:

def compute_pour_point_properties(basin_data):
    # Find basin with the largest overlap (in percent).
    percentage_overlap = [x / y for x, y in zip(basin_data['weights'], basin_data["SUB_AREA"])]
    current_basin_pos = np.argmax(percentage_overlap)

    # Get the HYBAS_ID of the next downstream polygon.
    next_down_id = basin_data['NEXT_DOWN'][current_basin_pos]

    # Traverse the river network downstream until we hit the termination condition.
    while True:

        # Make sure that we did not reach the ocean (HYBAS_ID = 0), otherwise we would later pick all intersecting
        # polygons that also drain into HYBAS_ID = 0.
        if next_down_id == 0:
            break

        # Check if next_down_id is in the list of HYBAS_IDs.
        if next_down_id not in basin_data['HYBAS_ID']:
            break

        # Get position of next_down_id in the list of HYBAS_IDs
        next_down_pos = basin_data['HYBAS_ID'].index(next_down_id)

        # Check that the intersection of the next downstream polygon is at least 50%
        if percentage_overlap[next_down_pos] < 0.5:
            break

        # Continue with the next polygon.
        next_down_id = basin_data['NEXT_DOWN'][next_down_pos]

    # Find all polygons that would have drained into that polygon, which can potentially be more than one in cases
    # where two (or more) rivers merge.
    direct_upstream_polygons = []
    for i, next_down in enumerate(basin_data['NEXT_DOWN']):
        if (next_down == next_down_id) and (basin_data['weights'][i] > MIN_OVERLAP_THRESHOLD):
            direct_upstream_polygons.append(i)

    # Finally compute metrics. As all pour_point_properties are volumes or areas, we can simply compute the sum.
    aggregated_properties = {}
    for prop in pour_point_properties:
        aggregated_properties[prop] = sum([basin_data[prop][i] for i in direct_upstream_polygons])

    return aggregated_properties

WDYT? For all polygons that I checked manually (I looked at some random CAMELS polygons), the algorithm seems to do what it should.

However, this would only explain the differences in the attributes that I had listed as pour_point_attributes, which were

# List of features for which we take the value of the most downstream polygon
pour_point_properties = [
     'dis_m3_pmn', # natural discharge annual min
     'dis_m3_pmx', # natural discharge annual max
     'dis_m3_pyr', # natural discharge annual mean
     'lkv_mc_usu', # Lake Volume
     'rev_mc_usu', # reservoir volume
     'ria_ha_usu', # River area
     'riv_tc_usu', # River volume
]

Two attributes from your list, for which I computed the area-weighted spatial average, should actually be pour-point properties as well. These are pop_ct_usu (it makes more sense to use pop_ct_usu than to spatially average pop_ct_ssu) and dor_pc_pva (which is actually defined over the total upstream area). For pop_ct_usu, we can also take the sum (in case two or more polygons drain into the most-downstream polygon, as in my example above); for degree of regulation it gets trickier. This feature is defined as a percent, so taking the sum makes no sense. I think here we need to take the UP_AREA into account and compute an area-weighted aggregation. This would change the last part of the function to:

# Finally compute metrics. Most pour_point_properties are volumes or areas, for which we simply compute the sum;
# dor_pc_pva is a percentage and gets an UP_AREA-weighted average instead.
aggregated_properties = {}
for prop in pour_point_properties:
    if prop == 'dor_pc_pva':
        weighted_prop = [basin_data[prop][i] * basin_data['UP_AREA'][i] for i in direct_upstream_polygons]
        total_up_area = sum([basin_data['UP_AREA'][i] for i in direct_upstream_polygons])
        aggregated_properties[prop] = sum(weighted_prop) / total_up_area
    else:
        aggregated_properties[prop] = sum([basin_data[prop][i] for i in direct_upstream_polygons])

Would love to get your thoughts on this.

Then there are only three attributes left from your table: tmp_dc_syr, wet_pc_s06 and gdp_ud_ssu. I don't see why tmp_dc_syr and wet_pc_s06 are in this list or what the problem with them is (compared to, e.g., all the other wetland classes). These are simple attributes for which we can compute the area-weighted spatial average, as for most of the other attributes, right? gdp_ud_ssu, on the other hand, does not really make sense to me after looking more carefully at the data. I need to think about this feature some more, but eventually I would remove it altogether.
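For reference, the area-weighted spatial average I have in mind for these is simply the following (a minimal sketch, not the notebook code, assuming the same basin_data structure as in the function above):

    import numpy as np

    def area_weighted_average(basin_data, attribute):
        # basin_data['weights'] holds the intersection area of each HydroATLAS
        # polygon with the catchment, as in the notebook snippet above.
        return np.average(basin_data[attribute], weights=basin_data['weights'])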

jonschwenk commented 1 year ago

Nice--that looks good to me. I agree that taking the maximum is a mistake and that summing the next-upstream basins is the right(est) thing to do. I suggested the maximum because I was thinking only of dor_pc_pva at the time, and taking the maximum seemed like a decent-enough approximation. But I think the special handling you suggested in the second code block is easy enough and more accurate. I will implement it on my end and see if anything looks suspect.

I should note that the table I showed above is sorted by largest % difference (mean_abs_err) and contains all the HA attributes in Caravan--I just posted a snippet with the highest differences. The reason that tmp_dc_syr and wet_pc_s06 (and others I did not show in the table snippet) show up is my poor choice of computing % difference:

tmp_dc_syr - the issue is that I normalized by the mean, which is very near 0 because tmp_dc_syr can take negative values. Agreement is actually good between the two:

image

wet_pc_s06 - same problem--the mean is 0.006. Here are the raw differences that show near-perfect agreement:

image

To get a clearer picture, I looked at the RMSE normalized by the range of each variable (times 100 to express as a percent):

image

The takeaway here is that normalized RMSE is < 0.1% for all variables except the 13 in the black oval. Of those 13, we've discussed/resolved all of them in this issue except the spatial majority ones. I will look into those more closely to see if they're anything to be concerned about.
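For clarity, the range-normalized RMSE above is just the following (a sketch; the function and variable names are illustrative, with v and c being the per-watershed values from VotE and Caravan):

    import numpy as np

    def nrmse_percent(v, c):
        v, c = np.asarray(v, dtype=float), np.asarray(c, dtype=float)
        rmse = np.sqrt(np.mean((v - c) ** 2))
        # Express the RMSE as a percentage of the variable's range.
        return 100 * rmse / (np.max(v) - np.min(v))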

One of the main reasons (I think) we don't get perfect agreement is that I did not employ a MIN_OVERLAP_THRESHOLD. I think there are arguments for/against implementing it, but from what I've seen, its effect is essentially insignificant (maybe not the case for really tiny basins or those that heavily disagree with HydroATLAS catchment boundaries).

I will report if I find anything fishy with the spatial majority attributes, but it seems a fairly straightforward aggregation. I think you can consider this issue closed for now 👍 Thanks for working through it with me!

One final note: my reading of the documentation suggests that some of the HydroATLAS attributes have rescaled units. E.g. tmp_dc_syr has been multiplied by 10, so it's not exactly degrees Celsius. You probably are aware, but it might trip someone up if they assume otherwise. Not sure how best to communicate that to potential users, but if they're just plugging these into ML models, maybe it doesn't matter...
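For example, assuming the documented factor of 10 (df here is just a placeholder for wherever the attributes live):

    # tmp_dc_* attributes appear to be stored as degrees Celsius x 10, so divide
    # before interpreting them as actual temperatures.
    tmp_degC = df['tmp_dc_syr'] / 10.0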

jonschwenk commented 1 year ago

Just a note that the spatial majority discrepancies were due to a bug on my end. I have updated all my locally-run code with the new scheme for aggregated properties including special handling of dor_pc_pva. If/when you implement this into the Caravan code, I can compare again.

kratzert commented 1 year ago

Thanks, I am trying to get this done this week. Thanks again for reporting this and brainstorming a solution together.

kratzert commented 1 year ago

Hi @jonschwenk

sorry, it took me a while to get this done. Attached you can find the updated attributes; I would love it if you could check whether they align with your values. I manually checked a couple of gauges for the pour point attributes and it looks good to me. A few notes:

updated_attributes.zip

kratzert commented 1 year ago

I already updated the code and pushed it to a branch. You can check the implementation at https://github.com/kratzert/Caravan/blob/update-code/code/Caravan_part1_Earth_Engine.ipynb (see compute_pour_point_properties() function). This will soon be merged into master.

jonschwenk commented 1 year ago

Happy Holidays! Getting back to this now. Working through the updated code.

I posted an issue here about pop_ct_usu but I had failed to copy your new property lists correctly, so I think it's fine.

Will update with anything else I find and finally a quantitative comparison between our methods as above.

jonschwenk commented 1 year ago

Ok, I think I'm happy with the agreement between our approaches. There are some differences in the final results, and sometimes these are large. These all occur for the pour point (or upstream) properties and are related to the slightly different approaches we take. Our algorithms are essentially the same, but I've tracked down our differences (at least in the 3 cases I looked at) to the choice of Caravan's MIN_OVERLAP_THRESHOLD vs VotE's min_overlap_fraction. Here's an example:

image

In this case (and the other two I looked at), both methods identify the same downstream HYBAS_ID. However, Caravan deems there to be two direct_upstream_polygons while VotE only has one. This is because Caravan uses a MIN_OVERLAP_THRESHOLD in terms of area (5 km^2 in this case), whereas VotE uses min_overlap_fraction (0.5 in this case) that enforces that at least half of the potential upstream HA basin lies within the provided watershed. In this case, and the others, Caravan's inclusion of the extra basin is unwanted, as this extra basin actually belongs to the main river and not the tributary on which the gage sits. So the value of 77995 is used by Caravan to compute rev_mc_usu, but the "true" value should be 0. You can also see how just a few discrepancies like this throw off the average error statistics--a slight difference in method can lead to a huge difference in result. In general, VotE's method is more conservative in the sense that it's more likely to err on the side of underprediction (assuming a low-ish value for MIN_OVERLAP_THRESHOLD like 5 km^2).
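In code terms, the two filters differ roughly like this (illustrative only, using the column names from my earlier snippet, where df holds the intersecting HA polygons and ds_basin is the agreed-upon downstream HYBAS_ID):

    # Caravan: keep upstream polygons whose intersection *area* exceeds 5 km^2.
    caravan_upstream = df[(df['next_down'] == ds_basin) & (df['coverage'] > 5.0)]
    # VotE: keep upstream polygons at least half of which lie inside the watershed.
    vote_upstream = df[(df['next_down'] == ds_basin) & (df['frac_basin_cover'] > 0.5)]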

One other note: VotE uses an algorithmic approach to determine whether a HydroATLAS property is pour_point, upstream, etc. The property naming convention allows this. I don't see any discrepancies between your typed-out approach and mine in terms of aggregation strategies, but I didn't check all ~280 properties. Just to demonstrate, here's my code block that handles that:

    agged = {}
    for a in attrs:

        if use_caravan_attributes is True and a not in car_attrs:
            continue

        agg = a[8:10]

        if a == 'ele_mt_smn':
            agged[a] = np.nanmin(all_us_basins[a])
        elif a == 'ele_mt_smx':
            agged[a] = np.nanmax(all_us_basins[a])
        elif a == 'dor_pc_pva': # degree of regulation as a percent; it is an
                                # accumulated value so we use a weighting of 
                                # immediate_us_basins
            agged[a] = np.average(immediate_us_basins[a].values, weights=immediate_us_basins['frac_total_cover'].values)
        elif a[7] in ['p', 'u']: # pour point/upstream (i.e. accumulated) values; 
                                 # sum immediate_us_basins for these aggregations since they're volumes 
                                 # or areas with the exception of dor, which gets handled separately.
            agged[a] = np.nansum(immediate_us_basins[a])
        else:
            if agg == 'su': # "weighted" sum 
                agged[a] = np.nansum(df[a] * df['frac_basin_cover'])
            elif agg == 'mj': #majority - there should be no nans in these attributes
                dfc = df[[a, 'coverage']].groupby(a).aggregate(sum)
                agged[a] = dfc.index.values[dfc['coverage']==np.max(dfc['coverage'])][0]
            elif agg in ['av', 'se', 'yr', 'mn', 'mx', 'lt', 'va'] \
                or a[:3] in ['pnv', 'wet', 'glc', 'tmp', 'pre', 'pet', 'aet', 'cmi', 'snw', 'swc', 'hft']:
                validindices = ~np.isnan(df[a].values)
                if not np.any(validindices): # no non-NaN values for this attribute
                    agged[a] = np.nan
                else:
                    agged[a] = np.average(df[a][validindices], weights=df['frac_total_cover'].values[validindices])
            else:
                print('Attribute {} has no strategy.'.format(a))

Anyway, I'm happy with the agreement and this exercise has both improved my code and made me aware of some nuances that can drastically alter the results. My overall impression is unchanged in that this is just not the right way to go about these aggregations--they should be resampled from original data. And honestly, 200+ properties is too many to be useful anyway, especially considering none of them have known uncertainty/errors. A better approach, IMO, would be to select maybe 20-30 max and just sample them directly.