Closed by kratzert 1 year ago
Hey, dear Frederik! I'm very excited about this extension. Thank you very much for your continued contributions to the hydrology community.
But I found an inconsistency in the meteorological data of the new GRDC extension dataset, and I hope you can check it out. According to the file 'national_station_ids.csv', GRDC_4208527 is the same hydrological station as 07EE009. Hence, the data in 'GRDC_4208527.csv' should be the same as that in 'hysets_07EE009.csv' for the same dates. The runoff data is relatively consistent. However, 'total_precipitation_sum' and 'potential_evaporation_sum' are quite different. Moreover, the mean runoff is much larger than the mean precipitation in 'GRDC_4208527.csv'. Similar issues exist for 'hysets_07EC004.csv' and 'GRDC_4208507.csv'.
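For anyone who wants to repeat this kind of check on other duplicated gauges, here is a minimal sketch of a date-aligned comparison. The helper name `compare_gauges` and the toy values are made up for illustration, and it assumes both tables have a `date` column plus identically named feature columns.

```python
import pandas as pd


def compare_gauges(df_a, df_b, columns):
    """Compare selected columns of two gauge tables on their shared dates."""
    # Inner merge keeps only the dates present in both tables.
    merged = df_a.merge(df_b, on="date", suffixes=("_a", "_b"))
    rows = []
    for col in columns:
        a, b = merged[f"{col}_a"], merged[f"{col}_b"]
        rows.append({
            "feature": col,
            "mean_a": a.mean(),
            "mean_b": b.mean(),
            "max_abs_diff": (a - b).abs().max(),
        })
    return pd.DataFrame(rows).set_index("feature")


# Toy stand-ins for the two CSV files (values are invented, not real data).
grdc = pd.DataFrame({
    "date": pd.date_range("2000-01-01", periods=4),
    "total_precipitation_sum": [0.0, 0.2, 0.1, 0.0],
    "temperature_2m_mean": [1.0, 2.0, 3.0, 4.0],
})
hysets = pd.DataFrame({
    "date": pd.date_range("2000-01-02", periods=4),
    "total_precipitation_sum": [4.8, 2.4, 0.0, 1.0],
    "temperature_2m_mean": [2.0, 3.0, 4.0, 5.0],
})

report = compare_gauges(
    grdc, hysets, ["total_precipitation_sum", "temperature_2m_mean"]
)
print(report)
```

With the real files, the two frames would presumably come from something like `pd.read_csv(path, parse_dates=["date"])`, assuming the date column is indeed called `date`.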
Hi, I am currently on vacation and will come back to this once I return (next week). Thanks for reporting though
I returned today and had a quick look at the gauges you reported. I can confirm that I see the same issue. I have not had time for a deep dive yet, but here is my current hypothesis: according to the ERA5-Land definition, total_precipitation and potential_evaporation (and likewise surface_net_solar_radiation and surface_net_thermal_radiation) are features accumulated over the day, whereas all other features are instantaneous. This requires an additional step in the feature computation: we first need to dis-aggregate the hourly ERA5 values of the accumulated features into hourly "instantaneous" values, before we shift the hourly ERA5 data to the local time zone and then compute the daily aggregates in local time. This is done in this function https://github.com/kratzert/Caravan/blob/main/code/caravan_utils.py#L323.
However, because of its size, the GRDC extension was actually run with slightly different code on Google infrastructure. My current thought is that the dis-aggregation may not have been called correctly. If this is true, it should affect all gauges in the GRDC extension, which I will check asap. You can see that all instantaneous features match perfectly, also for the two sets of gauges you posted.
Regarding the runoff to precipitation ratio: if my hypothesis from above is correct, the current total_precipitation_sum is not actually the mm/day precipitation but rather the "average hourly precipitation" on that calendar day. It then probably makes sense that the rainfall value is smaller than the runoff value.
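To make the three steps concrete (de-accumulate, shift to local time, aggregate per day), here is a rough pandas sketch. It is not the function linked above: `deaccumulate` and `daily_sum_local` are hypothetical names, and the sketch assumes that the running total restarts with the value stamped 01:00 UTC, that each timestamp marks the end of its hour, and that a fixed UTC offset is good enough.

```python
import numpy as np
import pandas as pd


def deaccumulate(acc):
    """Turn daily running totals into per-hour amounts.

    Assumption: the value stamped 01:00 UTC starts a new accumulation,
    so it is kept as is; every other hour is the difference to the
    previous hour.
    """
    restart = acc.index.hour == 1
    return pd.Series(np.where(restart, acc, acc.diff()), index=acc.index)


def daily_sum_local(hourly_utc, utc_offset_hours):
    """Shift hourly amounts to local time and sum them per calendar day.

    Assumption: timestamps mark the END of each hour, so they are moved
    back by one hour before assigning each amount to a local day.
    """
    local = hourly_utc.copy()
    local.index = local.index - pd.Timedelta(hours=1) + pd.Timedelta(hours=utc_offset_hours)
    return local.resample("1D").sum()


# Synthetic example: 0.5 mm every hour for two UTC days.
index = pd.date_range("2000-01-01 01:00", periods=48, freq=pd.Timedelta(hours=1))
true_hourly = pd.Series(0.5, index=index)
# Build the accumulated form: a running total that restarts every 24 steps.
accumulated = true_hourly.groupby(np.arange(48) // 24).cumsum()

hourly = deaccumulate(accumulated)
daily = daily_sum_local(hourly, utc_offset_hours=-7)
print(daily)
```

Only the local day that is fully covered by the 48 hours of input gets a complete 24-hour total; the first and last local days are partial, which is one reason the time shift has to happen before the daily aggregation.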
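As a quick plausibility check along these lines, one can compute the long-term runoff ratio per gauge; a value well above 1 means more water leaves the basin than falls on it and is worth a closer look. This is only a sketch: the function name is made up, the column names `streamflow` and `total_precipitation_sum` are assumed to both be in mm/day, and the numbers are invented.

```python
import pandas as pd


def runoff_ratio(df, q_col="streamflow", p_col="total_precipitation_sum"):
    """Mean runoff divided by mean precipitation over days where both exist."""
    valid = df[[q_col, p_col]].dropna()
    return valid[q_col].mean() / valid[p_col].mean()


# Invented values: plausible precipitation versus the same series scaled
# down by 24, mimicking a daily value that is really an hourly average.
plausible = pd.DataFrame({
    "streamflow": [1.0, 2.0, 1.0, 2.0],
    "total_precipitation_sum": [4.0, 6.0, 2.0, 0.0],
})
suspicious = plausible.assign(
    total_precipitation_sum=plausible["total_precipitation_sum"] / 24
)

ratio_ok = runoff_ratio(plausible)
ratio_bad = runoff_ratio(suspicious)
print(ratio_ok, ratio_bad)
```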
I will keep you posted but thanks for reporting your finding.
Update: I checked a few additional gauges that have duplicates between this GRDC extension and the original Caravan gauges. All seem to be affected by the same problem. It is midnight here now and I will not dig further into it today, but I will look into this tomorrow. I also added a note to #10 so that people are aware of this ongoing investigation.
Last update: I think I found the problem already, and if I am right, it should be easy to fix. I think the reason is that some internal code that I used to load the raw timeseries data already dis-aggregated the accumulated features, and then I did it again with the Caravan code. Again, this is only a problem here because this extension used a mix of Google internal code/infrastructure and the public Caravan code. I should know more tomorrow.
Okay, I found a free minute to look at it, and it is indeed the problem that I described above. Fortunately, this also means the fix is quite easy on my end, as I only need to comment out the explicit de-accumulation that I added (since the data returned from our internal code is already "instantaneous", even for features that are accumulated by definition of ERA5-Land/ECMWF).
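A toy example of why a second de-accumulation shrinks the daily totals so much: if the de-accumulation is applied to values that are already per-hour amounts, the hourly differences telescope and the daily sum collapses to roughly a single hourly value, with negative "precipitation" along the way. This is only an illustration under the same restart-at-01:00-UTC assumption, with made-up numbers and a hypothetical helper, not the internal pipeline.

```python
import numpy as np
import pandas as pd


def deaccumulate(acc):
    """Difference consecutive hours, keeping the 01:00 UTC value as is."""
    restart = acc.index.hour == 1
    return pd.Series(np.where(restart, acc, acc.diff()), index=acc.index)


# One UTC day of invented hourly precipitation amounts in mm (sum = 10.0).
index = pd.date_range("2000-01-01 01:00", periods=24, freq=pd.Timedelta(hours=1))
hourly = pd.Series([0.0, 0.5, 1.5, 2.0, 1.0, 0.5] + [0.25] * 18, index=index)

# Correct path: accumulated input, de-accumulated exactly once.
once = deaccumulate(hourly.cumsum())
# Faulty path: the input is already hourly, but is de-accumulated again.
twice = deaccumulate(hourly)

print("daily total, de-accumulated once: ", once.sum())
print("daily total, de-accumulated twice:", twice.sum())
print("most negative hourly value:       ", twice.min())
```

Instantaneous features such as temperature never pass through this step, which fits the observation that they match between the two datasets.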
Long story short here are a few plots to show the problem and show the data after fix:
Here is precipitation of the gauge that you mentioned above (GRDC_4208527 and hysets_07EE009).
Here is temperature
Summary: As you reported, precipitation has much lower values (the same is true for the three other accumulated bands), whereas instantaneous features like temperature are unaffected and align almost perfectly for the period shared between the GRDC extension and the same gauge in the HYSETS data.
Here is precipitation after the fix.
And here is a scatter plot of the overlapping data for precipitation after the fix. Note that the remaining minimal differences are possibly due to, e.g., small differences between the HYSETS basin polygons and the GRDC basin polygons.
Temperature is not affected, so I spare the plot (it looks identical to above).
I already processed the forcing data for all GRDC gauges. The only things left to do are to merge the data with the streamflow data, recompute the climate attributes, and then send the data to the GRDC guys for upload. Tomorrow is a state holiday and my son's birthday, but I hope to get this done by EOW.
Thanks for your efforts. Happy holidays to you and happy birthday to your child!
Just to keep you updated: the extension has been ready for over a week now, but there are problems with uploading the data to Zenodo on the GRDC side. If this takes much longer, we might consider uploading the updated data to a different data sharing platform (e.g. HydroSHARE).
can't wait to see the newest data :satisfied:
The updated data is uploaded and can be found here (v.0.2). Note: There are 4 gauges that were just identified to have slight errors in their lat/long and their polygons. These gauges are GRDC_5606274, GRDC_5606174, GRDC_5202086, GRDC_5202088. For now, maybe ignore these 4 gauges, but please report any other inconsistency/error that you find.
Basin prefix
grdc
Zenodo DOI
https://zenodo.org/records/10074416
Number of catchments
5357
Location of catchments
Globally (25 countries)
For which periods are streamflow records available in your dataset?
1950-2023, varying length for each basin.
Please list any sources of the data you contributed.
License
CC-BY-4.0
Additional context
Authors: Färber, Claudia; Plessow, Henning; Kratzert, Frederik; Addor, Nans; Shalev, Guy; Looser, Ulrich
The extension includes the subset of hydrological discharge data and station-based watersheds from the Global Runoff Data Centre (GRDC) that is covered by an open data policy (Attribution 4.0 International; CC BY 4.0). In total, the dataset covers 5357 catchments in 25 countries worldwide, with time series records from 1950 to 2023.
GRDC is an international data centre operating under the auspices of the World Meteorological Organization (WMO) at the German Federal Institute of Hydrology (BfG). Established in 1988, it holds the most substantive collection of quality assured river discharge data worldwide. Primary providers of river discharge data and associated metadata are the National Hydrological and Hydro-Meteorological Services of WMO Member States.
Because of the size of this extension, we provide an archive with all timeseries data as csv files and one archive with all timeseries data as netcdf files. Both are available from the Zenodo link.
Note: This extension contains basins of all sizes, ignoring the 2000 km² threshold. In order to be able to process really large basins on EarthEngine, we slightly adapted the script that computes the attributes; the adapted script will be pushed to this repository soon.
Checklist