Check COMID match for 3 gages

jds485 commented 1 month ago

I found 3 gages that may be matched to an incorrect COMID. Below I've listed the gage ID, proposed new COMID, and the current COMID from nhdplusTools::get_nldi_feature. All these proposed edits are based on looking at maps of the gage locations and NHDplusV2 network flowlines.

COMID = case_when(GAGE_ID == '09331950' ~ '4894919',  # current COMID "4894001"
                  GAGE_ID == '12392895' ~ '24115829', # current COMID "24115835"
                  GAGE_ID == '04219767' ~ '15561021') # current COMID "15559985"

dblodgett-usgs commented 1 month ago

Thanks for calling these out @jds485.

For https://waterdata.usgs.gov/monitoring-location/09331950/#period=P1Y&showMedian=true NWIS has a drainage area of 13.6 sqmi but the comid it appears to be right on top of has a drainage area of 127 square miles. Something doesn't add up. The comid that it is on in reference gages is actually much too small but closer (1 square mile). Any idea if that NWIS drainage area could be off by a factor of 10 or so?

https://waterdata.usgs.gov/monitoring-location/12392895/#period=P1Y&showMedian=true is the same case. (13.5sqmi NWIS vs 30sqmi NHDPlus)

As is https://waterdata.usgs.gov/monitoring-location/04219767/#parameterCode=00065&period=P7D&showMedian=false (71 sqmi vs 1255 sqmi)

In these cases, I can switch it to your suggestion by just filtering out cases where the NWIS drainage area is much too small to make sense with any nearby NHDPlusV2 flowline. The question is -- are these maybe instances where NHDPlusV2 just doesn't have a flowline for the drainage in question? Would it be better to have no network association in these cases that are highly uncertain?

jds485 commented 1 month ago

> Any idea if that NWIS drainage area could be off by a factor of 10 or so?

I am not sure - I do not know these gages well enough to say.

> The question is -- are these maybe instances where NHDPlusV2 just doesn't have a flowline for the drainage in question?

I also wonder if the gage locations (lat, long) are correct. I'd be happy to ask the WSCs for more info and can send them a link to this issue. Let me know if you'd like for me to do that.

> Would it be better to have no network association in these cases that are highly uncertain?

Maybe a new dataset column indicating that the reference network flowline drainage area and the NWIS listed drainage area are in disagreement (different by some factor), and then users can decide if they want to use the gage in their studies. I'm not sure if that new column would be in NWIS, in the NLDI or both places. I have found that QAQC by looking at drainage areas and drainage area ratios (to evaluate gage nestedness) is effective for identifying cases like these.
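
A minimal sketch of such a flag column, using the three drainage-area pairs quoted earlier in this thread (square miles). The function name `flag_da_disagreement`, the column names, and the default factor of 2 are all hypothetical choices for illustration, not anything that exists in NWIS or the NLDI.

```r
# Drainage areas (sq mi) as quoted in this thread for the three gages.
gages <- data.frame(
  gage_id = c("09331950", "12392895", "04219767"),
  nwis_da_sqmi = c(13.6, 13.5, 71),
  nhd_da_sqmi = c(127, 30, 1255),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when the two estimates differ by more than
# `max_factor` in either direction, NA when either estimate is missing.
flag_da_disagreement <- function(nwis_da, nhd_da, max_factor = 2) {
  ratio <- pmax(nwis_da, nhd_da) / pmin(nwis_da, nhd_da)
  ratio > max_factor
}

gages$da_disagree <- flag_da_disagreement(gages$nwis_da_sqmi, gages$nhd_da_sqmi)
print(gages)
```

With a factor of 2, all three gages are flagged. Users could then choose their own tolerance instead of relying on a single hard-coded cutoff.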

dblodgett-usgs commented 1 month ago

In this case, a check with the science center is probably in order. We just don't have evidence to say what's right and wrong.

I've come back to this issue of gage/network association quality many times and still haven't found a generalization that I'm happy with. At the end of the day, we have a set of match evidence (nearness, name, drainage area, etc.), and some strength or weakness metric that is scoped to the line of evidence.

Since many gages don't have a drainage area or the stream a gage is on/near doesn't have a name, the match / match strength field is very often NULL for everything except nearness.

Taking the drainage area metric on its face -- I would argue that when we have a drainage area mismatch with any reasonably nearby flowline of greater than some threshold, we would just throw out that evidence as seemingly a blunder (in the gage area or the hydrography data). That could lead to an arrangement where we have one column per match criterion that is NULL if the criterion did not apply and otherwise holds some numerical or categorical value scoped to the criterion in question.

In the case of drainage area, each gage/network association represents a hydrologic location along the hydrographic network. If we were to add a drainage area match quality metric, I would want to do it as the normalized difference (expressed as a %) between the hydrographic estimate of drainage area and the stated estimate of drainage area of the gage.

100 * (A_hydrography - A_gage) / A_hydrography = (percent_diff_A_hydro-gage)

Given that gages are along flowlines (some distance upstream of a confluence) and flowline drainage area estimates apply at the flowline outlet, we would generally expect that hydrography drainage area estimate to be very slightly larger than the gage. So we would expect percent_diff_A_hydro-gage to be a small positive number corresponding to the partial catchment downstream of the gage location.
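
As a rough sketch, the metric above could be written as a small R function. The function name `percent_diff_hydro_gage` is hypothetical. The first call uses the 127 vs 13.6 sq mi pair mentioned for 09331950, and the other two use made-up numbers to show the expected small positive case and a negative case.

```r
# Hypothetical helper implementing
# 100 * (A_hydrography - A_gage) / A_hydrography
percent_diff_hydro_gage <- function(a_hydrography, a_gage) {
  100 * (a_hydrography - a_gage) / a_hydrography
}

# 09331950: flowline area 127 sq mi vs NWIS 13.6 sq mi (a large mismatch)
mismatch <- percent_diff_hydro_gage(127, 13.6)

# Made-up values: gage slightly upstream of the flowline outlet
expected_case <- percent_diff_hydro_gage(10.2, 10)

# Made-up values: gage area larger than the flowline estimate
negative_case <- percent_diff_hydro_gage(9, 10)

print(c(mismatch = mismatch, expected = expected_case, negative = negative_case))
```

The first value comes out near 89%, far from the small positive number expected when a gage sits just upstream of its flowline outlet.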

Does this track? I've been wanting to crack this nut for a long long time and am ready to do a bit more on reference gages in the next few months. Maybe I can work something like this in.

jds485 commented 1 month ago

Yes, that sounds good to me. I think you will find some negative percent differences, based on my recent experience comparing NWIS drainage areas and network drainage areas for about 4500 gage sites. The TOT_BASIN_AREA is an NHDPlusV2 estimate from Mike Wieczorek's ScienceBase data release.

[Figure: TOT_DA_vs_gauge_DA_log10]

There are some NWIS areas that are larger than their NHD-based flowline outlet area estimates. Some of those sites are affected by flow alterations and diversions, so maybe one data source considered the area of the watershed diverting flows as well. And sometimes the NWIS contributing drainage area is a closer match to the NHDPlusV2-based drainage area. I dropped all those from my work because I did not know what to use. A percent difference metric could help to quickly identify sites like these.

dblodgett-usgs commented 1 month ago

OK -- I'm working through this and #26 now. One point of potential complexity is that I have some prior network associations in the mix that I evaluate based on drainage area. Right now, I just throw out pre-existing associations if the drainage area difference is more than 50% different than the stated gage drainage area.

Should I do any record keeping around this filtering of problematic drainage area? I use this greater than 0.5:

abs(all_gages$nhdpv2_totdasqkm - all_gages$drainage_area_sqkm) / all_gages$drainage_area_sqkm

For example:

https://waterdata.usgs.gov/monitoring-location/08363100

Gage drainage area is 0.4 sqmi / 1 sqkm and SWIM has it on the Rio Grande. In reality there is no flowline for this gage in NHDPlusV2 and we've just fallen back on what's close in all cases.

another e.g. https://waterdata.usgs.gov/monitoring-location/05517600

The NHDPlusV2 gages layer has that on a 3.5 sqkm flowline but NWIS has it as a 0.7 sqkm drainage area.

I'll just remove these as clear unrealistic associations and move on to the next step (automated search for the best match) but figured I'd jot down the question to see what you think.
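
One possible way to keep a record of this filtering is to store the ratio and a flag as columns before splitting the table, rather than dropping rows silently. This is only a sketch: the `norm_diff_da` and `da_filtered` column names and the second gage row are hypothetical, while the 05517600 values (3.5 vs 0.7 sqkm) are the ones quoted above.

```r
all_gages <- data.frame(
  gage_id = c("05517600", "hypothetical_ok"),
  nhdpv2_totdasqkm = c(3.5, 52),   # flowline drainage area
  drainage_area_sqkm = c(0.7, 50), # gage provider's drainage area
  stringsAsFactors = FALSE
)

# Same ratio as in the comment above, kept as a column for record keeping.
all_gages$norm_diff_da <-
  abs(all_gages$nhdpv2_totdasqkm - all_gages$drainage_area_sqkm) /
  all_gages$drainage_area_sqkm

# Flag rather than drop, so the reason for removal stays inspectable.
all_gages$da_filtered <- !is.na(all_gages$norm_diff_da) &
  all_gages$norm_diff_da > 0.5

removed <- all_gages[all_gages$da_filtered, ]
kept <- all_gages[!all_gages$da_filtered, ]
print(removed)
```

For 05517600 the ratio is about 4, well above the 0.5 cutoff, so it lands in `removed` with its ratio preserved.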

jds485 commented 1 month ago

> Right now, I just throw out pre-existing associations if the drainage area difference is more than 50% different than the stated gage drainage area. Should I do any record keeping around this filtering of problematic drainage area? I use this greater than 0.5.

This is a good filter to keep, but it will also remove associations for gages that are located in the upper portions of their flowlines. The threshold for the gages I worked with was as high as 4 (5 sq. mi. watershed with a gage in the upper 1 sq. mi.), but all the gages I looked at had a COMID associated with them, so I might not be understanding how you apply this filter (maybe your comment about automated search for the best match is why they have associated flowlines?).

> In reality there is no flowline for this gage in NHDPlusV2 and we've just fallen back on what's close in all cases.

Does this mean that you would assign this gage to the nearest NHDPlusV2 flowline, even though in this case the gage is not measuring the flow on that flowline? That would explain why I found a lot of drainage area ratios that were enormous and then when I looked at the gage page I saw that the gage was on a tributary of a mapped NHDPlusV2 flowline (so I dropped the gage from my analysis). I would appreciate a column indicating matches like this so that users can inspect sites for their studies as needed.

dblodgett-usgs commented 1 month ago

> maybe your comment about automated search for the best match is why they have associated flowlines?

Yes. I double check for automated associations later in the workflow so many of these will come back.

> even though in this case the gage is not measuring the flow on that flowline?

Yes -- that's the case. Since the NLDI is focused on discovery, I've been overly generous in the associations. I should probably stop that practice.

dblodgett-usgs commented 1 month ago

I think I've settled on these criteria for now.

  bad_da <- all_gages[!is.na(da_diff) & # has an estimate
                        (
                          # small basins (<= 100 sqkm): use the unnormalized
                          # difference because differences are so quantized
                          # by catchment resolution; when da_diff is negative,
                          # require agreement within 25%
                          (all_gages$drainage_area_sqkm <= 100 &
                             (da_diff > 10 | (da_diff < 0 & abs_norm_diff_da > 0.25))) |
                          # tens of catchments: require agreement within 10%
                          (all_gages$drainage_area_sqkm > 100 &
                             abs_norm_diff_da > 0.1) |
                          # hundreds of catchments: require agreement within 5%
                          (all_gages$drainage_area_sqkm > 500 &
                             abs_norm_diff_da > 0.05)), ]
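
For readers who want to try these criteria on their own numbers, here is a self-contained sketch. It assumes `da_diff` is the NHDPlusV2 area minus the gage area and `abs_norm_diff_da` is the absolute difference divided by the gage area. Neither definition is shown in the snippet above, so both are guesses based on earlier comments in this thread, and `flag_bad_da` is a hypothetical name.

```r
# Hypothetical wrapper around the criteria above.
# Assumed definitions (not confirmed by the snippet above):
#   da_diff          = nhd_da_sqkm - gage_da_sqkm
#   abs_norm_diff_da = abs(da_diff) / gage_da_sqkm
flag_bad_da <- function(gage_da_sqkm, nhd_da_sqkm) {
  da_diff <- nhd_da_sqkm - gage_da_sqkm
  abs_norm_diff_da <- abs(da_diff) / gage_da_sqkm

  # Small basins: absolute tolerance of 10 sqkm above, 25% below.
  small <- gage_da_sqkm <= 100 &
    (da_diff > 10 | (da_diff < 0 & abs_norm_diff_da > 0.25))
  # Tens of catchments: more than 10% off.
  medium <- gage_da_sqkm > 100 & abs_norm_diff_da > 0.1
  # Hundreds of catchments: more than 5% off.
  large <- gage_da_sqkm > 500 & abs_norm_diff_da > 0.05

  # Gages without an estimate are never flagged.
  !is.na(da_diff) & (small | medium | large)
}

# Made-up areas (sqkm) chosen to exercise each branch.
gage_da <- c(0.7, 35, 50, 200, 1000, NA)
nhd_da  <- c(3.5, 329, 30, 215, 1070, 12)
flags <- flag_bad_da(gage_da, nhd_da)
print(flags)
```

Under these assumptions the result is FALSE, TRUE, TRUE, FALSE, TRUE, FALSE. Note that a 0.7 sqkm gage on a 3.5 sqkm flowline is not flagged here, because the small-basin branch tolerates up to 10 sqkm of extra flowline area.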

dblodgett-usgs commented 1 month ago

ref_gages will now include these new fields:

nhdpv2_totdasqkm, which corresponds to the linked comid.
nhdpv2_link_source, which gives a url for what project or dataset created the link.
gage_totdasqkm, which gives the gage provider's estimate of drainage area.
dasqkm_diff, which is the absolute difference between the gage and nhdpv2 drainage area.

jds485 commented 1 month ago

Looks good! I'd recommend rounding the drainage areas to the precision of the original source (maybe nearest 0.3 km^2 because I think 0.1 mi^2 is the reported precision on NWIS)
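
To make the precision point concrete, here is a small sketch of the unit conversion. The helper name `sqmi_to_sqkm` and the choice to round to one decimal place are assumptions for illustration. The conversion factor is the standard 2.589988 km^2 per mi^2.

```r
SQMI_TO_SQKM <- 2.589988 # km^2 in one mi^2

# Hypothetical helper: convert sq mi to sq km and round to `digits` decimals.
sqmi_to_sqkm <- function(area_sqmi, digits = 1) {
  round(area_sqmi * SQMI_TO_SQKM, digits)
}

# The 0.1 mi^2 reporting precision expressed in km^2.
precision_sqkm <- sqmi_to_sqkm(0.1, digits = 3)

# The 13.6 sq mi NWIS area for 09331950, rounded to one decimal.
example_sqkm <- sqmi_to_sqkm(13.6)

print(c(precision_sqkm = precision_sqkm, example_sqkm = example_sqkm))
```

A step of 0.1 mi^2 is about 0.26 km^2, so reporting converted areas with more than one decimal place would suggest precision that the source does not have.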

dblodgett-usgs commented 1 month ago

Good call -- For #26 I'm also going to add the offset from the network snap and move closer to the NLDI holding a distinction between "on network" hydrolocations and "in catchment" hydrolocations for these cases where the gage is on a flowline smaller than is included in the network we are linking to.

jds485 commented 1 month ago

> for these cases where the gage is on a flowline smaller than is included in the network we are linking to

Are you able to snap to the high-res network and NHDPlusV2 network simultaneously to catch cases like this? I know the high-res network is not available everywhere, but it seems like this could be useful where it is available. This would be for a different issue, just giving an idea. I think Lauren used a similar method for NHM1.1 vs. NHDV2 snaps.

dblodgett-usgs commented 1 month ago

In the future, yes -- right now, no. In cases like we are looking at above -- there isn't a high res flowline either. Seems like the NWIS drainage area might be off by an order of magnitude.

dblodgett-usgs commented 1 month ago

@jds485 would you be able to do a code review for #39 for me?

jds485 commented 1 month ago

Sure, I can do that

internetofwater / ref_gages

Check COMID match for 3 gages #38