Retrieve records availability from NWIS

qideng7 commented 8 months ago

Is your feature request related to a problem?

Currently, NWIS.get_info does not return the data availability range (the begin and end dates of a station's record).

from pygeohydro import NWIS

nwis = NWIS()
SiteID = "01636500"
ParamCd = "00060"
query = {
    "site": SiteID,
    "parameterCd": ParamCd,
    "siteStatus": "all",
}
SiteInfo = nwis.get_info(query, expanded=True)
print(SiteInfo.columns)

Index(['agency_cd', 'site_no', 'station_nm', 'site_tp_cd', 'dec_lat_va',
       'dec_long_va', 'coord_acy_cd', 'dec_coord_datum_cd', 'alt_va',
       'alt_acy_va', 'alt_datum_cd', 'huc_cd', 'lat_va', 'long_va',
       'coord_meth_cd', 'coord_datum_cd', 'district_cd', 'state_cd',
       'county_cd', 'country_cd', 'land_net_ds', 'map_nm', 'map_scale_fc',
       'alt_meth_cd', 'basin_cd', 'topo_cd', 'instruments_cd',
       'construction_dt', 'inventory_dt', 'drain_area_va',
       'contrib_drain_area_va', 'tz_cd', 'local_time_fg', 'reliability_cd',
       'gw_file_cd', 'hcdn_2009', 'geometry'],
      dtype='object')

However, the range is available, as begin_date and end_date, in the xarray object returned by get_streamflow:

SiteFlow = nwis.get_streamflow(SiteID, dates=("2010-01-01", "2010-01-05"), to_xarray=True)
SiteFlow

It would be better to check the availability range first and then decide which dates to pass to get_streamflow.
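As a rough sketch of that workflow, the requested window could be clipped to the station's record before calling get_streamflow. The helper name clamp_dates and the availability dates below are hypothetical and only illustrate the idea; they are not part of pygeohydro.

```python
import pandas as pd


def clamp_dates(start, end, begin_date, end_date):
    """Clip a requested (start, end) window to the available record."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    begin_date, end_date = pd.Timestamp(begin_date), pd.Timestamp(end_date)
    clipped_start = max(start, begin_date)
    clipped_end = min(end, end_date)
    if clipped_start > clipped_end:
        raise ValueError("Requested window does not overlap the available record.")
    return clipped_start.strftime("%Y-%m-%d"), clipped_end.strftime("%Y-%m-%d")


# Hypothetical availability range, for illustration only.
dates = clamp_dates("1990-01-01", "2010-01-05", "1995-10-01", "2024-01-01")
print(dates)
```

The resulting tuple has the same shape as the dates argument used in the get_streamflow call above.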

Describe the solution you'd like

The range is directly available through the NWIS Site Service by setting seriesCatalogOutput to true:

import pandas as pd
import requests

url = "https://waterservices.usgs.gov/nwis/site/?format=rdb&sites=01636500&seriesCatalogOutput=true&siteStatus=all&hasDataTypeCd=dv&outputDataTypeCd=dv"
r = requests.get(url, allow_redirects=True)
r.raise_for_status()
lines = r.content.decode("utf-8").split("\n")
# The header is the first line that is not a "#" comment.
start_index = next(i for i, line in enumerate(lines) if not line.startswith("#"))
column_names = lines[start_index].split("\t")
# Skip the row after the header: it holds RDB column formats, not data.
data_rows = [line.split("\t") for line in lines[start_index + 2 :] if line.strip()]
df = pd.DataFrame(data_rows, columns=column_names)
df
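The same RDB parsing can probably be done more compactly with pandas.read_csv. The sketch below runs on a small hand-written sample rather than a live request; the sample rows and the helper name read_rdb are assumptions for illustration, not real NWIS output or a pygeohydro function.

```python
import io

import pandas as pd

# Tiny hand-written sample in RDB layout (values are illustrative, not real NWIS output).
SAMPLE_RDB = (
    "# comment line\n"
    "# another comment\n"
    "agency_cd\tsite_no\tparm_cd\tbegin_date\tend_date\n"
    "5s\t15s\t5s\t10d\t10d\n"
    "USGS\t01636500\t00060\t2000-01-01\t2020-12-31\n"
    "USGS\t01636500\t00010\t2006\t2010-09-11\n"
)


def read_rdb(text):
    """Parse RDB text: skip '#' comment lines and the column-format row."""
    # dtype=str keeps leading zeros in site numbers and parameter codes.
    # Note: comment="#" would also truncate any field that contains "#".
    df = pd.read_csv(io.StringIO(text), sep="\t", comment="#", dtype=str)
    # The first row after the header holds column formats such as "5s".
    return df.iloc[1:].reset_index(drop=True)


catalog = read_rdb(SAMPLE_RDB)
print(catalog)
```

Reading everything as strings is a deliberate choice here, since site numbers such as 01636500 would otherwise lose their leading zero.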

Describe alternatives you've considered

No response

Additional context

No response

cheginit commented 8 months ago

Thanks for the suggestion. Note that you can already do this:

query = {
    "site": SiteID,
    "parameterCd": ParamCd,
    "siteStatus": "all",
    "seriesCatalogOutput": "true",
}
SiteInfo = nwis.get_info(query)

I think adding it as a default argument makes sense. Note that it cannot be combined with expanded=True.

cheginit commented 8 months ago

I just realized what the issue is. If you want to get begin_date and end_date with the basic request, you need to explicitly pass outputDataTypeCd as a request parameter. Using seriesCatalogOutput by default is not a good idea, since it returns ALL begin and end dates regardless of the parameter code, so you end up with many rows for each station. For your case, you can do:
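A minimal sketch of such a request, assuming the same query keys as in the earlier snippets plus outputDataTypeCd set to "dv" (the value used in the URL from the original post). The helper name availability_query is hypothetical, and this is an illustration of the approach described above rather than the snippet originally posted in this comment.

```python
def availability_query(site_id, param_cd, data_type="dv"):
    """Build a site-service query restricted to one parameter and data type."""
    return {
        "site": site_id,
        "parameterCd": param_cd,
        "siteStatus": "all",
        # Assumption: "dv" (daily values), as in the URL from the original post.
        "outputDataTypeCd": data_type,
    }


query = availability_query("01636500", "00060")
print(query)

# With pygeohydro installed and network access, the query could then be
# passed on in the same way as the earlier snippets:
# from pygeohydro import NWIS
# info = NWIS().get_info(query)
```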

qideng7 commented 8 months ago

Thank you! You are right, the best solution is to set outputDataTypeCd. I also got a date-string parsing error with seriesCatalogOutput:

{
    "name": "ValueError",
    "message": "unconverted data remains when parsing with format \"%Y\": \"-09-11\", at position 1. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.",
    "stack": "---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[9], line 5
      1 query = {
      2     \"site\": \"01636500\",
      3 \t\"seriesCatalogOutput\": \"true\",
      4 }
----> 5 SiteInfo = nwis.get_info(query)

File c:\\Users\\TUQD\\anaconda3\\envs\\HyRiver-2\\Lib\\site-packages\\pygeohydro\\nwis.py:383, in NWIS.get_info(self, queries, expanded, fix_names, nhd_info)
    380     numeric_cols += [\"drain_area_va\", \"contrib_drain_area_va\"]
    382 with contextlib.suppress(KeyError):
--> 383     sites[\"begin_date\"] = pd.to_datetime(sites[\"begin_date\"])
    384     sites[\"end_date\"] = pd.to_datetime(sites[\"end_date\"])
    386 if nhd_info:

File c:\\Users\\TUQD\\anaconda3\\envs\\HyRiver-2\\Lib\\site-packages\\pandas\\core\\tools\\datetimes.py:1063, in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
   1061             result = arg.tz_localize(\"utc\")
   1062 elif isinstance(arg, ABCSeries):
-> 1063     cache_array = _maybe_cache(arg, format, cache, convert_listlike)
   1064     if not cache_array.empty:
   1065         result = arg.map(cache_array)

File c:\\Users\\TUQD\\anaconda3\\envs\\HyRiver-2\\Lib\\site-packages\\pandas\\core\\tools\\datetimes.py:247, in _maybe_cache(arg, format, cache, convert_listlike)
    245 unique_dates = unique(arg)
    246 if len(unique_dates) < len(arg):
--> 247     cache_dates = convert_listlike(unique_dates, format)
    248     # GH#45319
    249     try:

File c:\\Users\\TUQD\\anaconda3\\envs\\HyRiver-2\\Lib\\site-packages\\pandas\\core\\tools\\datetimes.py:433, in _convert_listlike_datetimes(arg, format, name, utc, unit, errors, dayfirst, yearfirst, exact)
    431 # `format` could be inferred, or user didn't ask for mixed-format parsing.
    432 if format is not None and format != \"mixed\":
--> 433     return _array_strptime_with_fallback(arg, name, utc, format, exact, errors)
    435 result, tz_parsed = objects_to_datetime64(
    436     arg,
    437     dayfirst=dayfirst,
   (...)
    441     allow_object=True,
    442 )
    444 if tz_parsed is not None:
    445     # We can take a shortcut since the datetime64 numpy array
    446     # is in UTC

File c:\\Users\\TUQD\\anaconda3\\envs\\HyRiver-2\\Lib\\site-packages\\pandas\\core\\tools\\datetimes.py:467, in _array_strptime_with_fallback(arg, name, utc, fmt, exact, errors)
    456 def _array_strptime_with_fallback(
    457     arg,
    458     name,
   (...)
    462     errors: str,
    463 ) -> Index:
    464     \"\"\"
    465     Call array_strptime, with fallback behavior depending on 'errors'.
    466     \"\"\"
--> 467     result, tz_out = array_strptime(arg, fmt, exact=exact, errors=errors, utc=utc)
    468     if tz_out is not None:
    469         unit = np.datetime_data(result.dtype)[0]

File strptime.pyx:501, in pandas._libs.tslibs.strptime.array_strptime()

File strptime.pyx:451, in pandas._libs.tslibs.strptime.array_strptime()

File strptime.pyx:587, in pandas._libs.tslibs.strptime._parse_with_format()

ValueError: unconverted data remains when parsing with format \"%Y\": \"-09-11\", at position 1. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this."
}

cheginit commented 8 months ago

The string issue occurs because, with seriesCatalogOutput, some dates are given as a year only (for example, just 2006 instead of a full date). I added a fix for such cases, so from the next version this exception will no longer be raised.
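For anyone on an older version, one possible workaround is to normalize the strings before parsing. The sketch below pads year-only values to January 1, which is an assumption made for illustration; it is not the fix that was added to pygeohydro, and the helper name parse_catalog_dates is hypothetical.

```python
import pandas as pd


def parse_catalog_dates(values):
    """Parse dates given either as 'YYYY' or 'YYYY-MM-DD' strings."""
    s = pd.Series(values).astype(str).str.strip()
    # Pad year-only entries to January 1 so every value shares one format.
    year_only = s.str.fullmatch(r"\d{4}")
    s = s.where(~year_only, s + "-01-01")
    # Anything that still does not match the format becomes NaT.
    return pd.to_datetime(s, format="%Y-%m-%d", errors="coerce")


parsed = parse_catalog_dates(["2006", "2010-09-11"])
print(parsed)
```

Using an explicit format avoids the per-column format inference that produced the "%Y" error in the traceback above.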

qideng7 commented 8 months ago

Sounds great, thank you!
