hyriver / pygeohydro

A part of HyRiver software stack for accessing hydrology data through web services
https://docs.hyriver.io

'utf-8' codec error from pynhd #120

Closed qideng7 closed 7 months ago

qideng7 commented 7 months ago

What happened?

Passing nhd_info=True to nwis.get_info() raises a UnicodeDecodeError. I was able to reproduce the error in a fresh Colab environment with pygeohydro-0.16.0 and pynhd-0.16.2.

from pygeohydro import NWIS

Outlet = '01500500'
ParamCd = '00060'

nwis = NWIS()

query = {
    "site": Outlet,
    "parameterCd": ParamCd,
    "siteTypeCd": "ST",
    "hasDataTypeCd": "dv"
}
Outlet_gdf = nwis.get_info(query, expanded=True, nhd_info=True)

What did you expect to happen?

The same call worked until recently, but it started raising this error today.

Minimal Complete Verifiable Example

No response

MVCE confirmation

Relevant log output

UnicodeDecodeError                        Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/async_retriever/_utils.py](https://localhost:8080/#) in retriever(uid, url, s_kwds, session, read_type, r_kwds, raise_status)
     81         try:
---> 82             return uid, await getattr(response, read_type)(**r_kwds)
     83         except (ClientResponseError, ValueError) as ex:

17 frames
[/usr/local/lib/python3.10/dist-packages/aiohttp/client_reqrep.py](https://localhost:8080/#) in text(self, encoding, errors)
   1147 
-> 1148         return self._body.decode(  # type: ignore[no-any-return,union-attr]
   1149             encoding, errors=errors

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 31378: invalid start byte

During handling of the above exception, another exception occurred:

UnicodeDecodeError                        Traceback (most recent call last)
[<ipython-input-2-89c816bf29ae>](https://localhost:8080/#) in <cell line: 14>()
     12     "hasDataTypeCd": "dv"
     13 }
---> 14 Outlet_gdf = nwis.get_info(query, expanded=True, nhd_info=True)

[/usr/local/lib/python3.10/dist-packages/pygeohydro/nwis.py](https://localhost:8080/#) in get_info(self, queries, expanded, fix_names, nhd_info)
    385 
    386         if nhd_info:
--> 387             nhd = self._nhd_info(sites["site_no"].to_list())
    388             sites = pd.merge(sites, nhd, left_on="site_no", right_on="site_no", how="left")
    389 

[/usr/local/lib/python3.10/dist-packages/pygeohydro/nwis.py](https://localhost:8080/#) in _nhd_info(site_ids)
    296         except (TypeError, IntCastingNaNError):
    297             area["comid"] = area["comid"].astype("Int32")
--> 298         nhd_area = pynhd.streamcat("fert", comids=area["comid"].dropna().to_list())
    299         area = area.merge(
    300             nhd_area[["COMID", "WSAREASQKM"]], left_on="comid", right_on="COMID", how="left"

[/usr/local/lib/python3.10/dist-packages/pynhd/nhdplus_derived.py](https://localhost:8080/#) in streamcat(metric_names, metric_areas, comids, regions, states, counties, conus, percent_full, area_sqkm)
    666         A dataframe with the requested metrics.
    667     """
--> 668     sc = StreamCatValidator()
    669     names = [metric_names] if isinstance(metric_names, str) else metric_names
    670     names = [sc.alt_names.get(s.lower(), s.lower()) for s in names]

[/usr/local/lib/python3.10/dist-packages/pynhd/nhdplus_derived.py](https://localhost:8080/#) in __init__(self)
    533 class StreamCatValidator(StreamCat):
    534     def __init__(self) -> None:
--> 535         super().__init__()
    536 
    537     def validate(

[/usr/local/lib/python3.10/dist-packages/pynhd/nhdplus_derived.py](https://localhost:8080/#) in __init__(self)
    508 
    509         url_vars = f"{self.base_url}/variable_info.csv"
--> 510         names = pd.read_csv(io.StringIO(ar.retrieve_text([url_vars])[0]))
    511         names["METRIC_NAME"] = names["METRIC_NAME"].str.replace(r"\[AOI\]|Slp[12]0", "", regex=True)
    512         names["SLOPE"] = [

[/usr/local/lib/python3.10/dist-packages/async_retriever/async_retriever.py](https://localhost:8080/#) in retrieve_text(urls, request_kwds, request_method, max_workers, cache_name, timeout, expire_after, ssl, disable, raise_status)
    500     '01646500'
    501     """
--> 502     return retrieve(
    503         urls,
    504         "text",

[/usr/local/lib/python3.10/dist-packages/async_retriever/async_retriever.py](https://localhost:8080/#) in retrieve(urls, read_method, request_kwds, request_method, max_workers, cache_name, timeout, expire_after, ssl, disable, raise_status)
    433     results = (loop.run_until_complete(session(url_kwds=c)) for c in chunked_reqs)
    434 
--> 435     resp = [r for _, r in sorted(tlz.concat(results))]
    436     if new_loop:
    437         loop.close()

[/usr/local/lib/python3.10/dist-packages/async_retriever/async_retriever.py](https://localhost:8080/#) in <genexpr>(.0)
    431     chunked_reqs = tlz.partition_all(max_workers, inp.url_kwds)
    432     loop, new_loop = utils.get_event_loop()
--> 433     results = (loop.run_until_complete(session(url_kwds=c)) for c in chunked_reqs)
    434 
    435     resp = [r for _, r in sorted(tlz.concat(results))]

[/usr/local/lib/python3.10/dist-packages/nest_asyncio.py](https://localhost:8080/#) in run_until_complete(self, future)
     96                 raise RuntimeError(
     97                     'Event loop stopped before Future completed.')
---> 98             return f.result()
     99 
    100     def _run_once(self):

[/usr/lib/python3.10/asyncio/futures.py](https://localhost:8080/#) in result(self)
    199         self.__log_traceback = False
    200         if self._exception is not None:
--> 201             raise self._exception.with_traceback(self._exception_tb)
    202         return self._result
    203 

[/usr/lib/python3.10/asyncio/tasks.py](https://localhost:8080/#) in __step(***failed resolving arguments***)
    230                 # We use the `send` method directly, because coroutines
    231                 # don't have `__iter__` and `__next__` methods.
--> 232                 result = coro.send(None)
    233             else:
    234                 result = coro.throw(exc)

[/usr/local/lib/python3.10/dist-packages/async_retriever/async_retriever.py](https://localhost:8080/#) in async_session_with_cache(url_kwds, read, r_kwds, request_method, cache_name, timeout, expire_after, ssl, raise_status)
    233             for uid, url, kwds in url_kwds
    234         )
--> 235         return await asyncio.gather(*tasks)  # pyright: ignore[reportGeneralTypeIssues]
    236 
    237 

[/usr/lib/python3.10/asyncio/tasks.py](https://localhost:8080/#) in __wakeup(self, future)
    302     def __wakeup(self, future):
    303         try:
--> 304             future.result()
    305         except BaseException as exc:
    306             # This may also be a cancellation.

[/usr/lib/python3.10/asyncio/tasks.py](https://localhost:8080/#) in __step(***failed resolving arguments***)
    230                 # We use the `send` method directly, because coroutines
    231                 # don't have `__iter__` and `__next__` methods.
--> 232                 result = coro.send(None)
    233             else:
    234                 result = coro.throw(exc)

[/usr/local/lib/python3.10/dist-packages/async_retriever/_utils.py](https://localhost:8080/#) in retriever(uid, url, s_kwds, session, read_type, r_kwds, raise_status)
     83         except (ClientResponseError, ValueError) as ex:
     84             if raise_status:
---> 85                 raise ServiceError(await response.text(), str(response.url)) from ex
     86             return uid, None
     87 

[/usr/local/lib/python3.10/dist-packages/aiohttp/client_reqrep.py](https://localhost:8080/#) in text(self, encoding, errors)
   1146             encoding = self.get_encoding()
   1147 
-> 1148         return self._body.decode(  # type: ignore[no-any-return,union-attr]
   1149             encoding, errors=errors
   1150         )

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 31378: invalid start byte

Anything else we need to know?

No response

Environment

SYS INFO
--------
commit: None
python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
python-bits: 64
OS: Linux
OS-release: 6.1.58+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')

PACKAGE VERSION
-------------------------------
async-retriever 0.16.0
pygeoogc 0.16.1
pygeoutils 0.16.1
py3dep N/A
pynhd 0.16.2
pygridmet N/A
pydaymet N/A
hydrosignatures 0.16.0
pynldas2 N/A
pygeohydro 0.16.0
aiohttp 3.9.3
aiohttp-client-cache 0.11.0
aiosqlite 0.20.0
cytoolz 0.12.3
ujson 5.9.0
defusedxml 0.7.1
joblib 1.3.2
multidict 6.0.5
owslib 0.30.0
pyproj 3.6.1
requests 2.31.0
requests-cache 1.2.0
shapely 2.0.3
url-normalize 1.4.3
urllib3 2.0.7
yarl 1.9.4
geopandas 0.13.2
netcdf4 1.6.5
numpy 1.25.2
rasterio 1.3.9
rioxarray 0.15.3
scipy 1.11.4
xarray 2023.7.0
click 8.1.7
pyflwdir N/A
networkx 3.2.1
pyarrow 14.0.2
folium 0.14.0
h5netcdf 1.3.0
matplotlib 3.7.1
pandas 2.0.3
numba 0.58.1
bottleneck N/A
py7zr N/A
pyogrio N/A
-------------------------------

cheginit commented 7 months ago

Thanks for reporting the issue.

The issue seems to be related to how the encoding of a CSV file from the StreamCat dataset is handled. I pushed a fix to pynhd and tested it with your example; it works now. The fix will be available in the next release (hopefully by the end of next week). In the meantime, you can install pynhd from git in your working environment:

pip install --no-deps git+https://github.com/hyriver/pynhd
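
For anyone who hits the same traceback, here is a minimal sketch of the general technique for this class of error: decode the raw bytes as UTF-8 and fall back to Latin-1 when that fails. This is not the actual change made in pynhd. The helper name decode_with_fallback and the sample payload are made up for illustration. The byte 0xb0 from the error message is the degree sign in Latin-1, which is why the sample uses it.

```python
import csv
import io


def decode_with_fallback(raw: bytes, encodings=("utf-8", "latin-1")) -> str:
    """Hypothetical helper: return the first successful decode of `raw`."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Unreachable with the defaults (Latin-1 accepts every byte), but kept
    # so that a custom list of encodings fails loudly.
    raise ValueError(f"could not decode payload with any of {encodings}")


# Made-up payload, not real StreamCat data: a lone 0xb0 byte is not a valid
# UTF-8 start byte, but it is the degree sign in Latin-1.
raw = b"METRIC_NAME,UNITS\nTmean,\xb0C\n"

text = decode_with_fallback(raw)
rows = list(csv.DictReader(io.StringIO(text)))
print(rows)
```

The fallback order matters: Latin-1 never raises, so it has to come last, and it will silently mis-decode files that are really in some other encoding.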

qideng7 commented 7 months ago

Perfect! Thank you!