hyriver / pygeohydro

A part of HyRiver software stack for accessing hydrology data through web services
https://docs.hyriver.io

get_streamflow() to_xarray inconsistent dtypes #102

Closed gzt5142 closed 1 year ago

gzt5142 commented 1 year ago

What happened: Repeated calls to get_streamflow() returning an xarray Dataset have different dtypes for some fields (notably, strings).

What you expected to happen: The returned encodings/schema would be consistent for all calls, and match the internal schema of the NWIS database from which the data is fetched.

Minimal Complete Verifiable Example:

from pygeohydro import NWIS

nwis = NWIS()
DATE_RANGE = ("2020-01-01", "2020-12-31")
site_A = nwis.get_streamflow('USGS-402114105350101', DATE_RANGE, to_xarray=True)
site_B = nwis.get_streamflow('USGS-02277600', DATE_RANGE, to_xarray=True)

assert site_A['station_nm'].dtype == site_B['station_nm'].dtype
## fails

assert site_A['alt_datum_cd'].dtype == site_B['alt_datum_cd'].dtype
## fails

Anything else we need to know?: This has come up for me as I try to fetch streamflow data one gage at a time as part of a parallelized workflow -- each worker fetches one streamgage, manipulates it, then appends to a common dataset (in my case, a zarr store). The common zarr store was templated using NWIS.get_streamflow() data, which established the 'standard' dtypes.

The dtypes for these particular fields (station_nm and alt_datum_cd) are fixed-width unicode strings, with the width (and therefore the dtype) determined by the data returned for a given request. That is, the dtype for site_A's alt_datum_cd (above) is '<U6' because the data happens to be 6 chars for that gage, while site_B's alt_datum_cd is '<U1'. It isn't just that the string is shorter; the dtype itself is different, which causes the zarr write to fail.
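The width-in-the-dtype behavior can be reproduced with plain numpy, independent of pygeohydro (the values below are illustrative, not actual NWIS responses):

```python
import numpy as np

# numpy sizes a fixed-width unicode dtype to the longest string in the array,
# so the same field can come back with a different dtype on every request.
a = np.array(["NGVD29"])  # 6 characters -> dtype('<U6')
b = np.array(["N"])       # 1 character  -> dtype('<U1')

assert a.dtype == np.dtype("<U6")
assert b.dtype == np.dtype("<U1")
assert a.dtype != b.dtype  # same field, different dtypes
```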

I can work around this by re-casting these two string fields:

import xarray as xr

site_B['alt_datum_cd'] = xr.DataArray(data=site_B['alt_datum_cd'].values.astype('<U6'), dims='gage_id')

But in the case of the station name field, I don't know what the max length might be in the database. I can cast to '<U46' (the dtype of site_A's station_nm), but other gages may have longer names, which would be silently truncated by that cast.
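The truncation risk is easy to demonstrate with a hypothetical over-long name; casting a unicode array to a narrower fixed-width dtype drops the tail without any error:

```python
import numpy as np

# Casting to a narrower fixed-width dtype silently truncates longer strings.
name = np.array(["X" * 60])   # a hypothetical 60-character station name
short = name.astype("<U46")   # no error, no warning
assert len(short[0]) == 46    # the last 14 characters are gone
```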

It would be useful to have get_streamflow() return the same string encoding/dtype in all cases, so that separate calls can be treated identically.

Environment:

Output of pygeohydro.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21) 
[GCC 10.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.181-99.354.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.12.2
libnetcdf: 4.8.1
aiodns: 3.0.0
aiohttp: 3.8.3
aiohttp-client-cache: 0.7.3
aiosqlite: 0.17.0
async-retriever: 0.3.6
bottleneck: 1.3.5
brotli: installed
cchardet: 2.1.7
click: 8.0.2
cytoolz: 0.12.0
dask: 2022.04.2
defusedxml: 0.7.1
folium: 0.13.0
geopandas: 0.11.1
lxml: 4.9.1
matplotlib: 3.4.3
netCDF4: 1.6.0
networkx: 2.8.7
numpy: 1.23.3
owslib: 0.27.2
pandas: 1.4.2
py3dep: 0.13.6
pyarrow: 9.0.0
pydantic: 1.10.2
pydaymet: 0.13.6
pygeohydro: 0.13.6
pygeoogc: 0.13.6
pygeos: 0.13
pygeoutils: 0.13.6
pynhd: 0.13.6
pyproj: 3.4.0
pytest: None
pytest-cov: None
rasterio: 1.3.2
requests: 2.28.1
requests-cache: 0.9.6
richdem: None
rioxarray: 0.12.2
scipy: 1.9.1
shapely: 1.8.4
tables: 3.7.0
ujson: 5.5.0
urllib3: 1.26.11
xarray: 2022.9.0
xdist: None
yaml: 5.4.1
gzt5142 commented 1 year ago

I have a jupyter notebook with a fuller example, if that would help....

cheginit commented 1 year ago

It seems that the issue is related to the way strings are treated in xarray. For compliance with netCDF, strings are stored as fixed-width unicode types. So, the only workaround that I can think of for your case is to use max to find the widest dtype and cast both to it. For example:

dtype = max(site_B['station_nm'].dtype, site_A['station_nm'].dtype)
site_A['station_nm'] = site_A['station_nm'].astype(dtype)
site_B['station_nm'] = site_B['station_nm'].astype(dtype)

assert site_A['station_nm'].dtype == site_B['station_nm'].dtype

You can do the same for any other string variables that have this issue.
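This can be generalized with a small helper that widens every unicode variable shared by the datasets to the widest dtype observed; a sketch (align_string_dtypes is a hypothetical name, not part of pygeohydro):

```python
import numpy as np
import xarray as xr

def align_string_dtypes(*datasets: xr.Dataset) -> list:
    """Cast every unicode variable shared by all datasets to the widest dtype seen."""
    out = [ds.copy() for ds in datasets]
    # only touch variables present in every dataset
    common = set.intersection(*(set(ds.data_vars) for ds in datasets))
    for name in common:
        dtypes = [ds[name].dtype for ds in datasets]
        if all(dt.kind == "U" for dt in dtypes):
            widest = max(dtypes)  # unicode dtypes order by width under safe casting
            for ds in out:
                ds[name] = ds[name].astype(widest)
    return out
```

This only helps when all the datasets are in hand at once; it does not solve the append-to-zarr case where future gages are unknown.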

gzt5142 commented 1 year ago

Thx for the info... In my particular situation, I don't know what the max length is until I poll every gage. Only then could I set the string length encoding in the zarr store.

It looks like this is a nuance specific to my use case, so I will implement a workaround with typecasting. I will arbitrarily set up the store with a dtype of '<U64' for the station name, in the hope that no gage has a name longer than that, then typecast to '<U64' when writing to the store.

cheginit commented 1 year ago

To be conservative, you can download the Excel file of the GAGES-II data set from here, so you can get the max length of the station name field in the whole dataset.

gzt5142 commented 1 year ago

> To be conservative, you can download the Excel file of the GAGES-II data set from here, so you can get the max length of the station name field in the whole dataset.

That's really helpful... thanks for the pointer. This doesn't seem like it needs to be an issue for pygeohydro/NWIS -- I'll figure the workaround on my end. Thanks again for your input and help.

cheginit commented 1 year ago

Sure, good luck!