Closed gzt5142 closed 1 year ago
I have a jupyter notebook with a fuller example, if that would help....
It seems that the issue is related to the way strings are treated in xarray. For compliance with netCDF, strings are stored as fixed-length unicode types. So, the only workaround that I can think of for your case is to use the max function to get the widest dtype and cast both variables to it. For example:
# NumPy orders fixed-width string dtypes by width, so max() picks the wider one
dtype = max(site_B['station_nm'].dtype, site_A['station_nm'].dtype)
site_A['station_nm'] = site_A['station_nm'].astype(dtype)
site_B['station_nm'] = site_B['station_nm'].astype(dtype)
assert site_A['station_nm'].dtype == site_B['station_nm'].dtype
You can do the same for any other string variables that have this issue.
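The two-dataset pattern above generalizes to any number of datasets and variables. A minimal sketch, assuming xarray is available (the helper name `unify_string_dtypes` is my own, not part of xarray or pygeohydro):

```python
import xarray as xr

def unify_string_dtypes(datasets, name):
    """Cast variable `name` in every dataset to the widest string dtype found."""
    # NumPy orders fixed-width string dtypes by width, so max() picks the widest
    widest = max(ds[name].dtype for ds in datasets)
    return [ds.assign({name: ds[name].astype(widest)}) for ds in datasets]

# Illustrative data only -- not real NWIS output
site_A = xr.Dataset({"station_nm": ("station", ["Some Long Station Name"])})
site_B = xr.Dataset({"station_nm": ("station", ["Short"])})
site_A, site_B = unify_string_dtypes([site_A, site_B], "station_nm")
assert site_A["station_nm"].dtype == site_B["station_nm"].dtype
```

This only helps when all datasets are in hand at cast time, which is the limitation discussed below.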
Thx for the info... In my particular situation, I don't know what the max length is until I poll every gage. Only then could I set the string length encoding in the zarr store.
It looks like this is a nuance specific to my use case, so I will implement a workaround with typecasting. I will arbitrarily set up the store with a type of '<U64' for station name in the hope that there is not a gage with a name over that length... then typecast to '<U64' when writing to the store.
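That plan can be sketched as a normalization step applied to each fetched dataset before appending to the store. This is a sketch under the assumptions in this thread: the '<U64' bound for station_nm, the '<U10' bound for alt_datum_cd, and the helper name are all mine, not pygeohydro API:

```python
import xarray as xr

# Assumed fixed-width schema for the zarr store's string variables.
# '<U64' for station_nm is the arbitrary bound chosen above; '<U10' for
# alt_datum_cd is an illustrative guess.
STRING_SCHEMA = {"station_nm": "<U64", "alt_datum_cd": "<U10"}

def normalize_strings(ds):
    """Cast each string variable to the store's fixed-width dtype.

    Note: numpy silently truncates values longer than the target width.
    """
    casts = {name: ds[name].astype(dtype)
             for name, dtype in STRING_SCHEMA.items() if name in ds}
    return ds.assign(casts)

# Illustrative data only -- not real NWIS output
site = xr.Dataset({"station_nm": ("station", ["Gage at Some River"]),
                   "alt_datum_cd": ("station", ["NAVD88"])})
site = normalize_strings(site)
```

Every worker would run `normalize_strings` on its dataset before appending, so all writes share one dtype regardless of the strings a given request happens to return.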
To be conservative, you can download the Excel file of the GAGES-II dataset from here, so you can get the max length of the station name field in the whole dataset.
That's really helpful... thanks for the pointer. This doesn't seem like it needs to be an issue for pygeohydro/NWIS -- I'll figure the workaround on my end. Thanks again for your input and help.
Sure, good luck!
What happened: Repeated calls to get_streamflow() returning an xarray Dataset have different dtypes for some fields (notably, strings).

What you expected to happen: The returned encodings/schema would be consistent across all calls, and match the internal schema of the NWIS database from which the data is fetched.
Minimal Complete Verifiable Example:
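The original example code is not preserved here. The following self-contained sketch reproduces the dtype mismatch with plain xarray, no network call, mimicking what two get_streamflow() responses can look like (the variable values are illustrative assumptions, not real NWIS output):

```python
import xarray as xr

# Two "responses" whose alt_datum_cd values happen to differ in length
site_A = xr.Dataset({"alt_datum_cd": ("station", ["NAVD88"])})  # 6 chars -> '<U6'
site_B = xr.Dataset({"alt_datum_cd": ("station", ["X"])})       # 1 char  -> '<U1'

# Same field, different dtypes: appending site_B to a zarr store
# templated from site_A fails.
print(site_A["alt_datum_cd"].dtype, site_B["alt_datum_cd"].dtype)  # -> <U6 <U1
```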
Anything else we need to know?: This has come up for me as I try to fetch streamflow data one gage at a time as part of a parallelized workflow -- each worker fetches one streamgage, manipulates it, then appends to a common dataset (in my case, a zarr store). The common zarr store was templated using NWIS.get_streamflow() data, which established the 'standard' dtypes.

The dtypes for these particular fields (station_nm and alt_datum_cd) are unicode strings, with the length of the string (and the dtype) being that of the returned data for a given request. That is, the dtype for Site_A's alt_datum_cd (above) is '<U6' because the data happens to be 6 chars for that gage. For Site_B's alt_datum_cd, the dtype is '<U1'. It isn't just that the string is shorter; the dtype is different, which causes the zarr write to fail.

I can work around this by re-casting in the case of these two strings:

But in the case of the station name field, I don't know what the max length might be in the database. I can cast to '<U46' (the dtype for Site_A's station_nm), but other gages may have longer names, which will be truncated when cast to this dtype.
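The re-cast described above can be sketched as follows, using a previously-seen dataset as the width template (the data values are illustrative, not real NWIS output):

```python
import xarray as xr

site_A = xr.Dataset({"alt_datum_cd": ("station", ["NAVD88"])})  # '<U6'
site_B = xr.Dataset({"alt_datum_cd": ("station", ["X"])})       # '<U1'

# Re-cast site_B's field to site_A's (wider) dtype so both match
site_B["alt_datum_cd"] = site_B["alt_datum_cd"].astype(site_A["alt_datum_cd"].dtype)
assert site_A["alt_datum_cd"].dtype == site_B["alt_datum_cd"].dtype
```

This works when the template's dtype is at least as wide as every value that will ever be written, which is exactly the guarantee station_nm lacks.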
It would be useful to have get_streamflow() return the same string encoding/dtype in all cases, so that separate calls can be treated identically.

Environment:
Output of pygeohydro.show_versions()