pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

`numpy` 2 compatibility in the `netcdf4` and `h5netcdf` backends #9136

Closed · keewis closed this 2 months ago

keewis commented 3 months ago

netcdf4 didn't have numpy 2 compatible builds until its most recent release, so we couldn't test this before. This may reveal a couple of issues.

keewis commented 3 months ago

@kmuehlbauer, would you have time to look into the two remaining failing tests? From what I can tell, this has been emitting a DeprecationWarning for quite a while and raises an OverflowError in numpy>=2. I fear I don't know enough about our encoding operations to debug this.

(in case you know what the issue is, feel free to push a fix to this PR)

kmuehlbauer commented 3 months ago

@keewis Yes, I'll have a closer look tomorrow.

kmuehlbauer commented 3 months ago

Adding the error log for reference:

FAILED xarray/tests/test_conventions.py::TestCFEncodedDataStore::test_roundtrip_mask_and_scale[dtype0-create_unsigned_masked_scaled_data-create_encoded_unsigned_masked_scaled_data] - OverflowError: Failed to decode variable 'x': Python integer -1 out of bounds for uint8
FAILED xarray/tests/test_conventions.py::TestCFEncodedDataStore::test_roundtrip_mask_and_scale[dtype1-create_unsigned_masked_scaled_data-create_encoded_unsigned_masked_scaled_data] - OverflowError: Failed to decode variable 'x': Python integer -1 out of bounds for uint8
kmuehlbauer commented 3 months ago

OK, that looks like an error in the combination of packed data and the _Unsigned attribute. I'm currently lacking an upstream-dev environment on my laptop, and WiFi is bad while traveling.

Xref: https://docs.unidata.ucar.edu/nug/current/best_practices.html#bp_Unsigned-Data plus the following section on packed data values.

kmuehlbauer commented 3 months ago

@keewis I have a local reproducer now, looking into this later today.

kmuehlbauer commented 3 months ago

Citing from the above link: NetCDF-3 does not have unsigned integer primitive types.

The _Unsigned attribute was introduced to notify any reading application that the signed data on disk should be treated as unsigned in memory. Nevertheless, the _FillValue needs to be of the on-disk (signed) type.

The implementation is somewhat hacky, as we also have to take scale/offset into account. My current proposal is to use the correct _FillValue within the test and to use a view instead of a cast for the transformation from uint -> int.
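
For illustration, a minimal sketch of the cast vs. view distinction (assuming numpy >= 2 scalar-conversion semantics; the dtypes and values are just examples):

import numpy as np

# A value cast rejects out-of-range Python ints under numpy >= 2, which is
# where the OverflowError in the failing tests comes from:
#   np.uint8(-1)  ->  OverflowError: Python integer -1 out of bounds for uint8
# A view instead reinterprets the same byte pattern in the other dtype:
print(np.int8(-1).view(np.uint8))   # 255
print(np.uint8(255).view(np.int8))  # -1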

keewis commented 3 months ago

thanks for looking into this and fixing it, @kmuehlbauer (and so quickly, too)!

Just so I understand the implications of using view: we will only ever encounter numpy (or cupy, which aims to be a drop-in replacement) arrays here, right?

Edit: the failing upstream-dev CI is unrelated, that's a scipy deprecation.

kmuehlbauer commented 3 months ago

Now that you ask that, I'm a bit unsure.

_FillValue is an attribute here, a scalar, so no arrays are involved. To transform from uint to int, I've wrapped it in a numpy scalar to use view and assigned it back to the attribute. That should work, but there might be another (better) solution I'm missing.
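
For illustration, the kind of scalar round-trip being described might look like this (the dtypes here are just an example):

import numpy as np

attrs = {"_FillValue": 255}  # unsigned, in-memory representation
# wrap in a numpy scalar, reinterpret the bits as the signed on-disk dtype,
# and assign the plain Python value back to the attribute
attrs["_FillValue"] = np.uint8(attrs["_FillValue"]).view(np.int8).item()
print(attrs["_FillValue"])  # -1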

keewis commented 3 months ago

revisiting this, I believe this is fine, though it does fail if, for example, the data is backed by a cupy array. I couldn't get that to work even with main and numpy<2, so maybe this doesn't matter. @dcherian, any opinions? Otherwise this should be ready to merge (edit: and really, this is blocking a bunch of PRs, so merging sooner rather than later would be great).

kmuehlbauer commented 3 months ago

@keewis I'm not sure if my fix is the ultimate solution. I'm working on a better-fitting solution today.

keewis commented 3 months ago

well, feel free to improve/modify as you see fit!

kmuehlbauer commented 3 months ago

UnsignedIntegerCoder again:

That's a really nasty thing. This is what netcdf4-python tells us about it:

In addition, if maskandscale is set to True, and if the variable has an attribute _Unsigned set to "true", and the variable has a signed integer data type, a view to the data is returned with the corresponding unsigned integer data type. This convention is used by the netcdf-java library to save unsigned integer data in NETCDF3 or NETCDF4CLASSIC files (since the NETCDF3 data model does not have unsigned integer data types).

So I think that a view is probably the right thing to do at that location (I can't think of anything better). I also had to tweak the encoding side of things to roundtrip correctly. I've added another test to make sure everything works as expected with plain _Unsigned="true".
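
For illustration, a minimal sketch of the decode-side reinterpretation described above (the dtype flip here is hand-rolled, not xarray's actual helper):

import numpy as np

disk_data = np.array([0, -1, -2], dtype="int8")            # signed bytes as stored in a NETCDF3 file
unsigned_dtype = np.dtype(f"u{disk_data.dtype.itemsize}")  # int8 -> uint8
in_memory = disk_data.view(unsigned_dtype)                 # same bytes, read as unsigned
print(in_memory)                                           # [  0 255 254]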

keewis commented 3 months ago

the failing doctests CI will be fixed by #9177 (no idea about mypy). So if the implementation of the decoder fix is fine, this should finally be ready for merging.

dcherian commented 2 months ago

Presumably the numpy version in the mypy env changed? I'm OK with addressing that later. Shall we merge?

keewis commented 2 months ago

the mypy environment uses the standard environment, which means that unpinning here surfaces the issues in that CI as well. Merging sounds good to me (I just didn't want to merge my own PR).

dcherian commented 2 months ago

I just didn't want to merge my own PR

I think we should all get a little more comfortable doing this. It's immeasurably worse for our user base to leave finished PRs hanging around and unreleased.

djhoese commented 2 months ago

FYI this PR seems to break the case where _FillValue is not the same type as the array. I haven't completely tracked it down, but I'm getting failures in my unstable CI with:

            if "_FillValue" in attrs:
                new_fill = np.array(attrs["_FillValue"])
                # use view here to prevent OverflowError
>               attrs["_FillValue"] = new_fill.view(signed_dtype).item()
E               ValueError: Changing the dtype of a 0d array is only supported if the itemsize is unchanged
djhoese commented 2 months ago

Here's a reproducer:

import xarray as xr
import numpy as np

v = xr.Variable(("y", "x"), np.zeros((10, 10), dtype=np.float32), attrs={"_FillValue": -1}, encoding={"_Unsigned": "true", "dtype": "int16", "zlib": True})

uic = xr.coding.variables.UnsignedIntegerCoder()

uic.encode(v, "test")
# ValueError: Changing the dtype of a 0d array is only supported if the itemsize is unchanged

Basically, I'm passing 32-bit float data, so signed_dtype becomes int32 in the encode method. So np.array(-1).view(np.int32) is not allowed, as np.array(-1) is a 64-bit integer. Even if I pass the fill value as an int16 (matching my output encoding), it still fails because it tries to use a signed_dtype of int32 to match the 32-bit float I'm passing in.

Edit: If I pass fill value as np.array(-1, dtype=np.int32) then it seems to work.
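
For reference, a quick illustration of the itemsize mismatch described above (assuming a platform where the default integer is 64-bit):

import numpy as np

signed_dtype = np.dtype("int32")  # what encode derives from float32 input data

# np.array(-1).view(signed_dtype)  # ValueError: 0-d int64 -> int32 changes the itemsize
print(np.array(np.int32(-1)).view(signed_dtype).item())  # -1; a dtype-matched scalar keeps the itemsize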

kmuehlbauer commented 2 months ago

This works for me on the latest main branch.

import xarray as xr
import numpy as np
fillvalue = np.int16(-1)
v = xr.Variable(("y", "x"), np.zeros((10, 10), dtype=np.float32), attrs={"_FillValue": fillvalue}, encoding={"_Unsigned": "true", "dtype": "int16", "zlib": True})
uic = xr.coding.variables.UnsignedIntegerCoder()
uic.encode(v, "test")

@djhoese Are you writing to NETCDF3 or NETCDF4_CLASSIC? You might also just skip the "_Unsigned" attribute and use uint16 as is when writing to NETCDF4.

djhoese commented 2 months ago

@kmuehlbauer The netcdf files being produced are NetCDF4 and are being ingested by a third-party Java application so using unsigned types directly isn't allowed (Java doesn't have unsigned types).

djhoese commented 2 months ago

It seems I can pass the fill value as np.array(65535, dtype=np.uint16), but then it shows up in the resulting Variable's attrs as 65535. I can also pass plain 65535. I haven't tested what this ends up as in a final NetCDF file.

Edit: As far as I can find the _FillValue is always supposed to match the dtype of the packed on-disk variable. Specifying 65535 makes this not true.

djhoese commented 2 months ago

Ok, I've made a PR for Satpy that I think fixes my use case. I think the main misunderstanding here is that someone used to writing NetCDF files outside of xarray, and possibly familiar with the CF standards, will expect _FillValue to match the dtype of the on-disk/in-file variable. Xarray assumes here that _FillValue is of the in-memory data's type, or at least that's what I'm telling myself. My workaround is to pass the -1 as its unsigned equivalent (65535), and then xarray does the conversion to the on-disk type (signed int) "for me".

In my opinion the fix in this PR is incorrect, but I feel like I'm missing the use case for why it needed to be changed in the first place, so I'm not sure I can suggest something better.

You mentioned above that the old code triggered overflow errors under numpy 2, but if the user specifies _FillValue as a value compatible with the on-disk dtype, that should never have happened, right?

kmuehlbauer commented 2 months ago

Providing the signed int16 equivalent (-1) as _FillValue should work, as in my example above. Does this not work?

kmuehlbauer commented 2 months ago

This seems to work perfectly fine:

import xarray as xr
import numpy as np
fillvalue = np.uint16(65535)
v = xr.Variable(("y", "x"), np.zeros((10, 10), dtype=np.float32), attrs={"_FillValue": fillvalue}, encoding={"_Unsigned": "true", "dtype": "int16", "zlib": True})
uic = xr.coding.variables.UnsignedIntegerCoder()
uic.encode(v, "test")
<xarray.Variable (y: 10, x: 10)> Size: 200B
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int16)
Attributes:
    _FillValue:  -1
    _Unsigned:   true

You can specify _FillValue either as uint or int; it will be converted to the packed dtype (int) using .view (this was introduced in this PR, since numpy 2 now errors when just trying to cast).

djhoese commented 2 months ago

Yes, you're right, this works. I missed that that's what you were doing in your code. However, I'd say this isn't what a user would expect. I don't think I should have to specify a numpy scalar for _FillValue; regular Python integers should work too.

djhoese commented 2 months ago

@kmuehlbauer Further up you had said that the change to the new view logic was to avoid overflow complaints from numpy 2. In those cases, what was being provided as the _FillValue?

djhoese commented 2 months ago

@kmuehlbauer I have another case I just discovered after working around the -1 case as discussed. I have a dataset coming from somewhere else that is uint8, where a value of 1 is the fill value. The variable needs to be stored as a signed int8 with _Unsigned set, to make the third-party application happy.

In [6]: v = xr.Variable(("y", "x"), np.zeros((10, 10), dtype=np.uint8), attrs={"_FillValue": 1}, encoding={"_Unsigned": "true", "dtype": "int8", "zlib": True})

In [7]: uic.encode(v, "test")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 1
----> 1 uic.encode(v, "test")

File ~/miniforge3/envs/satpy_py312/lib/python3.12/site-packages/xarray/coding/variables.py:524, in UnsignedIntegerCoder.encode(self, variable, name)
    522     new_fill = np.array(attrs["_FillValue"])
    523     # use view here to prevent OverflowError
--> 524     attrs["_FillValue"] = new_fill.view(signed_dtype).item()
    525     # new_fill = signed_dtype.type(attrs["_FillValue"])
    526     # attrs["_FillValue"] = new_fill
    527 data = duck_array_ops.astype(duck_array_ops.around(data), signed_dtype)

ValueError: Changing the dtype of a 0d array is only supported if the itemsize is unchanged

So the above fails because new_fill = np.array(attrs["_FillValue"]) takes a Python native integer and makes it an np.int64. The .view(signed_dtype) logic then fails because it is trying to view a 64-bit integer as an 8-bit one.

Note: I copied the code from this PR into my stable xarray environment so I could swap between the two behaviors quickly.

kmuehlbauer commented 2 months ago

@djhoese It unfortunately looks like I did not take all possibilities into account 😢. I have to admit that the CF coding issues are getting increasingly annoying :roll_eyes:.

I'm currently traveling and may not have time to dedicate to this for the next 2 weeks. If you or others see immediate fixes for this, please go ahead.

djhoese commented 2 months ago

I'll see what I can do. I think the issue I've brought up can be summarized as: _FillValue should be able to be a Python native integer or a numpy scalar. I think the main reason it isn't working is the np.array(attrs["_FillValue"]) call, because it takes a Python native integer and turns it into an int64 most of the time.
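
One possible direction, sketched here purely for illustration (not the code in this PR or any follow-up): create the fill value in the target dtype via astype, whose default unsafe casting wraps out-of-range values instead of raising, so both spellings of the fill value and plain Python integers come through:

import numpy as np

signed_dtype = np.dtype("int8")  # the on-disk dtype, e.g. from encoding["dtype"]

# the signed (-1) and unsigned (255) spellings both end up as -1, and a plain
# Python int like 1 passes through unchanged
for fill in (1, -1, 255, np.uint8(255)):
    print(fill, "->", np.asarray(fill).astype(signed_dtype).item())

Whether silently wrapping is acceptable, rather than rejecting fill values that fit neither representation, is a separate design question.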

djhoese commented 2 months ago

@kmuehlbauer and others, a question for you as I work on "fixing" this: is the roundtrip the only thing that matters? Or can xarray accept input and assume it knows best about what the user wanted? Especially in this _Unsigned case, I'm not sure there is a clear answer. Does the user always provide a _FillValue of the on-disk output type? Or is xarray allowed to normalize any fill value to the on-disk type? If it did that normalization, wrote the data to a NetCDF file, and read it back in, then (I think) it wouldn't have the original _FillValue the user started with.

In the original create_unsigned_masked_scaled_data test helper function, the on-disk type was int8 (i1) and the _FillValue was 255 in the encoding dict. @kmuehlbauer changed that fill value to np.int8(-1), which makes sense if the encoding of a Dataset represents what will live on disk once the data is encoded... but what lives in .attrs["_FillValue"] when it is decoded/loaded from disk? This is confusing.

Case 1: I have 32-bit float data in memory. I want to write it to disk to fit in an 8-bit signed NetCDF variable with _Unsigned set to true. That signed integer data should use -1 to mark invalid data. So in any .encoding dictionaries, .encoding["_FillValue"] should be -1 to match the on-disk type. .attrs["_FillValue"] should also always be -1, right? Unless the user is reading poorly formatted data, the _FillValue should never be 255. Right?

Maybe I should get more sleep before dealing with CF.

dcherian commented 2 months ago

Seems like the dtype of _FillValue should match the dtype of the array on disk (http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#missing-data)

The scalar attribute with the name _FillValue and of the same type as its variable is recognized by the netCDF library as the value used to pre-fill disk space allocated to the variable. ... The missing values of a variable with scale_factor and/or add_offset attributes (see Section 8.1, "Packed Data") are interpreted relative to the variable’s external values (a.k.a. the packed values, the raw values, the values stored in the netCDF file), not the values that result after the scale and offset are applied.

djhoese commented 2 months ago

Yes, but then what is the point of the current decode logic? The existing logic converts _FillValue to the unsigned type that the data has been unpacked as (the in-memory version). Besides backwards compatibility, if _FillValue does not match the dtype of the in-memory unpacked data, then users can't do data == fill_value masking if they have turned auto-masking off.
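
For illustration, why the decoded _FillValue needs to match the unpacked (unsigned) dtype when masking by hand (assuming automatic masking is off; the values are just an example):

import numpy as np

data = np.array([0, -1, 5], dtype="int8").view("uint8")  # decoded _Unsigned data: [0, 255, 5]
print(data == np.int8(-1))    # [False False False] -- the on-disk signed fill value never matches
print(data == np.uint8(255))  # [False  True False] -- the converted unsigned fill value does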

kmuehlbauer commented 2 months ago

I've worked on the CF coding part in the past and found it not easy (huh). The idea was to have different coders for the different parts of the CF coding pipeline.

The problem is that some Coders need a priori knowledge from other Coders. Here, in the decoding path, the unsigned representation of _FillValue is needed by the subsequent CFMaskCoder.

We can't just run CFMaskCoder before UnsignedIntegerCoder as we would already transform to float there.

Maybe we should handle the complete pipeline for _Unsigned in one Coder?

dcherian commented 2 months ago

we should handle the complete pipeline for _Unsigned in one Coder?

This would be fine.
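
For illustration, a rough skeleton of what such a combined coder might look like (purely a sketch: the class name is made up, it assumes numpy-backed data and a signed on-disk _FillValue, and it is not xarray's implementation):

import numpy as np
import xarray as xr
from xarray.coding.variables import VariableCoder

class CombinedUnsignedCoder(VariableCoder):
    """Hypothetical coder handling _Unsigned together with its _FillValue."""

    def decode(self, variable, name=None):
        dims, data, attrs, encoding = (variable.dims, variable.data,
                                       dict(variable.attrs), dict(variable.encoding))
        if attrs.get("_Unsigned", "").lower() == "true" and data.dtype.kind == "i":
            encoding["_Unsigned"] = attrs.pop("_Unsigned")
            unsigned_dtype = np.dtype(f"u{data.dtype.itemsize}")
            data = data.view(unsigned_dtype)  # signed bytes on disk -> unsigned view in memory
            if "_FillValue" in attrs:
                # convert the (signed, on-disk) fill value so that a subsequent
                # mask coder sees it in the same dtype as the data
                fill = np.asarray(attrs["_FillValue"], dtype=f"i{unsigned_dtype.itemsize}")
                attrs["_FillValue"] = fill.view(unsigned_dtype).item()
        return xr.Variable(dims, data, attrs, encoding)

    def encode(self, variable, name=None):
        # the inverse transformation (unsigned view back to the signed on-disk
        # dtype plus the corresponding _FillValue conversion) would go here
        return variable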

djhoese commented 2 months ago

@kmuehlbauer But isn't the CFMaskCoder run first?

    for coder in [
        times.CFDatetimeCoder(),
        times.CFTimedeltaCoder(),
        variables.CFScaleOffsetCoder(),
        variables.CFMaskCoder(),
        variables.UnsignedIntegerCoder(),
        variables.NativeEnumCoder(),
        variables.NonStringCoder(),
        variables.DefaultFillvalueCoder(),
        variables.BooleanCoder(),
    ]:
        var = coder.encode(var, name=name)

I think I agree with the current behavior: the in-memory .attrs["_FillValue"] should match the in-memory array's dtype (uint in our examples so far). Once the Variable/DataArray is encoded, _FillValue in .attrs will match the on-disk dtype. It looks like _FillValue is never meant to appear in .encoding, so maybe I should just forget about that extra complication.

kmuehlbauer commented 2 months ago

@djhoese Yes, for encoding. For decoding it's the other way round.

djhoese commented 2 months ago

First try: https://github.com/pydata/xarray/pull/9258