Closed adybbroe closed 4 years ago
This behavior is intended. Files are read-only and should be treated as such. The way you are using it is essentially using the file as a cache for your own changes. I'd say this is against the intended use of file handlers.
What attributes do you need cached?
While I agree that the files are read-only, the returned object is an `xr.DataArray`, which is modifiable. The problem @adybbroe had was that HDF5 references are not easily handled (they crashed on deepcopy), so we wanted to remove the references from the object with something like:
```python
for key in list(self['file_key'].attrs.keys()):
    if isinstance(self['file_key'].attrs[key], h5py.h5r.Reference):
        self['file_key'].attrs.pop(key)
return self['file_key']
```
While this is maybe a little clumsy (assigning `self['file_key']` to a variable would have been clearer), the fact is that the getitem call `self['file_key']` returns a new `xr.DataArray` object every time, with a fresh `attrs` dictionary.
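The fresh-`attrs` behaviour can be illustrated with a small stand-alone sketch (plain dicts stand in for satpy's actual `HDF5FileHandler` and `xr.DataArray`; the class name and keys here are made up for illustration):

```python
class FileHandlerSketch:
    """Hypothetical stand-in for a file handler whose __getitem__
    builds a fresh object from the raw file content on every call."""

    def __init__(self):
        # raw file content, treated as read-only
        self.file_content = {"file_key": {"PALETTE": "h5-reference"}}

    def __getitem__(self, key):
        # a new dict (standing in for a new DataArray with a fresh
        # .attrs dictionary) is created on every access
        return {"attrs": dict(self.file_content[key])}


fh = FileHandlerSketch()
fh["file_key"]["attrs"].pop("PALETTE")       # mutates a throw-away copy
print("PALETTE" in fh["file_key"]["attrs"])  # → True: the edit is gone
```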
The objective here is not to modify the file, but to clean up the dataset before it leaves the reader, as we probably do in other readers to make the data more streamlined.
What bothers me here is that the behaviour of `self['file_key']` isn't intuitive given how `__getitem__` usually works in Python. If we aren't supposed to modify the returned DataArray, then I would like an exception to be raised when the `.attrs` dictionary is modified, but I think it's clear such behaviour would cause more harm than it's worth. The way it is now, imho, is just prone to bugs in the long term.
If the file were opened read-only, modifying `.attrs` would cause an error. Or you would open it in read/write mode and any change to `.attrs` would affect the file. The assumption with the design is that you access a variable once and use it once. I realize that this isn't guaranteed or checked in the code, but I don't think it is an unreasonable assumption either.
In summary, I get your point on why it is the way it is, and it's sane. It's just not very intuitive imo.
In the subclass file handler that is being created for this specific reader, why is `self[var_name]` being accessed more than once? Why isn't it:
```python
data_arr = self[var_name]
# update attributes
return data_arr
```
A weakref to the DataArray or to the file's Variable object (or whatever h5py calls it)? Wouldn't that get garbage collected right away, since you are running into this issue precisely because the DataArray isn't being held on to in the subclass file handler?
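For what it's worth, the garbage-collection concern can be sketched in isolation (a plain class stands in for the returned DataArray; in CPython the object is reclaimed as soon as the last strong reference goes away):

```python
import weakref


class Holder:
    """Stand-in for a returned DataArray that nothing holds on to."""


obj = Holder()
ref = weakref.ref(obj)
print(ref() is obj)   # → True while a strong reference exists
del obj
print(ref() is None)  # → True (CPython): collected once the strong ref is gone
```

So a weakref-based cache in the file handler would not keep the cleaned-up object alive between `get_dataset` calls.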
Regarding "the user can work with": is the "user" in this case the `Scene` user or the reader developer? A `Scene` user should not be able to do `scn[ds].attrs['new_attribute'] = 'X'` and affect the file handler's data underneath.
The `Scene` user, and yes, that's what I meant by decoupling the file from the dataset. My objective here is to make the code more intuitive. I guess you won't like this, but how about replacing `__getitem__` with a `get` method, so that we remove the square brackets? Or am I the only one expecting a syntax like `bla[something]` to return the same object every time?
Technically `bla[something]` does return the same object every time 😄
If you were going to replace it then it should be replaced with `get_data_array` or something similar, since `get` has an existing purpose in Python. That said, you and @adybbroe may be the only ones thinking about it this way. I think of `self[key]` as accessing the underlying file, and I think of the result as a copy of the contents of the file. It should be either a copy of what's in the file or a read-only version of the file content. Since the read-only version isn't possible with xarray (without subclassing), it is a copy of the file content.
Maybe I've just gotten used to writing readers that only access `self[key]` once and return the contents. Also keep in mind that `get_variable` (or whatever it would be called) would have to avoid name collisions with common names that reader developers might use, since we're dealing with subclasses.
Oh! Well, the thing is that I am reading HDF5 files that use HDF5 references, and currently these cause some headaches further down the chain, resampling for instance. So, in my reader I am removing those attributes. Here is a piece of code that works with the current version:
```python
def __init__(self, filename, filename_info, filetype_info):
    """Init method."""
    super(Hdf5NWCSAF, self).__init__(filename, filename_info,
                                     filetype_info)
    self.cache = {}

def get_dataset(self, dataset_id, ds_info):
    """Load a dataset."""
    file_key = ds_info.get('file_key', dataset_id.name)
    data = self[file_key]
    if 'SCALING_FACTOR' in data.attrs and 'OFFSET' in data.attrs:
        dtype = data.data.dtype
        data.data = (data.data * data.attrs['SCALING_FACTOR']
                     + data.attrs['OFFSET']).astype(dtype)
    for key in list(data.attrs.keys()):
        val = data.attrs[key]
        if isinstance(val, h5py.h5r.Reference):
            del data.attrs[key]
    return data
```
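The reference-stripping loop above could also be factored into a small helper; this is just a sketch (`strip_references` and `FakeRef` are made-up names, with `FakeRef` standing in for `h5py.h5r.Reference` so the example runs without h5py):

```python
def strip_references(attrs, reference_types):
    """Return a copy of attrs without values of the given reference types."""
    return {key: val for key, val in attrs.items()
            if not isinstance(val, reference_types)}


class FakeRef:
    """Stand-in for h5py.h5r.Reference."""


attrs = {"SCALING_FACTOR": 0.1, "OFFSET": 0.0, "PALETTE": FakeRef()}
clean = strip_references(attrs, FakeRef)
print(sorted(clean))  # → ['OFFSET', 'SCALING_FACTOR']
```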
At first I did:
```python
for key in list(self[file_key].attrs.keys()):
    val = self[file_key].attrs[key]
    if isinstance(val, h5py.h5r.Reference):
        del self[file_key].attrs[key]
```
...and couldn't understand why I didn't manage to remove the attribute. It took Martin and me some time to figure out why...
Oh (again), I didn't see that you had already discussed this quite far with @mraspaud.
@adybbroe What are the names of the attributes that are problems? They may be NetCDF files like the OMPS files I was dealing with.
@djhoese Here is the draft PR where this was an issue:
https://github.com/pytroll/satpy/pull/886
Not sure I understand your question. The data I am reading is an HDF5 file (NWCSAF/MSG formatted cloud products). The attribute that is a reference is named `PALETTE`, and it is needed for hdfview to generate a nice-looking image (applying the color palette to the dataset) of, say, the cloud pressure.
Ah ok, then it probably isn't. I mentioned this in the satpy issue about OMPS. Basically, NetCDF4 files, which are a form of HDF5 files, add special attributes for figuring out dimensions and related things. If you read a NetCDF4 file as an HDF5 file then these attributes show up as References like the ones you are seeing. This was happening for new OMPS files because they named their NetCDF4 files with a `.h5` extension. The reader expected these to be formatted the same as the older HDF5-format files, but they are not. The end result was a serialization error (`SystemError` iirc) during resampling, like you are seeing.
And @djhoese, I also better understand now the motivation for the current behaviour. But I am afraid I am not the only one who will fall into this trap again. In my code above I am changing `data`, which points to `self[file_key]`, but `self[file_key]` doesn't change...
@djhoese: What do you mean by "Ah ok, then it probably isn't"?
Then it probably isn't a NetCDF file being read as an HDF5 file.
For your code, why does it matter that `self[file_key]` isn't being updated? You already have access to `data`, you update it (remove the bad attributes), then return it. When/why is `self[file_key]` (for that specific `file_key`) being accessed again? Meaning, is there a method other than `get_dataset` that is loading that dataset?
@djhoese I think I am quite happy with the code as it is for my case. I was just saying that the behaviour of `__getitem__` fooled me here, so I could make the same mistake in the future. But I think my code as it is now is fine, and better than the (clumsy) example I showed above that didn't work.
And @djhoese, it is an HDF5 (not NetCDF4) file for sure! I was once part of the development of that format, using a library we created at SMHI ages ago: HL-HDF (Higher Level HDF) :-) The files I am reading were generated with HL-HDF and the HDF5 library (C code).
**Describe the bug**
I was creating a reader which is based on (inherits from) the `HDF5FileHandler` when we encountered a problem manipulating the attributes of `self[file_key]`. The next time I called `self[file_key]`, the changes made to the attributes were gone (overwritten). The problem appeared to be that the `__getitem__` method of the class `HDF5FileHandler` didn't store the new value in `self.file_content` before returning. Probably this should be changed so that the value is stored back.
Agree?
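A minimal sketch of the difference, using plain dicts in place of satpy's actual classes (the class names and the cached-value shape are invented for illustration only):

```python
class NoCache:
    """Current behaviour: __getitem__ builds a fresh copy on every call."""

    def __init__(self, file_content):
        self.file_content = file_content

    def __getitem__(self, key):
        return {"attrs": dict(self.file_content[key]["attrs"])}


class WithCache(NoCache):
    """Proposed behaviour: store the built value back into file_content
    so later accesses see earlier modifications."""

    def __getitem__(self, key):
        val = super().__getitem__(key)
        self.file_content[key] = val
        return val


nc = NoCache({"file_key": {"attrs": {"PALETTE": "ref"}}})
nc["file_key"]["attrs"].pop("PALETTE")
print("PALETTE" in nc["file_key"]["attrs"])  # → True: edit lost

wc = WithCache({"file_key": {"attrs": {"PALETTE": "ref"}}})
wc["file_key"]["attrs"].pop("PALETTE")
print("PALETTE" in wc["file_key"]["attrs"])  # → False: edit persists
```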