Closed adybbroe closed 4 years ago
This behavior is intended. Files are read-only and should be treated as such. The way you are using it is essentially using the file as a cache for your own changes. I'd say this is against the intended use of file handlers.
What attributes do you need cached?
While I agree that the files are read-only, the returned object is an `xr.DataArray`, which is modifiable. The problem @adybbroe had was that HDF5 references are not easily handled (they crashed on deepcopy), so we wanted to remove the references from the object with something like:
```python
for key in list(self['file_key'].attrs.keys()):
    if isinstance(self['file_key'].attrs[key], h5py.h5r.Reference):
        self['file_key'].attrs.pop(key)
return self['file_key']
```
While this is maybe a little clumsy (assigning `self['file_key']` to a variable would have been clearer), the fact is that the getitem call `self['file_key']` returns a new `xr.DataArray` object every time, with a fresh `attrs` dictionary.
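The fresh-`attrs` behaviour can be illustrated with a small stand-alone sketch (plain dicts stand in for satpy's actual `HDF5FileHandler` and `xr.DataArray`; the class name and keys here are made up for illustration):

```python
class FileHandlerSketch:
    """Hypothetical stand-in for a file handler whose __getitem__
    builds a fresh object from the raw file content on every call."""

    def __init__(self):
        # raw file content, treated as read-only
        self.file_content = {"file_key": {"PALETTE": "h5-reference"}}

    def __getitem__(self, key):
        # a new dict (standing in for a new DataArray with a fresh
        # .attrs dictionary) is created on every access
        return {"attrs": dict(self.file_content[key])}


fh = FileHandlerSketch()
fh["file_key"]["attrs"].pop("PALETTE")       # mutates a throw-away copy
print("PALETTE" in fh["file_key"]["attrs"])  # → True: the edit is gone
```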
The objective here is not to modify the file, but to clean up the dataset before it leaves the reader, as we probably do in other readers to make the data more streamlined.
What bothers me here is that the behaviour of `self['file_key']` isn't intuitive given how `__getitem__` usually works in Python. If we aren't supposed to modify the returned DataArray, then I would like an exception to be raised when the `.attrs` dictionary is modified, but I think it's clear such behaviour would cause more harm than it's worth. The way it is now, imho, is just prone to bugs in the long term.
If the file were opened read-only, modifying `.attrs` would cause an error. Or you would open it in read/write mode and any change to `.attrs` would affect the file. The assumption with the design is that you access a variable once and use it once. I realize that this isn't guaranteed or checked in the code, but I don't think it is an unreasonable assumption either.
In summary, I get your point on why it is the way it is, and it's sane. It's just not very intuitive imo.
In the subclass file handler that is being created for this specific reader, why is `self[var_name]` being accessed more than once? Why isn't it:
```python
data_arr = self[var_name]
# update attributes
return data_arr
```
A weakref to the DataArray or to the file's Variable object (or whatever h5py calls it)? Wouldn't that get garbage collected right away, since you are running into this issue precisely because the DataArray isn't being held on to in the subclass file handler?
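For what it's worth, the garbage-collection concern can be sketched in isolation (a plain class stands in for the returned DataArray; in CPython the object is reclaimed as soon as the last strong reference goes away):

```python
import weakref


class Holder:
    """Stand-in for a returned DataArray that nothing holds on to."""


obj = Holder()
ref = weakref.ref(obj)
print(ref() is obj)   # → True while a strong reference exists
del obj
print(ref() is None)  # → True (CPython): collected once the strong ref is gone
```

So a weakref-based cache in the file handler would not keep the cleaned-up object alive between `get_dataset` calls.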
Regarding "the user can work with": is the "user" in this case the `Scene` user or the reader developer? A `Scene` user should not be able to do `scn[ds].attrs['new_attribute'] = 'X'` and affect the file handler's data underneath.
The `Scene` user, and yes, that's what I meant by decoupling the file from the dataset. My objective here is to make the code more intuitive. I guess you won't like this, but how about replacing `__getitem__` with a `get` method, so that we remove the square brackets? Or am I the only one expecting a syntax like `bla[something]` to return the same object every time?
Technically `bla[something]` does return the same object every time 😄
If you were going to replace it then it should be replaced with `get_data_array` or something similar, since `get` has an existing purpose in Python. That said, you and @adybbroe may be the only ones thinking about it this way. I think of `self[key]` as accessing the underlying file, and I think of the result as a copy of the contents of the file. It should be either a copy of what's in the file or a read-only version of the file content. Since the read-only version isn't possible with xarray (without subclassing), it is a copy of the file content.
Maybe I've just gotten used to writing readers that only access `self[key]` once and return the contents. Also keep in mind that `get_variable` (or whatever it would be called) would have to avoid name collisions with common names that reader developers might use, since we're dealing with subclasses.
Oh! Well, the thing is that I am reading HDF5 files that use HDF5 references, and currently these cause some headaches further down the chain, resampling for instance. So, in my reader I am removing those attributes. Here is a piece of code that works with the current version:
```python
def __init__(self, filename, filename_info, filetype_info):
    """Init method."""
    super(Hdf5NWCSAF, self).__init__(filename, filename_info,
                                     filetype_info)
    self.cache = {}

def get_dataset(self, dataset_id, ds_info):
    """Load a dataset."""
    file_key = ds_info.get('file_key', dataset_id.name)
    data = self[file_key]
    if 'SCALING_FACTOR' in data.attrs and 'OFFSET' in data.attrs:
        dtype = data.data.dtype
        data.data = (data.data * data.attrs['SCALING_FACTOR']
                     + data.attrs['OFFSET']).astype(dtype)
    for key in list(data.attrs.keys()):
        val = data.attrs[key]
        if isinstance(val, h5py.h5r.Reference):
            del data.attrs[key]
    return data
```
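The reference-stripping loop above could also be factored into a small helper; this is just a sketch (`strip_references` and `FakeRef` are made-up names, with `FakeRef` standing in for `h5py.h5r.Reference` so the example runs without h5py):

```python
def strip_references(attrs, reference_types):
    """Return a copy of attrs without values of the given reference types."""
    return {key: val for key, val in attrs.items()
            if not isinstance(val, reference_types)}


class FakeRef:
    """Stand-in for h5py.h5r.Reference."""


attrs = {"SCALING_FACTOR": 0.1, "OFFSET": 0.0, "PALETTE": FakeRef()}
clean = strip_references(attrs, FakeRef)
print(sorted(clean))  # → ['OFFSET', 'SCALING_FACTOR']
```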
At first I did:
```python
for key in list(self[file_key].attrs.keys()):
    val = self[file_key].attrs[key]
    if isinstance(val, h5py.h5r.Reference):
        del self[file_key].attrs[key]
```
...and couldn't understand why I didn't manage to remove the attribute. It took Martin and me some time to figure out why...
Oh (again), I didn't see that you had already discussed this quite far with @mraspaud.
@adybbroe What are the names of the attributes that are problems? They may be NetCDF files like the OMPS files I was dealing with.
@djhoese Here is the draft PR where this was an issue:
https://github.com/pytroll/satpy/pull/886
Not sure I understand your question. The data I am reading is an HDF5 file (NWCSAF/MSG formatted cloud products). The attribute that is a reference is named `PALETTE`, and it is needed for hdfview to generate a nice-looking image (applying the color palette to the dataset) of, say, the cloud pressure.
Ah ok, then it probably isn't. I mentioned this in the satpy issue about OMPS. Basically, NetCDF4 files, which are a form of HDF5 files, add special attributes for figuring out dimensions and related things. If you read a NetCDF4 file as an HDF5 file then these attributes show up as References like the ones you are seeing. This was happening for new OMPS files because they named their NetCDF4 files with a `.h5` extension. The reader expected these to be formatted the same as the older HDF5-format files, but they are not. The end result was a serialization error (`SystemError` iirc) during resampling, like you are seeing.
And @djhoese, I also better understand now the motivation for the current behaviour. But I am afraid I am not the only one who will fall into this trap again. In my code above I am changing `data`, which points to `self[file_key]`, but `self[file_key]` doesn't change...
@djhoese: What do you mean by "Ah ok, then it probably isn't"?
Then it probably isn't a NetCDF file being read as an HDF5 file.
For your code, why does it matter that `self[file_key]` isn't being updated? You already have access to `data`, you update it (remove the bad attributes), then return it. When/why is `self[file_key]` (for that specific `file_key`) being accessed again? Meaning, is there a method other than `get_dataset` that is loading that dataset?
@djhoese I think I am quite happy with the code as it is for my case. I was just saying that the behaviour of `__getitem__` fooled me here, so I could make the same mistake in the future. But I think my code as it is now is fine, and better than the (clumsy) example I showed above that didn't work.
And @djhoese, it is an HDF5 (not NetCDF4) file for sure! I was once part of the development of that format, using a library we created at SMHI ages ago: HL-HDF (Higher Level HDF) :-) The files I am reading were generated with HL-HDF and the HDF5 library (C code).
**Describe the bug**
I was creating a reader which is based on (inherits from) the `HDF5FileHandler` when we encountered a problem manipulating the attributes of `self[file_key]`. The next time I called `self[file_key]`, the changes made to the attributes were gone (overwritten). The problem appeared to be that the `__getitem__` method of the class `HDF5FileHandler` didn't store the new value in `self.file_content` before returning. Probably this should be changed so that the value is stored back.
Agree?
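A minimal sketch of the difference, using plain dicts in place of satpy's actual classes (the class names and the cached-value shape are invented for illustration only):

```python
class NoCache:
    """Current behaviour: __getitem__ builds a fresh copy on every call."""

    def __init__(self, file_content):
        self.file_content = file_content

    def __getitem__(self, key):
        return {"attrs": dict(self.file_content[key]["attrs"])}


class WithCache(NoCache):
    """Proposed behaviour: store the built value back into file_content
    so later accesses see earlier modifications."""

    def __getitem__(self, key):
        val = super().__getitem__(key)
        self.file_content[key] = val
        return val


nc = NoCache({"file_key": {"attrs": {"PALETTE": "ref"}}})
nc["file_key"]["attrs"].pop("PALETTE")
print("PALETTE" in nc["file_key"]["attrs"])  # → True: edit lost

wc = WithCache({"file_key": {"attrs": {"PALETTE": "ref"}}})
wc["file_key"]["attrs"].pop("PALETTE")
print("PALETTE" in wc["file_key"]["attrs"])  # → False: edit persists
```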