nexpy / nexusformat

Provides an API to open, create, plot, and manipulate NeXus data.
Other
13 stars 19 forks source link

Replicate h5py.copy expand_external option #81

Closed toqduj closed 4 years ago

toqduj commented 5 years ago

Hello developers,

For Reasons of standards adherence, I need to transcode NeXus-to-NeXus. In a previous version of my code, I did this using the h5py library directly. However, I suspect that by using nexusformat's API, I can reduce my code and improve NeXus compatibility. So far, things are moving along well.

For copying a group from the old file to the new structure, I use h5py.copy (see http://docs.h5py.org/en/stable/high/group.html). This allows me to automatically include data from external links into the new file using the expand_external flag, so I can do away with the old file's dependencies. This appears not to be supported by nexusformat (yet).

Would it be possible to include this functionality using a flag in the nexusformat copy method?

rayosborn commented 5 years ago

I have just merged PR #87, which adds an nxduplicate script to the installation.

$ nxduplicate -h
usage: nxduplicate [-h] [-e] [-o] [-v] input output

Copy a NeXus file to another file

positional arguments:
  input                 name of NeXus input file
  output                name of NeXus output file

optional arguments:
  -h, --help            show this help message and exit
  -e, --expand_external
                        store external links within the new file
  -o, --overwrite       overwrite any existing file
  -v, --version         show program's version number and exit

In my tests, the -e switch seems to work well. It just utilizes the h5py copy function, so there is no real difference with a direct implementation in h5py, but I think I will find it useful in my own work, so thanks for the suggestion. I hope to release a new version of nexusformat soon.

toqduj commented 5 years ago

Hi @rayosborn, thanks for the addition. However, this does not allow the copying of partial trees from one location in one file to another location in another. I have used this method using h5py for this in the past:

def groupCopy(self, ingroup, outgroup):
    with h5py.File(self._outfile) as h5f, h5py.File(self._infile, 'r') as h5in:
        gClass = h5in.get(ingroup, default = "NonExistentGroup")
        if not isinstance(gClass, h5py.Group):
            print("group {} does not exist in input file".format(ingroup))
            return False

        # ensure output group exists:
        gid = h5f.require_group(outgroup)
        h5in.copy(ingroup, gid, expand_external = True)
    return True # no problems encountered

Would it be possible to extend the nexusformat's "copy" method to allow such group copies? We do this because the Dectris files we get contain large trees full of metadata. I'd like to copy these as is into the final file, without having to traverse the entire thing...

rayosborn commented 5 years ago

I think that would be possible. There are a couple of h5py functions that probably could be mirrored in the nexusformat package. If I implement what you ask for, it would just use the h5py function to accomplish the same task, but there might be some syntactic value in doing it as a NXgroup method.

rayosborn commented 5 years ago

I've been looking into this, but before I commit any new code, I want to understand your use case better. Would something like this do what you want?

>>> input = nxload('infile.nxs', mode='r')
>>> print(input.tree)
input:NXroot
  entry:NXentry
    data:NXdata
      data -> external.nxs['/entry/data/v']
>>> output = nxload('outfile.nxs', mode='a')
>>> output['entry'] = NXentry()
>>> input['entry/data'].copy('output['entry'], expand_external=True)
>>> print(output.tree)
output:NXroot
  entry:NXentry
    data:NXdata
      data = float32(1000x1000x1000)
toqduj commented 5 years ago

Yes, that's approximately what I'd write. It would copy the entire tree (datasets and attributes) beneath input['entry/data']. I would probably use (if possible) the following syntax for clarity:

>>> output['entry/'] = input['entry/data'].copy(expand_external=True)

Taking care to adhere to the syntax used in the Unix copy and move commands regarding addition or omission of trailing slashes... But I could work with your proposed syntax too.

rayosborn commented 5 years ago

I'll look into that. My version is a wrapper for the h5py.Group copy function, which has the source and destination as the first arguments. The only problem with your version is that the destination is not defined in the copy function. That would either require the copy to be placed in an intermediate cache, for example using an HDF5 core memory file, which would be rather inefficient, or for the __setitem__ function that is called by the dictionary assignment to catch that the right-hand-side is a copy. I would have to think how to do that. I like the syntax of your proposal, if it can be realized in practice.

Concerning trailing slashes, you are free to add them to keep your code consistent, but they are stripped off by the nexusformat parser. I am guessing the Unix convention arose because a directory is also a kind of file, whereas in our case, it is not ambiguous what kind of object the path refers to.

tacaswell commented 5 years ago

You could have the copy be lazy and return "shell" object. When you assign an object in __setitem__ there is going to be some step where you tell the RHS what file it is going to be written to etc. Presumably you can defer doing the actual work until that point?

rayosborn commented 5 years ago

I agree. I use a similar idea when assigning large fields to groups that are not yet saved to a file. I put a pointer to the original data when I perform a deepcopy of the field, which is then resolved when the data is saved. I guess I could do the same with the group deepcopy, but I would have to transfer all the keyword arguments used in the copy function. That might not be too hard.

rayosborn commented 5 years ago

I have merged a PR, which contains an implementation of the copy method for both NXgroups and NXfields. Presently, it only works using the syntax you suggest. The copy function returns a skeleton NXobject that is parsed when performing a dictionary assignment.

>>> output['entry/data']=input['entry/data'].copy(expand_external=True)

So it won't work if someone tries, e.g.

>>> output['entry/data']=NXdata(input['entry/data'].copy(expand_external=True))

That would require more code refactoring than I currently have time for.

toqduj commented 5 years ago

Hey, fantastic! Thank you very much for the quick turnaround, I'll use it right away! If ever we meet, I'll get you a nice coffee :).

toqduj commented 4 years ago

I appear to be getting an error when trying to use this. I'm supposed to use this on tree objects, no?

I've used the latest GitHub repo, and installed it using python setup.py install.. However, if the "ge7f6178" is supposed to be a commit signature, then I've done something wrong as it doesn't match the latest one for the library...

self._inTree and self._outTree are initialized from different files.

<ipython-input-13-62ac10af85d9> in groupCopy(self, ingroup, outgroup, expand_external)
    211     def groupCopy(self, ingroup, outgroup, expand_external = True):
    212         # can be used to copy chunks of trees. Will not work on (partially) existing trees
--> 213         self._outTree[outgroup] = self._inTree[ingroup].copy(expand_external=True)
    214 

~/anaconda3/envs/notebook37/lib/python3.7/site-packages/nexusformat-0.4.20+186.ge7f6178-py3.7.egg/nexusformat/nexus/tree.py in __setitem__(self, key, value)
   3883                 if group.nxfilemode is None and value._copyfile is not None:
   3884                     raise NeXusError(
-> 3885                         "Can only copy objects to another NeXus file.")
   3886                 if value._group:
   3887                     value = deepcopy(value)

NeXusError: Can only copy objects to another NeXus file.
rayosborn commented 4 years ago

You have the right commit. It's odd because the error implies that self._outTree is not saved to a file, i.e., self._outTree.nxfilemode show be 'rw' assuming the file is open for writing, not None. There is a pytest that follows exactly your use case. It's the function test_group_copy in nexusformat/tests/test_groups.py. Would you be able to run pytest in the top-level package directory? If that works, I'll need to have more details of how self._outTree is initialized.

toqduj commented 4 years ago

I'll have to get back to this in two weeks.. Might be related to the way it's created, as it's a copy from a NeXus template from another file.

rayosborn commented 4 years ago

This will be closed due to inactivity. If it is still an issue, please reopen with further details.