h5py / h5py

HDF5 for Python -- The h5py package is a Pythonic interface to the HDF5 binary data format.
http://www.h5py.org
BSD 3-Clause "New" or "Revised" License

ExternalLink does not work if the referenced file is a symlink of filesystem #860

Open heroxbd opened 7 years ago

heroxbd commented 7 years ago

In my setup, shar/sel0/mpro1/Plate_Pb210/011835.h5 has a "tt" field that is an ExternalLink to the "tt" field of tt/sel0/mpro1/Plate_Pb210/011835.h5. If the target file is a symlink, h5py cannot resolve the ExternalLink correctly.

$ ls -l tt/sel0/mpro1/Plate_Pb210/011835.h5  
208 Mar 24 15:51 tt/sel0/mpro1/Plate_Pb210/011835.h5 -> ../../../../.git/annex/objects/09/g5/SHA256E-s2226611--0d3aa2bc44bab5eef469ede96da9a457d8171d3bc80d2d7903d2c1b592baba7a.h5/SHA256E-s2226611--0d3aa2bc44bab5eef469ede96da9a457d8171d3bc80d2d7903d2c1b592baba7a.h5
$ ptdump shar/sel0/mpro1/Plate_Pb210/011835.h5
/ (RootGroup) ''
/pl (CArray(634,), zlib(7)) ''
/shar (Table(51,), zlib(7)) ''
/evt (ExternalLink) -> tt/sel0/mpro1/Plate_Pb210/011835.h5:evt
/tt (ExternalLink) -> tt/sel0/mpro1/Plate_Pb210/011835.h5:tt

$ python -c "import h5py; f=h5py.File('shar/sel0/mpro1/Plate_Pb210/011835.h5')['tt']" 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/dev/shm/portage/dev-python/h5py-2.6.0/work/h5py-2.6.0/h5py/_objects.c:2577)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/dev/shm/portage/dev-python/h5py-2.6.0/work/h5py-2.6.0/h5py/_objects.c:2536)
  File "/fefs/disk/usr100/gentoo/usr/lib64/python2.7/site-packages/h5py/_hl/group.py", line 166, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/dev/shm/portage/dev-python/h5py-2.6.0/work/h5py-2.6.0/h5py/_objects.c:2577)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/dev/shm/portage/dev-python/h5py-2.6.0/work/h5py-2.6.0/h5py/_objects.c:2536)
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open (/dev/shm/portage/dev-python/h5py-2.6.0/work/h5py-2.6.0/h5py/h5o.c:3546)
KeyError: "Unable to open object (Unable to open file: name = 'shar/sel0/mpro1/plate_pb210/tt/sel0/mpro1/plate_pb210/011835.h5', errno = 2, error message = 'no such file or directory', flags = 1, o_flags = 2)"

But the rhdf5 package for R, which links to HDF5 directly, can:

$ R
> require(rhdf5)
> tt = h5read("shar/sel0/mpro1/Plate_Pb210/011835.h5", "tt")

It also works if the target is a regular file.

$ git annex unlock tt/sel0/mpro1/Plate_Pb210/011835.h5
$ ls -l tt/sel0/mpro1/Plate_Pb210/011835.h5  
tt/sel0/mpro1/Plate_Pb210/011835.h5 (not a symlink anymore)
$ python -c "import h5py; f=h5py.File('shar/sel0/mpro1/Plate_Pb210/011835.h5')['tt']" 
(works)

Versions: h5py-2.6.0 and 2.7.0 (both tested), hdf5-1.8.18, rhdf5-2.14.0 (linked to system hdf5-1.8.18), python-2.7.12, R-3.3.2

sergsb commented 7 years ago

I have the same problem. The real name of the file is BCF_training_1/BCF_training_516_gcr.hdf, but it is a symlink, and when I try to open it I get: IOError: Unable to open file (Unable to open file: name = 'bcf_training_1/bcf_training_516_gcr.hdf', errno = 2, error message = 'no such file or directory', flags = 0, o_flags = 0)

tacaswell commented 7 years ago

I think the issue here is slightly malformed HDF5 files, and the symlink is a red herring; the actual issue is the relative paths and their dependence on the current working directory.

https://support.hdfgroup.org/HDF5/doc/RM/RM_H5L.html#Link-CreateExternal gives the lookup rules for external links, which are:

If target_file_name is a relative pathname, the following steps are performed:

  • The library will get the prefix(es) set in the environment variable HDF5_EXT_PREFIX and will try to prepend each prefix to target_file_name to form a new target_file_name.
  • If the new target_file_name does not exist or if HDF5_EXT_PREFIX is not set, the library will get the prefix set via H5Pset_elink_prefix and prepend it to target_file_name to form a new target_file_name.
  • If the new target_file_name does not exist or no prefix is being set by H5Pset_elink_prefix, then the path of the file associated with link_loc_id is obtained. This path can be the absolute path or the current working directory plus the relative path of that file when it is created/opened. The library will prepend this path to target_file_name to form a new target_file_name.
  • If the new target_file_name does not exist, then the library will look for target_file_name and will return failure/success accordingly.
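
The lookup order above can be sketched in plain Python to see which paths would be tried for a given link. This is only an illustration of the documented steps, not how HDF5 implements them: `external_link_candidates` is a hypothetical helper, it handles relative targets only, and it assumes that multiple prefixes in HDF5_EXT_PREFIX are separated by the platform path separator.

```python
import os


def external_link_candidates(target_file_name, parent_file,
                             elink_prefix=None, env=None):
    """List the paths tried for a relative external-link target, in order.

    Hypothetical helper mirroring the documented lookup steps; it never
    calls into HDF5 and does not check whether the paths exist.
    """
    if os.path.isabs(target_file_name):
        raise ValueError("this sketch only covers relative target names")
    env = os.environ if env is None else env

    candidates = []
    # Step 1: each prefix from HDF5_EXT_PREFIX
    # (assumption: prefixes are separated by os.pathsep).
    prefixes = env.get("HDF5_EXT_PREFIX", "").split(os.pathsep)
    for prefix in filter(None, prefixes):
        candidates.append(os.path.join(prefix, target_file_name))
    # Step 2: the prefix set via H5Pset_elink_prefix, if any.
    if elink_prefix:
        candidates.append(os.path.join(elink_prefix, target_file_name))
    # Step 3: the directory of the file that holds the link.
    parent_dir = os.path.dirname(parent_file)
    candidates.append(os.path.join(parent_dir, target_file_name))
    # Step 4: the target name as given, i.e. relative to the cwd.
    candidates.append(target_file_name)
    return candidates
```

With no prefixes set, a link stored as `annex/very/deep/base.h5` inside `/tmp/symlink/access/has_external_link.h5` is tried first next to the linking file and then relative to the current working directory, which is why the outcome can change with the cwd.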

If the working directory is 'right', the links will resolve (I am not sure why the OP's example does not work, given that the cwd should have been set right), but I would not depend on that. Storing relative paths with '..' seems to work more reliably.

import os  # needed for os.chdir in test_with_cwd below
from pathlib import Path
import h5py
import numpy as np

def compute_relative_with_dots(a, b):
    try:
        return a.relative_to(b)
    except ValueError:
        pass
    dot_path = Path('..')
    for p in b.parents:
        try:
            rp = a.relative_to(p)
            return dot_path / rp
        except ValueError:
            dot_path /= Path('..')

test_path = Path('/tmp/symlink')
data_path = test_path / 'data' / 'very' / 'deep'
annex_path = test_path / 'annex' / 'very' / 'deep'
access_path = test_path / 'access'

link_file = data_path / 'sym_base.h5'
target_file = annex_path / 'base.h5'

test_file = access_path / 'has_external_link.h5'

test_path.mkdir(exist_ok=True, parents=True)
data_path.mkdir(exist_ok=True, parents=True)
annex_path.mkdir(exist_ok=True, parents=True)
access_path.mkdir(exist_ok=True, parents=True)

with h5py.File(target_file, 'w') as f:
    f['target'] = np.ones(5)

try:
    link_file.unlink()
except FileNotFoundError:
    pass
link_file.symlink_to(compute_relative_with_dots(target_file, data_path))

with h5py.File(test_file, 'w') as f:
    f['in_file'] = np.ones(5) * 3
    f['ext_link'] = h5py.ExternalLink(target_file.relative_to(test_path), 'target')
    f['sym_ext_link'] = h5py.ExternalLink(link_file.relative_to(test_path), 'target')
    f['ext_link_works'] = h5py.ExternalLink(compute_relative_with_dots(target_file, access_path), 'target')
    f['sym_link_works'] = h5py.ExternalLink(compute_relative_with_dots(link_file, access_path), 'target')

def test_with_cwd(cwd_path):
    print('-'*25)
    print('with {} as cwd'.format(cwd_path))
    os.chdir(cwd_path)
    with h5py.File(test_file, 'r') as f:
        for k in ['in_file', 'ext_link', 'sym_ext_link', 'ext_link_works', 'sym_link_works']:
            try:
                print(f[k][:])
            except Exception as e:
                print(k)
                print(e)
    print('-'*25)
print()
test_with_cwd(Path('~').expanduser())
test_with_cwd(test_path)

which gives

-------------------------
with /home/tcaswell as cwd
[ 3.  3.  3.  3.  3.]
ext_link
"Unable to open object (Unable to open file: name = '/tmp/symlink/access/annex/very/deep/base.h5', errno = 2, error message = 'no such file or directory', flags = 0, o_flags = 0)"
sym_ext_link
"Unable to open object (Unable to open file: name = '/tmp/symlink/access/data/very/deep/sym_base.h5', errno = 2, error message = 'no such file or directory', flags = 0, o_flags = 0)"
[ 1.  1.  1.  1.  1.]
[ 1.  1.  1.  1.  1.]
-------------------------
-------------------------
with /tmp/symlink as cwd
[ 3.  3.  3.  3.  3.]
[ 1.  1.  1.  1.  1.]
[ 1.  1.  1.  1.  1.]
[ 1.  1.  1.  1.  1.]
[ 1.  1.  1.  1.  1.]
-------------------------

the h5 file is:

11:32 $ h5ls symlink/access/has_external_link.h5 
ext_link                 External Link {annex/very/deep/base.h5//target}
ext_link_works           External Link {../annex/very/deep/base.h5//target}
in_file                  Dataset {5}
sym_ext_link             External Link {data/very/deep/sym_base.h5//target}
sym_link_works           External Link {../data/very/deep/sym_base.h5//target}

and the file structure is:

11:32 $ ls -lR symlink/
symlink/:
total 0
drwxr-xr-x 2 tcaswell tcaswell 60 Jun 14 11:32 access
drwxr-xr-x 3 tcaswell tcaswell 60 Jun 14 11:32 annex
drwxr-xr-x 3 tcaswell tcaswell 60 Jun 14 11:32 data

symlink/access:
total 4
-rw-r--r-- 1 tcaswell tcaswell 2184 Jun 14 11:32 has_external_link.h5

symlink/annex:
total 0
drwxr-xr-x 3 tcaswell tcaswell 60 Jun 14 11:32 very

symlink/annex/very:
total 0
drwxr-xr-x 2 tcaswell tcaswell 60 Jun 14 11:32 deep

symlink/annex/very/deep:
total 4
-rw-r--r-- 1 tcaswell tcaswell 2184 Jun 14 11:32 base.h5

symlink/data:
total 0
drwxr-xr-x 3 tcaswell tcaswell 60 Jun 14 11:32 very

symlink/data/very:
total 0
drwxr-xr-x 2 tcaswell tcaswell 60 Jun 14 11:32 deep

symlink/data/very/deep:
total 0
lrwxrwxrwx 1 tcaswell tcaswell 32 Jun 14 11:32 sym_base.h5 -> ../../../annex/very/deep/base.h5

This is with master(ish) h5py, hdf5 1.10, and python 3.6.
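
As a side note, the '..'-style targets that resolve regardless of the cwd are just the target path expressed relative to the directory of the file holding the link. A minimal sketch of computing that with the standard library follows; `link_target_relative_to` is a hypothetical helper name, and the result would still need to be passed to h5py.ExternalLink as in the script above.

```python
import os


def link_target_relative_to(link_holder, target):
    """Return `target` as a path relative to the directory of `link_holder`.

    Hypothetical helper: symlinks are not resolved, so a symlinked target
    keeps its own name in the result.
    """
    holder_dir = os.path.dirname(os.path.abspath(link_holder))
    return os.path.relpath(os.path.abspath(target), start=holder_dir)


# Example with the layout from the test script:
print(link_target_relative_to(
    "/tmp/symlink/access/has_external_link.h5",
    "/tmp/symlink/annex/very/deep/base.h5",
))
```

For the layout above this should give the same target that h5ls reports for ext_link_works.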