jonathanasdf opened 8 years ago

I'm using many data files in HDF5 format to train a neural network. After running for many epochs over a few hours, it crashes with an error. It seems to be a known(?) bug that exists in both 1.8.14 and 1.8.16: https://stackoverflow.com/questions/35522633/hdf5-consumes-all-resource-ids-for-dataspaces-and-exits-c-api

I can reproduce it with this if I just let it run for a while (to be precise, around 2^24 = 16777216 iterations).

Any ideas? Should I just not use hdf5?
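The OP's attached sample program is not preserved in this thread. A minimal sketch of the same open/get-space/close pattern is below; the file name "train.h5" and dataset name "data" are placeholders, not taken from the original code.

```c
#include <stdio.h>
#include "hdf5.h"

int main(void)
{
    /* Placeholder names; substitute your own file and dataset. */
    hid_t file = H5Fopen("train.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset = H5Dopen2(file, "data", H5P_DEFAULT);
    long i;

    if (file < 0 || dset < 0) {
        fprintf(stderr, "failed to open file or dataset\n");
        return 1;
    }

    /* Acquire and immediately release a dataspace ID, over and over.
     * Every ID is closed, yet on HDF5 1.8.x this still fails after
     * roughly 2^24 iterations with "no IDs available in type". */
    for (i = 0; i < (1L << 24) + 100; ++i) {
        hid_t space = H5Dget_space(dset);
        if (space < 0) {
            fprintf(stderr, "H5Dget_space failed at iteration %ld\n", i);
            break;
        }
        H5Sclose(space);
    }

    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}
```

Build with something like `gcc repro.c -lhdf5` and let it run; the failure point should match the ~2^24 figure above.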
Hmm, that's a pain. It's probably possible to use a similar workaround to the one described in the thread you linked. Did you find any way around it in the end?
No, I moved over to https://github.com/jonathantompson/torchzlib instead and it has worked for me with no problems.
Was this problem ever resolved? I am using HDF5 1.8.17 built with enable-threadsafe, training with multiple threads that load many HDF5 files during network training. It never seems to release memory, and it crashes after a few hours with 'Too many open files', although I have checked many times that the opened files are closed.
HDF5-DIAG: Error detected in HDF5 (1.8.17) thread 140023236646656:
  #000: H5F.c line 604 in H5Fopen(): unable to open file
    major: File accessibilty
    minor: Unable to open file
  #001: H5Fint.c line 992 in H5F_open(): unable to open file: time = Wed Nov 2 11:31:31 2016, name = '00018.h5', tent_flags = 0
    major: File accessibilty
    minor: Unable to open file
  #002: H5FD.c line 993 in H5FD_open(): open failed
    major: Virtual File Layer
    minor: Unable to initialize object
  #003: H5FDsec2.c line 339 in H5FD_sec2_open(): unable to open file: name = '00018.h5', errno = 24, error message = 'Too many open files', flags = 0, o_flags = 0
    major: File accessibilty
    minor: Unable to open file
The same code doesn't crash with HDF5 1.8.12; it just leaks memory in the same way.
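For the 'Too many open files' variant, one way to check whether handles are really being released is to ask the library itself. Below is a sketch using the standard H5Fget_obj_count() API; calling it periodically from the training loop shows whether open IDs accumulate.

```c
#include <stdio.h>
#include "hdf5.h"

/* Print how many object IDs HDF5 still considers open. Passing
 * H5F_OBJ_ALL as the file id counts IDs across every open file. */
static void report_open_ids(void)
{
    ssize_t files    = H5Fget_obj_count((hid_t)H5F_OBJ_ALL, H5F_OBJ_FILE);
    ssize_t datasets = H5Fget_obj_count((hid_t)H5F_OBJ_ALL, H5F_OBJ_DATASET);
    ssize_t other    = H5Fget_obj_count((hid_t)H5F_OBJ_ALL,
                           H5F_OBJ_GROUP | H5F_OBJ_DATATYPE | H5F_OBJ_ATTR);

    printf("open HDF5 IDs: files=%zd datasets=%zd other=%zd\n",
           files, datasets, other);
}
```

If these counts grow without bound, something is leaking handles; if they stay flat while the crash still happens, the problem is inside the library, consistent with the ID exhaustion discussed elsewhere in this thread.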
I'm having what appears to be a similar issue, but with an earlier version of HDF5:
HDF5-DIAG: Error detected in HDF5 (1.8.11) thread 139866049939328:
  #000: ../../../src/H5D.c line 445 in H5Dget_space(): unable to register data space
    major: Object atom
    minor: Unable to register new atom
  #001: ../../../src/H5I.c line 951 in H5I_register(): no IDs available in type
    major: Object atom
    minor: Out of IDs for group
This is after many iterations of open/read/close on an HDF5 file. Usually the program just hangs forever; I feel like I was "lucky" to even see the error message.
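A possible mitigation while stuck on 1.8.x, sketched below under two assumptions that do not come from this thread (a dataset layout that never changes, and a dataset named "data"): open the dataset and its dataspace once and reuse them across reads. Since the ID counter apparently runs out after ~2^24 registrations even when every ID is closed, allocating fewer IDs per iteration postpones or avoids the exhaustion.

```c
#include "hdf5.h"

typedef struct {
    hid_t file, dset, space;
} cached_handles;

/* Acquire the handles once, up front. */
int cache_open(cached_handles *h, const char *path)
{
    h->file  = H5Fopen(path, H5F_ACC_RDONLY, H5P_DEFAULT);
    h->dset  = H5Dopen2(h->file, "data", H5P_DEFAULT);  /* assumed name */
    h->space = H5Dget_space(h->dset);  /* one dataspace ID, held for the run */
    return (h->file < 0 || h->dset < 0 || h->space < 0) ? -1 : 0;
}

/* Reuses the cached IDs; no new IDs are registered per read. */
int cache_read(cached_handles *h, double *buf)
{
    return H5Dread(h->dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL,
                   H5P_DEFAULT, buf) < 0 ? -1 : 0;
}

void cache_close(cached_handles *h)
{
    H5Sclose(h->space);
    H5Dclose(h->dset);
    H5Fclose(h->file);
}
```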
I created a fork of torch-hdf5 that works with HDF5 1.10 (https://github.com/anibali/torch-hdf5/tree/hdf5-1.10), installed HDF5 1.10 with 1.8 API compatibility, and reran the sample program provided by the OP. The program now finishes successfully, whereas before it did not. So either (a) the issue is properly fixed in newer versions of HDF5, or (b) the new 64-bit IDs in 1.10 merely increase the number of available IDs, and they will still eventually run out given enough time.
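If anyone else tries the 1.10 route, it may be worth confirming at runtime which library the process actually linked against, since a stray 1.8 system copy can easily shadow a local 1.10 build. A minimal check with the standard H5get_libversion() call:

```c
#include <stdio.h>
#include "hdf5.h"

int main(void)
{
    unsigned maj, min, rel;

    /* Reports the version of the HDF5 library actually loaded,
     * which may differ from the headers you compiled against. */
    if (H5get_libversion(&maj, &min, &rel) >= 0)
        printf("linked against HDF5 %u.%u.%u\n", maj, min, rel);
    return 0;
}
```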