jonathanasdf opened 8 years ago

I'm using many data files in HDF5 format to train a neural network. After running for many epochs over a few hours, it crashes with an error. It seems to be a known(?) bug that exists in both 1.8.14 and 1.8.16: https://stackoverflow.com/questions/35522633/hdf5-consumes-all-resource-ids-for-dataspaces-and-exits-c-api

I can reproduce it with this if I just let it run for a while (to be precise, around 2^24 = 16777216 iterations).

Any ideas? Should I just not use hdf5?
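The OP's attached sample program is not preserved in this thread. A minimal sketch of the same open/get-space/close pattern is below; the file name "train.h5" and dataset name "data" are placeholders, not taken from the original code.

```c
#include <stdio.h>
#include "hdf5.h"

int main(void)
{
    /* Placeholder names; substitute your own file and dataset. */
    hid_t file = H5Fopen("train.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset = H5Dopen2(file, "data", H5P_DEFAULT);
    long i;

    if (file < 0 || dset < 0) {
        fprintf(stderr, "failed to open file or dataset\n");
        return 1;
    }

    /* Acquire and immediately release a dataspace ID, over and over.
     * Every ID is closed, yet on HDF5 1.8.x this still fails after
     * roughly 2^24 iterations with "no IDs available in type". */
    for (i = 0; i < (1L << 24) + 100; ++i) {
        hid_t space = H5Dget_space(dset);
        if (space < 0) {
            fprintf(stderr, "H5Dget_space failed at iteration %ld\n", i);
            break;
        }
        H5Sclose(space);
    }

    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}
```

Build with something like `gcc repro.c -lhdf5` and let it run; the failure point should match the ~2^24 figure above.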
Hmm, that's a pain. It's probably possible to use a similar workaround to the one described in the thread you linked. Did you find any way around it in the end?
No, I moved over to https://github.com/jonathantompson/torchzlib instead and it has worked for me with no problems.
Was this problem ever resolved? I am using HDF5 1.8.17 built with enable-threadsafe, training with multiple threads that load many HDF5 files during network training. It never seems to release memory, and it crashes after a few hours with 'Too many open files', although I have checked many times that the opened files are closed.
HDF5-DIAG: Error detected in HDF5 (1.8.17) thread 140023236646656:
  #000: H5F.c line 604 in H5Fopen(): unable to open file
    major: File accessibilty
    minor: Unable to open file
  #001: H5Fint.c line 992 in H5F_open(): unable to open file: time = Wed Nov 2 11:31:31 2016, name = '00018.h5', tent_flags = 0
    major: File accessibilty
    minor: Unable to open file
  #002: H5FD.c line 993 in H5FD_open(): open failed
    major: Virtual File Layer
    minor: Unable to initialize object
  #003: H5FDsec2.c line 339 in H5FD_sec2_open(): unable to open file: name = '00018.h5', errno = 24, error message = 'Too many open files', flags = 0, o_flags = 0
    major: File accessibilty
    minor: Unable to open file
The same code doesn't crash with HDF5 1.8.12; it just leaks memory in the same way.
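For the 'Too many open files' variant, one way to check whether handles are really being released is to ask the library itself. Below is a sketch using the standard H5Fget_obj_count() API; calling it periodically from the training loop shows whether open IDs accumulate.

```c
#include <stdio.h>
#include "hdf5.h"

/* Print how many object IDs HDF5 still considers open. Passing
 * H5F_OBJ_ALL as the file id counts IDs across every open file. */
static void report_open_ids(void)
{
    ssize_t files    = H5Fget_obj_count((hid_t)H5F_OBJ_ALL, H5F_OBJ_FILE);
    ssize_t datasets = H5Fget_obj_count((hid_t)H5F_OBJ_ALL, H5F_OBJ_DATASET);
    ssize_t other    = H5Fget_obj_count((hid_t)H5F_OBJ_ALL,
                           H5F_OBJ_GROUP | H5F_OBJ_DATATYPE | H5F_OBJ_ATTR);

    printf("open HDF5 IDs: files=%zd datasets=%zd other=%zd\n",
           files, datasets, other);
}
```

If these counts grow without bound, something is leaking handles; if they stay flat while the crash still happens, the problem is inside the library, consistent with the ID exhaustion discussed elsewhere in this thread.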
I'm having what appears to be a similar issue, but with an earlier version of HDF5:
HDF5-DIAG: Error detected in HDF5 (1.8.11) thread 139866049939328:
  #000: ../../../src/H5D.c line 445 in H5Dget_space(): unable to register data space
    major: Object atom
    minor: Unable to register new atom
  #001: ../../../src/H5I.c line 951 in H5I_register(): no IDs available in type
    major: Object atom
    minor: Out of IDs for group
This is after many iterations of open/read/close on an HDF5 file. Usually the program just hangs forever; I feel like I was "lucky" to even see the error message.
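A possible mitigation while stuck on 1.8.x, sketched below under two assumptions that do not come from this thread (a dataset layout that never changes, and a dataset named "data"): open the dataset and its dataspace once and reuse them across reads. Since the ID counter apparently runs out after ~2^24 registrations even when every ID is closed, allocating fewer IDs per iteration postpones or avoids the exhaustion.

```c
#include "hdf5.h"

typedef struct {
    hid_t file, dset, space;
} cached_handles;

/* Acquire the handles once, up front. */
int cache_open(cached_handles *h, const char *path)
{
    h->file  = H5Fopen(path, H5F_ACC_RDONLY, H5P_DEFAULT);
    h->dset  = H5Dopen2(h->file, "data", H5P_DEFAULT);  /* assumed name */
    h->space = H5Dget_space(h->dset);  /* one dataspace ID, held for the run */
    return (h->file < 0 || h->dset < 0 || h->space < 0) ? -1 : 0;
}

/* Reuses the cached IDs; no new IDs are registered per read. */
int cache_read(cached_handles *h, double *buf)
{
    return H5Dread(h->dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL,
                   H5P_DEFAULT, buf) < 0 ? -1 : 0;
}

void cache_close(cached_handles *h)
{
    H5Sclose(h->space);
    H5Dclose(h->dset);
    H5Fclose(h->file);
}
```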
I created a fork of torch-hdf5 that works with HDF5 1.10 (https://github.com/anibali/torch-hdf5/tree/hdf5-1.10), installed HDF5 1.10 with 1.8 API compatibility, and reran the sample program provided by the OP. The program now finishes successfully, whereas before it did not. So either (a) the issue is properly fixed in newer versions of HDF5, or (b) the new 64-bit IDs in 1.10 merely increase the number of available IDs, and they will still eventually run out given enough time.
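If anyone else tries the 1.10 route, it may be worth confirming at runtime which library the process actually linked against, since a stray 1.8 system copy can easily shadow a local 1.10 build. A minimal check with the standard H5get_libversion() call:

```c
#include <stdio.h>
#include "hdf5.h"

int main(void)
{
    unsigned maj, min, rel;

    /* Reports the version of the HDF5 library actually loaded,
     * which may differ from the headers you compiled against. */
    if (H5get_libversion(&maj, &min, &rel) >= 0)
        printf("linked against HDF5 %u.%u.%u\n", maj, min, rel);
    return 0;
}
```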