torralba-lab / im2recipe

Code supporting the CVPR 2017 paper "Learning Cross-modal Embeddings for Cooking Recipes and Food Images"
MIT License

Intermittent problems with HDF5 #5

Closed MicaelCarvalho closed 7 years ago

MicaelCarvalho commented 7 years ago

Hello,

We successfully ran the code for reproducing your results. However, we're facing an intermittent problem with HDF5. The program runs for a few hours normally, but after some time HDF5 crashes, apparently due to a memory leak — we can usually relaunch it and continue training later. I'm sending the full logs below.

We have tried different versions of HDF5 1.8, without success, and we weren't able to update the HDF5 version to 1.10, since it is incompatible with torch-hdf5.

I believe this issue is not coming from your code, but rather from a bad HDF5 integration or version. But could you please inform us whether you faced the same problem, and if you managed to solve it? If not, would you be able to disclose the HDF5 version you're using, as well as the torch version?

Thanks in advance! :-)

HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 139679846299392:
  #000: H5D.c line 463 in H5Dget_space(): unable to get dataspace                     
    major: Dataset        
    minor: Unable to initialize object                            
  #001: H5Dint.c line 2769 in H5D_get_space(): unable to register dataspace
    major: Object atom
    minor: Unable to register new atom
  #002: H5I.c line 921 in H5I_register(): no IDs available in type         
    major: Object atom
    minor: Out of IDs for group       
HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 139679846299392:
  #000: H5D.c line 463 in H5Dget_space(): unable to get dataspace
    major: Dataset             
    minor: Unable to initialize object                            
  #001: H5Dint.c line 2769 in H5D_get_space(): unable to register dataspace
    major: Object atom
    minor: Unable to register new atom
  #002: H5I.c line 921 in H5I_register(): no IDs available in type         
    major: Object atom
    minor: Out of IDs for group       
/home/username/torch-pascal/install/bin/luajit: ...e/torch-pascal/install/share/lua/5.1/threads/threads.lua:183: [thread 3 callback] .../username/torch-pascal/install/share/lua/5.1/hdf5/init.lua:83: Unable to get dataspace for dataset 'ims_train' in [HDF5Group 33812542 /]!
stack traceback:
        [C]: in function 'error'
        .../username/torch-pascal/install/share/lua/5.1/hdf5/init.lua:83: in function '_loadObject'
        ...username/torch-pascal/install/share/lua/5.1/hdf5/group.lua:58: in function <...username/torch-pascal/install/share/lua/5.1/hdf5/group.lua:55>
        [C]: in function 'H5Literate'
        ...username/torch-pascal/install/share/lua/5.1/hdf5/group.lua:61: in function '__init'
        /home/username/.luarocks/share/lua/5.1/torch/init.lua:91: in function </home/username/.luarocks/share/lua/5.1/torch/init.lua:87>
        [C]: in function 'HDF5Group'
        .../username/torch-pascal/install/share/lua/5.1/hdf5/init.lua:74: in function '_loadObject'
        .../username/torch-pascal/install/share/lua/5.1/hdf5/file.lua:19: in function '__init'
        /home/username/.luarocks/share/lua/5.1/torch/init.lua:91: in function </home/username/.luarocks/share/lua/5.1/torch/init.lua:87>
        [C]: in function 'open'
        ./loader/DataLoader.lua:43: in function <./loader/DataLoader.lua:37>
        [C]: in function 'xpcall'
        ...e/torch-pascal/install/share/lua/5.1/threads/threads.lua:234: in function 'callback'
        ...ene/torch-pascal/install/share/lua/5.1/threads/queue.lua:65: in function <...ene/torch-pascal/install/share/lua/5.1/threads/queue.lua:41>
        [C]: in function 'pcall'
        ...ene/torch-pascal/install/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
        [string "  local Queue = require 'threads.queue'..."]:15: in main chunk
stack traceback:
        [C]: in function 'error'
        ...e/torch-pascal/install/share/lua/5.1/threads/threads.lua:183: in function 'dojob'
        ...e/torch-pascal/install/share/lua/5.1/threads/threads.lua:223: in function 'addjob'
        /net/big/username/doc/im2recipe/drivers/train.lua:191: in function </net/big/username/doc/im2recipe/$
rivers/train.lua:189>
        /net/big/username/doc/im2recipe/drivers/init.lua:43: in function 'train'
        main.lua:70: in main chunk
        [C]: in function 'dofile'
        ...rch-pascal/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: at 0x00405b60
HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 139679846299392:
  #000: H5D.c line 463 in H5Dget_space(): unable to get dataspace
    major: Dataset
    minor: Unable to initialize object
  #001: H5Dint.c line 2769 in H5D_get_space(): unable to register dataspace
    major: Object atom
    minor: Unable to register new atom
  #002: H5I.c line 921 in H5I_register(): no IDs available in type
    major: Object atom
    minor: Out of IDs for group
HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 139679846299392:
  #000: H5D.c line 463 in H5Dget_space(): unable to get dataspace
    major: Dataset
    minor: Unable to initialize object
  #001: H5Dint.c line 2769 in H5D_get_space(): unable to register dataspace
    major: Object atom
    minor: Unable to register new atom
  #002: H5I.c line 921 in H5I_register(): no IDs available in type
    major: Object atom
    minor: Out of IDs for group
HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 139679846299392:
  #000: H5D.c line 463 in H5Dget_space(): unable to get dataspace
    major: Dataset
    minor: Unable to initialize object
  #001: H5Dint.c line 2769 in H5D_get_space(): unable to register dataspace
    major: Object atom
    minor: Unable to register new atom
  #002: H5I.c line 921 in H5I_register(): no IDs available in type
    major: Object atom
    minor: Out of IDs for group
HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 139679846299392:
  #000: H5D.c line 463 in H5Dget_space(): unable to get dataspace
    major: Dataset
    minor: Unable to initialize object
  #001: H5Dint.c line 2769 in H5D_get_space(): unable to register dataspace
    major: Object atom
    minor: Unable to register new atom
  #002: H5I.c line 921 in H5I_register(): no IDs available in type
    major: Object atom
    minor: Out of IDs for group
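
The diagnostic stacks above repeat the same three frames, so the innermost failure is easier to spot once they are condensed. The following is a minimal sketch, assuming the log has been saved as plain text; `summarize_hdf5_diag` and `FRAME_RE` are hypothetical names chosen for illustration and are not part of HDF5 or torch-hdf5.

```python
import re
from collections import Counter

# Matches a frame line of an HDF5-DIAG stack, for example:
#   "#002: H5I.c line 921 in H5I_register(): no IDs available in type"
FRAME_RE = re.compile(
    r"#(?P<depth>\d+):\s+(?P<file>\S+)\s+line\s+(?P<line>\d+)\s+in\s+"
    r"(?P<func>\w+)\(\):\s+(?P<msg>.+)"
)


def summarize_hdf5_diag(log_text):
    """Count the innermost frame of each HDF5-DIAG stack in log_text.

    Returns a Counter keyed by (function name, message).
    """
    counts = Counter()
    in_stack = False  # True once an "HDF5-DIAG:" header has been seen
    deepest = None    # last frame seen in the current stack
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.startswith("HDF5-DIAG:"):
            # A new stack starts: record the previous one, if any.
            if deepest is not None:
                counts[deepest] += 1
            in_stack, deepest = True, None
            continue
        match = FRAME_RE.match(line) if in_stack else None
        if match:
            # Frames are printed outermost first, so the last one is innermost.
            deepest = (match.group("func"), match.group("msg"))
    if deepest is not None:
        counts[deepest] += 1
    return counts


# One stack copied from the log above, repeated twice for the demo.
sample = """\
HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 139679846299392:
  #000: H5D.c line 463 in H5Dget_space(): unable to get dataspace
    major: Dataset
    minor: Unable to initialize object
  #001: H5Dint.c line 2769 in H5D_get_space(): unable to register dataspace
    major: Object atom
    minor: Unable to register new atom
  #002: H5I.c line 921 in H5I_register(): no IDs available in type
    major: Object atom
    minor: Out of IDs for group
"""

for (func, msg), n in summarize_hdf5_diag(sample * 2).items():
    print(f"{n}x {func}: {msg}")  # prints "2x H5I_register: no IDs available in type"
```

Every stack shown in this log ends in `H5I_register()` reporting that no IDs are available, so the failure to read `ims_train` is a consequence of that condition and not a separate error.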

amaiasalvador commented 7 years ago

I've never seen this error before. It seems related to this issue in the torch-hdf5 repo, although it does not seem to be solved.

We are using HDF5 1.8.11. Regarding torch, this is the last commit in the version we are using:

commit a58889e5289ca16b78ec7223dd8bbc2e01ef97e0
Merge: cb3ad52 8abc4ba
Author: Soumith Chintala <soumith@gmail.com>
Date: Thu Oct 27 11:45:31 2016 -0400

    Merge pull request #170 from howard0su/winbuild

    Support build torch7 on windows

Cadene commented 7 years ago

Thanks for your help.

Micael and I are 99% sure that we've found a fix. We can't be fully certain yet, because the error takes time to appear (training usually crashes after 24 hours). In any case, since https://github.com/torralba-lab/im2recipe/commit/67da133916d51cecbebd7b46b2947fc8ea1a71f2, we have not encountered the HDF5 error again.