ratschlab / HIRID-ICU-Benchmark

Repository for the HiRID ICU Benchmark (HiB) project
MIT License
49 stars 10 forks

HDF5 issue when reading hdf5 file #9

Closed minasmz closed 2 years ago

minasmz commented 2 years ago

Hi,

I was trying to run the script for GRU and I faced the following error:

self._g_read_slice(startl, stopl, stepl, nparr) File "tables/hdf5extension.pyx", line 1585, in tables.hdf5extension.Array._g_read_slice tables.exceptions.HDF5ExtError: HDF5 error back trace

File "H5Dio.c", line 199, in H5Dread can't read data File "H5Dio.c", line 601, in H5Dread can't read data File "H5Dchunk.c", line 2229, in H5Dchunk_read unable to read raw data chunk File "H5Dchunk.c", line 3609, in H5D__chunk_lock data pipeline read failed File "H5Z.c", line 1326, in H5Z_pipeline filter returned failure during read File "hdf5-blosc/src/blosc_filter.c", line 254, in blosc_filter Blosc decompression error

End of HDF5 error back trace

Problems reading the array data.

In call to configurable 'train' (<function DLWrapper.train at 0x7fcbb45d11f0>) In call to configurable 'train_common' (<function train_common at 0x7fcbb4c64e50>) Closing remaining open files:../output_dir/ml_stage/ml_stage.h5...done../output_dir/ml_stage/ml_stage.h5...done

Process finished with exit code 1

I found that this issue may be due to thread safety in HDF5. I am trying to recompile HDF5 manually, but I am not sure this is the source of the issue. I wonder if you have faced a similar problem when setting the on_RAM flag to False and running the code? Also, it would be great if the required HDF5 version could be specified.

Thank you.

hugoych commented 2 years ago

Hi, I would need a bit more information to see if I can reproduce your issue.

minasmz commented 2 years ago

Thank you for your quick response. I followed the exact instructions you mentioned, and pytables and h5py were installed accordingly. I looked around for this issue, and it seems to be due to the base HDF5 library installed on the OS (Ubuntu 18.04). A similar issue is mentioned here . I tried to recompile HDF5 myself, but the problem still exists. About your second question, the Mortality in 24h task is running. Thank you

hugoych commented 2 years ago

Hi, thanks for the clarification! Given that this seems to be an issue with the HDF5 version, could you give me the output of the following command right after you run the setup commands, please?

conda activate icu-benchmark
conda list

hdf5 should be installed at version 1.10.4
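Besides `conda list`, the HDF5 version that PyTables links against can be queried from Python. `tables.hdf5_version` is a real PyTables attribute; the import guard and the `matches_pin` helper below are illustrative additions so the sketch degrades gracefully when PyTables is absent:

```python
# Report the HDF5 library version PyTables was compiled against.
try:
    import tables
    detected = tables.hdf5_version  # e.g. "1.10.4"
except ImportError:
    detected = None  # PyTables not installed in this environment

def matches_pin(version, expected="1.10.4"):
    """True when the detected HDF5 version equals the benchmark's pin."""
    return version == expected

if detected is not None:
    print("HDF5", detected, "OK" if matches_pin(detected) else "MISMATCH")
```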

minasmz commented 2 years ago

Hi,

Yes, the hdf5 version is 1.10.4. The versions of the packages are exactly the same as the requirements.

minasmz commented 2 years ago

Even after compiling and installing HDF5 (the base C library installed on Ubuntu) with the thread-safe configuration, the problem was not solved.

hugoych commented 2 years ago

Hi, could you check that you are running on GPU, either by verifying that CUDA_VISIBLE_DEVICES is not empty or that torch.cuda.is_available() == True? Indeed, I was able to reproduce your error by running on CPU with on_RAM=False. In that case, we set the number of workers to 16 in the training wrapper, see: https://github.com/ratschlab/HIRID-ICU-Benchmark/blob/8bb039819e0693fa44d7fcb5d3a27097e843b3b2/icu_benchmarks/models/wrappers.py#L43
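A minimal stdlib-only sketch of the environment-variable side of that check. `torch.cuda.is_available()` remains the authoritative test (it needs a PyTorch install); the heuristic below only inspects how CUDA device visibility is conventionally configured, and the function name is illustrative:

```python
import os

def gpu_probably_visible(env=os.environ) -> bool:
    """Heuristic GPU-visibility check via CUDA_VISIBLE_DEVICES.

    An empty CUDA_VISIBLE_DEVICES (or "-1") explicitly hides every GPU,
    while an unset variable leaves the CUDA default (all devices visible).
    torch.cuda.is_available() is the definitive check.
    """
    visible = env.get("CUDA_VISIBLE_DEVICES")
    return visible is None or visible.strip() not in ("", "-1")
```

For example, launching training with `CUDA_VISIBLE_DEVICES=""` silently falls back to CPU, which is exactly the setting that triggers the multi-worker HDF5 read below.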

This multi-worker behavior is not compatible with the HDF5 decompression procedure. I'll push a fix so that multiple workers are not used in that setting! In the meantime, you can work around the issue by changing the number of workers in icu_benchmarks/models/wrappers.py at line 43:

 self.n_worker = 1
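The planned fix can be sketched as a small helper that picks the worker count from the runtime setting. The name `choose_n_worker` and its parameters are illustrative, not the repo's actual API:

```python
def choose_n_worker(cuda_available: bool, on_ram: bool, default: int = 16) -> int:
    """Pick a DataLoader worker count that avoids concurrent HDF5 reads.

    With on_RAM=False the dataset is decompressed lazily from the .h5 file;
    multiple CPU workers then hit the non-thread-safe Blosc decompression
    path and raise HDF5ExtError. A single worker sidesteps that.
    """
    if not on_ram and not cuda_available:
        return 1
    return default
```

Usage: on a CPU-only machine with lazy loading, `choose_n_worker(False, False)` yields 1 worker; any GPU or fully in-RAM configuration keeps the default of 16.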

Could you tell me if this solves your issue? Anyway, if this doesn't solve your problem, thanks a lot already for finding this edge case!

minasmz commented 2 years ago

Hi,

Thank you for the modification. I changed the number of workers to 1 and the code runs without the mentioned issue. But eventually, the process was killed due to excessive CPU usage!

hugoych commented 2 years ago

Hi, good to know! I can't do much without an error log, but I believe that this has to do with your hardware rather than our code! Thanks again for pointing out this edge case. I believe this particular issue is now solved, so I'll close it :)

minasmz commented 2 years ago

Thank you so much for all the help.