Closed jpcbertoldo closed 2 years ago
You did not post any error message.
To me it is not clear whether you are accessing all the distinct files independently. If you are accessing them through the virtual dataset, I do not think the reads are independent.
There is no error message in fact.
I only use the virtual dataset to figure out the links that I need to the individual chunk files.
When I actually read the data, I reference the datasets directly from their own files, so (as far as I understand) there is nothing linking them in my code -- unless something is being held open when I use the virtual sources (?).
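As a side note, h5py can report the per-file links directly from the virtual dataset via `Dataset.virtual_sources()`, which is one way to figure out the links and then bypass the VDS entirely. A minimal, self-contained sketch (the file and dataset names here are invented stand-ins for the real chunk files):

```python
import h5py
import numpy as np

# Build two tiny source files and a virtual dataset over them
# (hypothetical stand-ins for the real chunk files).
for i in range(2):
    with h5py.File(f"src_{i}.h5", "w") as f:
        f["data"] = np.full((3, 4), i, dtype="u2")

layout = h5py.VirtualLayout(shape=(2, 3, 4), dtype="u2")
for i in range(2):
    layout[i] = h5py.VirtualSource(f"src_{i}.h5", "data", shape=(3, 4))

with h5py.File("virtual.h5", "w") as f:
    f.create_virtual_dataset("data", layout)

# Recover the per-file links from the VDS...
with h5py.File("virtual.h5", "r") as f:
    sources = [(s.file_name, s.dset_name)
               for s in f["data"].virtual_sources()]

# ...then read each source file directly, never touching the VDS again.
arrays = []
for file_name, dset_name in sources:
    with h5py.File(file_name, "r") as f:
        arrays.append(f[dset_name][()])
```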
This is where the read operation happens, inside `h5py._hl.dataset.Dataset.__getitem__` (line 787):

- the function `h5s.create_simple` (link): I don't understand what this does;
- then there is the `self.id.read`, where `self.id` is set in `HLObject.__init__` (`Dataset` inherits from that) and in `Dataset.__init__`.

I'm not sure what these are supposed to be, but it's related to this: `h5py.h5d.DatasetID` (doc) (implementation in).
One question: where are your files ? NFS/GPFS or local ?
> One question: where are your files ? NFS/GPFS or local ?
I'm not sure, but if it helps I'm working on the "rnice" cluster at ESRF.
The files I used for the test are in `/data/id11/3dxrd/blc12852/id11/bmg_l1/bmg_l1_bmg_dct2/scan0002/`, and they are named `marana_0000.h5`, `marana_0001.h5`, ... Each one has the data at the path `entry_0000/ESRF-ID11/marana/data`.
Hi Joao,

Data access is not uniform on the cluster: depending on the computer it can go via NFS or GPFS, and the difference is already huge (a 4x difference). Moreover, there is a large overhead from using the virtual dataset. When performance matters, I always go straight to the Lima files (where the images actually are).

Some other considerations: you use `multiprocessing.Pool` to work around the GIL. That's OK, but having multiple processes implies serialization of the data to go from any worker to the master. I did some work (2 years back) on making h5py GIL-free (where it matters) so that the reading can be done using a `ThreadPool` instead of processes. Note: I did not benchmark it myself.
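A hedged sketch of that ThreadPool approach (the file names, dataset path, and sizes here are invented for the example; the point is that h5py can release the GIL during the actual HDF5 read, so threads overlap the I/O and no array is pickled between processes):

```python
from concurrent.futures import ThreadPoolExecutor

import h5py
import numpy as np

# Hypothetical miniature files shaped like the ones in the thread.
for i in range(4):
    with h5py.File(f"marana_{i:04d}.h5", "w") as f:
        f["entry_0000/ESRF-ID11/marana/data"] = np.full((5, 6), i, dtype="u2")

def read_one(path):
    # Recent h5py releases the GIL around the low-level HDF5 read,
    # so threads can overlap the I/O without serializing arrays.
    with h5py.File(path, "r") as f:
        return f["entry_0000/ESRF-ID11/marana/data"][()]

files = [f"marana_{i:04d}.h5" for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    frames = list(pool.map(read_one, files))

stack = np.stack(frames)
```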
Do you use compression for your data ?
Your data are uncompressed ... it makes the constraint on the networked filesystem even larger...
I tried to read your 28G of data from a computer connected with a 10G link to the storage, in NFS (hence sub-optimal) and this is what I got:
```
In [1]: import h5py, glob

In [2]: files = glob.glob("ma*.h5"); files.sort()

In [3]: %time big = [h5py.File(i)["entry_0000/measurement/data"][()] for i in files]
CPU times: user 100 ms, sys: 12.8 s, total: 12.9 s
Wall time: 41.4 s

In [4]: sum(i.nbytes for i in big)/41.4/1e6
Out[4]: 729.4441739130435
```
I took care to empty all caches on this computer (by remounting the NFS filesystem) and obtained 730 MB/s over a 10G link (which tops out at 1000 MB/s). To me, there is no need to go to parallel reads. The cluster now has 25G cards that are maybe slightly faster. To actually go faster, you should read from multiple computers via GPFS, or copy those data once and for all to some local working area.
So my conclusion: you are saturating the network link. Consider compression if you want faster data access ...
I tried to compress your dataset with bitshuffle-LZ4 and the compression ratio is only 2.5. It won't help you much, sorry.
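For completeness, a sketch of writing a compressed copy. This uses h5py's built-in gzip + shuffle filters as a stand-in; the bitshuffle-LZ4 filter mentioned above is shipped by the `hdf5plugin` package and would be passed as `**hdf5plugin.Bitshuffle()` instead. The data here is artificially repetitive, so it compresses far better than the real frames did:

```python
import h5py
import numpy as np

# Repetitive detector-like data; real frames only reached ~2.5x.
data = np.tile(np.arange(1024, dtype="u2"), (512, 1))

with h5py.File("compressed.h5", "w") as f:
    # Built-in gzip + byte shuffle as a stand-in for bitshuffle-LZ4
    # (which would come from the hdf5plugin package).
    f.create_dataset("data", data=data, chunks=(64, 1024),
                     compression="gzip", shuffle=True)

with h5py.File("compressed.h5", "r") as f:
    stored = f["data"].id.get_storage_size()  # bytes actually allocated
    back = f["data"][()]

ratio = data.nbytes / stored
```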
Hi Jerome,

Thanks for that info, I wasn't aware of that.
1) Sorry, I don't know what these 'lima files' are. Is it possible to do that with the files I tested on?
> I did some work (2 years back) on making h5py GIL-free (where it matters) so that the reading can be done using a ThreadPool instead of processes.
2) Could you share that?
3) I don't think the issue is the inter-process communication, because (in the method I used) there is a block of shared memory holding the data, and the parent process only sends indexes to the children (a few `int`s).
Looking at the profile (acquired with viztracer), it doesn't look like the parallelization overhead is relevant:
> I tried to read your 28G of data from a computer connected with a 10G link to the storage, in NFS (hence sub-optimal) and this is what I got:
> In [1]: import h5py, glob ... So my conclusion: you are saturating the network link. Consider compression if you want faster data access ...
> I tried to compress your dataset with bitshuffle-LZ4 and the compression ratio is only 2.5. It won't help you much, sorry.
Thanks for all that info!
The work I did in h5py is now integrated and deployed as part of h5py 3 ... so you are probably already using it.
Closing, as there is no more discussion and the topic is more about hdf5/h5py than silx.
[silx.io.utils.get_data] h5py cannot read different files in parallel?
Context
data
I have this .h5 with a dataset composed of 36 virtual external sources, each pointing to a ("concrete") dataset of 100 frames in other separate .h5 files.
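As a reference for reproducing this setup, a dataset like the one described can be built with h5py's virtual dataset API. This miniature sketch shrinks the sizes (3 files of 4 frames instead of 36 of 100) and invents the file names:

```python
import h5py
import numpy as np

# Miniature version of the layout described: N source files, each
# holding F frames, assembled into one virtual dataset.
n_files, n_frames, h, w = 3, 4, 8, 8
for i in range(n_files):
    with h5py.File(f"chunk_{i}.h5", "w") as f:
        f["entry_0000/ESRF-ID11/marana/data"] = np.full(
            (n_frames, h, w), i, dtype="u2")

layout = h5py.VirtualLayout(shape=(n_files * n_frames, h, w), dtype="u2")
for i in range(n_files):
    src = h5py.VirtualSource(f"chunk_{i}.h5",
                             "entry_0000/ESRF-ID11/marana/data",
                             shape=(n_frames, h, w))
    layout[i * n_frames:(i + 1) * n_frames] = src

with h5py.File("master.h5", "w") as f:
    f.create_virtual_dataset("data", layout)

# Reading through the VDS looks like reading one big dataset.
with h5py.File("master.h5", "r") as f:
    full = f["data"][()]
```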
problem
I'm using `silx.io.utils.get_data` to get this data in a script, and this read operation turns out to be 1/3 of the time I spend in the script.

Solution (?)
I thought: since these .h5 are different files, I could read them individually in parallel processes, dumping their returned values (`ndarray`s) into a shared-memory `ndarray`.

my implementation
Here is the code that I tried to run (it should be accessible for the folks at ESRF, I suppose), and it looks roughly like this:
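The snippet itself is not reproduced here; a minimal sketch of the approach described (workers write straight into a shared-memory array, the parent only sends each worker an index), with invented file names and shrunken sizes, might look like:

```python
import multiprocessing as mp
from multiprocessing.shared_memory import SharedMemory

import h5py
import numpy as np

# Hypothetical stand-ins for the real detector files (sizes shrunk).
N_FILES, N_FRAMES, H, W = 4, 2, 8, 8
SHAPE = (N_FILES * N_FRAMES, H, W)
for i in range(N_FILES):
    with h5py.File(f"marana_{i:04d}.h5", "w") as f:
        f["entry_0000/ESRF-ID11/marana/data"] = np.full(
            (N_FRAMES, H, W), i, dtype="u2")

# One shared block big enough for every frame (u2 = 2 bytes each).
shm = SharedMemory(create=True, size=int(np.prod(SHAPE)) * 2)

def read_into_shm(idx):
    # The parent only sends this integer; with a fork start method the
    # worker inherits the shared mapping, and the frames land directly
    # in shared memory, so nothing big is pickled between processes.
    out = np.ndarray(SHAPE, dtype="u2", buffer=shm.buf)
    with h5py.File(f"marana_{idx:04d}.h5", "r") as f:
        out[idx * N_FRAMES:(idx + 1) * N_FRAMES] = \
            f["entry_0000/ESRF-ID11/marana/data"][()]

ctx = mp.get_context("fork")  # Unix-only; matches the cluster setting
with ctx.Pool(2) as pool:
    pool.map(read_into_shm, range(N_FILES))

result = np.ndarray(SHAPE, dtype="u2", buffer=shm.buf).copy()
shm.close()
shm.unlink()
```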
but it doesn't work
I tested using 1, 2, 8, and 16 processes (`nprocs`) on a 16-core machine, and they all took about 30 seconds. So I profiled some runs and here is what I found:

- `nprocs = 1`: the function `get_data` -- dominated more specifically by `h5py._hl.dataset.Dataset.__getitem__` -- takes about 1 second
- `nprocs = 2`: the same function takes ~1.6 seconds
- `nprocs = 16`: it takes 11 seconds

This happens even though the calls are in different processes, each reading distinct files.
Conclusion
I'm guessing there is some kind of lock at the library level (`silx.io` or `h5py`?). Is that correct?

More importantly: is there a workaround for this?
further details
Here is a link to my personal notes (so not super didactic) from when I was trying to debug this.
The relevant section is "reading a dataset with silx in parallel".