silx-kit / silx

silx toolkit
http://www.silx.org/doc/silx/latest/
MIT License

[silx.io.utils.get_data] h5py cannot read *different files* in parallel? #3464

Closed jpcbertoldo closed 2 years ago

jpcbertoldo commented 3 years ago

[silx.io.utils.get_data] h5py cannot read different files in parallel?

Context

data

I have this .h5 with a dataset composed of 36 virtual external sources, each pointing to a ("concrete") dataset of 100 frames in a separate .h5 file.
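For context, a virtual dataset like this can be built with h5py roughly as below. This is only a sketch: the file names, dataset path, dtype, and frame shape (ny, nx) are placeholders, not necessarily my actual layout.

  import h5py

  ny, nx = 2048, 2048  # placeholder frame shape
  nchunks, chunk_zsize = 36, 100

  layout = h5py.VirtualLayout(shape=(nchunks * chunk_zsize, ny, nx), dtype="uint16")
  for i in range(nchunks):
      # each external source file holds one 100-frame dataset
      vsource = h5py.VirtualSource(
          f"marana_{i:04d}.h5",  # placeholder file names
          "entry_0000/ESRF-ID11/marana/data",
          shape=(chunk_zsize, ny, nx),
      )
      layout[i * chunk_zsize:(i + 1) * chunk_zsize] = vsource

  with h5py.File("master.h5", "w") as h5f:
      h5f.create_virtual_dataset("data", layout)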

problem

I'm using silx.io.utils.get_data to load this data in a script, and this read operation accounts for about 1/3 of the script's runtime.

Solution (?)

I thought: since these .h5 are different files, I could read them individually in parallel processes, dumping their returned values (ndarrays) into a shared memory ndarray.

my implementation

Here is the code that I tried to run (it should be accessible to folks at ESRF, I suppose); it looks roughly like this:


  # a SIMPLIFIED VERSION of the code I experimented with
  from multiprocessing import Pool, RawArray

  import numpy as np
  import silx.io

  def _worker_init(shared_buffer, shape):
      # expose the shared buffer to each subprocess as a 3D ndarray
      global data_shared
      data_shared = np.frombuffer(shared_buffer).reshape(shape)

  def _worker_get_data(chunk_url, zslice):
      # data_shared gets here from the initializer function
      data_shared[zslice, :, :] = silx.io.get_data(chunk_url)

  mp_data_shared = RawArray(c_data_dtype, data_size)
  data_shared = np.frombuffer(mp_data_shared).reshape(data_shape)  # becomes 3D

  # the z-index start/stop of each chunk's window
  # chunk = each 100-frame dataset in an individual .h5
  zslices = [slice(i * chunk_zsize, (i + 1) * chunk_zsize) for i in range(nchunks)]

  # _worker_init passes the shared data buffer so that every subprocess
  # has access to the same memory-shared ndarray
  with Pool(nprocs, initializer=_worker_init, initargs=(mp_data_shared, data_shape)) as pool:
      # chunk URLs are given to `get_data`; starmap blocks until all reads finish
      pool.starmap(_worker_get_data, zip(chunk_urls, zslices))

but it doesn't work

I tested using 1, 2, 8, and 16 processes (nprocs) on a 16-core machine, and they all took about 30 seconds.

So I profiled some runs and here is what I found:

  1. nprocs = 1: the function get_data -- whose time is dominated by h5py._hl.dataset.Dataset.__getitem__ -- takes about 1 second
  2. nprocs = 2: the same function takes ~1.6 seconds
  3. nprocs = 16: it takes 11 seconds

This happens even though the calls are in different processes, each reading distinct files.

Conclusion

I'm guessing there is some kind of lock at the library (silx.io or h5py?) level. Is that correct?

More importantly: is there a workaround for this?

further details

Here is a link to my personal notes (so not super didactic) from when I was trying to debug this.

The relevant section is "reading a dataset with silx in parallel".

vasole commented 3 years ago

You did not post any error message.

To me it is not clear whether you are accessing all the distinct files independently. If you are accessing them through the virtual dataset, I do not think the reads are independent.

jpcbertoldo commented 3 years ago

There is no error message in fact.

I only use the virtual dataset to figure out the links that I need to the individual chunk files.

When I read the data I reference the chunks directly from their own files, so (as far as I understand) there is nothing linking them in my code -- unless something is being held open when I use the virtual sources (?).
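For what it's worth, this is roughly how the links can be extracted from the virtual dataset (a sketch; the master file name and dataset path are placeholders):

  import h5py
  from silx.io.url import DataUrl

  with h5py.File("master.h5", "r") as h5f:  # placeholder file/dataset names
      dset = h5f["data"]
      # each virtual source maps a region of the VDS to a (file, dataset) pair
      chunk_urls = [
          DataUrl(file_path=vs.file_name, data_path=vs.dset_name).path()
          for vs in dset.virtual_sources()
      ]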

jpcbertoldo commented 3 years ago

This is where the read operation happens inside h5py._hl.dataset.Dataset.__getitem__ (line 787).

[screenshot of the h5py source code]

The function h5s.create_simple (link): I don't understand what this does.

Then there is the self.id.read call, where self.id is

[screenshot of the h5py source code]

in HLObject.__init__ (Dataset inherits from that)

[screenshot of the h5py source code]

in Dataset.__init__

[screenshot of the h5py source code]

I'm not sure what these are supposed to be, but it's related to this: h5py.h5d.DatasetID (doc) (implementation in)

kif commented 3 years ago

One question: where are your files? NFS/GPFS or local?

jpcbertoldo commented 3 years ago

One question: where are your files? NFS/GPFS or local?

I'm not sure, but if it helps, I'm working on the "rnice" cluster at ESRF.

The files I ran the test on are in /data/id11/3dxrd/blc12852/id11/bmg_l1/bmg_l1_bmg_dct2/scan0002/, and they are named marana_0000.h5, marana_0001.h5, ...

Each has the data at the link entry_0000/ESRF-ID11/marana/data.

kif commented 3 years ago

Hi Joao,

Data access is not uniform on the cluster: depending on the computer it can go via NFS or GPFS, and the difference is already huge (a 4x difference). Moreover, there is a large overhead from using VDS. When performance matters, I always go straight to the Lima files (where the images actually are).

Some other considerations: you use multiprocessing.Pool to work around the GIL. That's OK, but having multiple processes implies serialization of data to go from any worker to the master. I did some work (2 years back) on making h5py GIL-free (where it matters) so that the reading can be done using a ThreadPool instead of processes. Note that I did not benchmark it myself.
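For illustration, the ThreadPool variant could look roughly like this (a sketch, not benchmarked; data_shape, data_dtype, filenames, and zslices are assumed from the rest of the thread):

  from concurrent.futures import ThreadPoolExecutor

  import h5py
  import numpy as np

  def read_chunk(filename, zslice):
      # h5py releases the GIL around the blocking read, so threads overlap
      with h5py.File(filename, "r") as h5f:
          data[zslice] = h5f["entry_0000/ESRF-ID11/marana/data"][()]

  # threads share the address space: no shared-memory machinery needed
  data = np.empty(data_shape, dtype=data_dtype)
  with ThreadPoolExecutor(max_workers=16) as executor:
      list(executor.map(read_chunk, filenames, zslices))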

Do you use compression for your data?

kif commented 3 years ago

Your data are uncompressed... it makes the strain on the networked filesystem even larger...

kif commented 3 years ago

I tried to read your 28G of data from a computer connected to the storage with a 10G link, over NFS (hence sub-optimal), and this is what I got:


In [1]: import h5py, glob

In [2]: files = glob.glob("ma*.h5"); files.sort()

In [3]: %time big = [h5py.File(i)["entry_0000/measurement/data"][()] for i in files]
CPU times: user 100 ms, sys: 12.8 s, total: 12.9 s
Wall time: 41.4 s

In [4]: sum(i.nbytes for i in big)/41.4/1e6
Out[4]: 729.4441739130435

I took care to empty all caches on this computer (I remounted the NFS filesystem) and obtained 730 MB/s over a 10G link (which tops out at 1000 MB/s). To me, there is no need to go to parallel reads. The cluster now has 25G cards that may be slightly faster. To actually go faster, you should read from multiple computers via GPFS, or copy those data once and for all to some local working area.

So my conclusion: you are saturating the network link. Consider compression if you want faster data access...

kif commented 3 years ago

I tried to compress your dataset with bitshuffle-LZ4 and the compression ratio is only 2.5. It won't help you much, sorry.
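For reference, recompressing one chunk file with bitshuffle-LZ4 can be done via the hdf5plugin package, roughly like this (a sketch; the output file name and per-frame chunking are assumptions):

  import h5py
  import hdf5plugin  # registers the bitshuffle filter with HDF5

  with h5py.File("marana_0000.h5", "r") as src, \
          h5py.File("marana_0000_bslz4.h5", "w") as dst:
      data = src["entry_0000/ESRF-ID11/marana/data"][()]
      dst.create_dataset(
          "data",
          data=data,
          chunks=(1,) + data.shape[1:],  # one frame per HDF5 chunk
          **hdf5plugin.Bitshuffle(),  # bitshuffle + LZ4 by default
      )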

jpcbertoldo commented 3 years ago

Hi Jerome, thanks for that info; I wasn't aware of that.

1) Sorry, I don't know what these 'Lima files' are. Is it possible to do that with the files I tested on?

I did some work (2years back) on making h5py GIL-free (where it matters) so that the reading can be done using a ThreadPool instead of processes.

2) Could you share that?

3) I don't think the issue is inter-process communication, because (in the method I used) there is a block of shared memory holding the data and the parent process only sends indexes to the children (a few ints).

Looking at the profile (acquired with viztracer) it doesn't look like the parallelization overhead is relevant:

[screenshot of the viztracer profile]

link to the full profile in html

jpcbertoldo commented 3 years ago

I tried to read your 28G of data from a computer connected with a 10G link to the storage, in NFS (hence sub-optimal) and this is what I got:

In [1]: import h5py, glob
...
So my conclusion: you are saturating the network link. Consider compression if you want faster data access ...

I tried to compress your dataset with bitshuffle-LZ4 and the compression ratio is only 2.5. It won't help you much, sorry.

Thanks for all that info!

kif commented 3 years ago

The work I did in h5py is now integrated and deployed as part of h5py 3... so you are probably using it.

t20100 commented 2 years ago

Closing as there is no more discussion and the topic is more about hdf5/h5py than silx.