hdmf-dev / hdmf-zarr

Zarr I/O backend for HDMF
https://hdmf-zarr.readthedocs.io/

Add parallel zarr #111

Closed. CodyCBakerPhD closed this 1 year ago.

CodyCBakerPhD commented 1 year ago

Fixes hdmf-dev/hdmf-zarr#101

Motivation

Zarr supports efficient parallelization, but enabling it seamlessly with only a single argument (number_of_jobs at io.write) took a bit of effort.

Currently seeing progressive speedups with the attached dummy script as the number of jobs increases; on the DANDI Hub, a single CPU takes ~160 s.

[Screenshots, 2023-07-30: write timings for increasing numbers of jobs]

Will make a full averaged plot over the number of jobs to use for reference

Opening in draft while I assess what is still necessary and what can still be optimized in terms of worker/job initialization.

Also, I will have to think about how to add tests; I suppose just adding some that use 2 jobs and making sure it works should be enough.

How to test the behavior?

import numpy as np

from hdmf_zarr import NWBZarrIO
from pynwb.testing.mock.file import mock_NWBFile
from pynwb.testing.mock.base import mock_TimeSeries
from neuroconv.tools.hdmf import SliceableDataChunkIterator

number_of_jobs = 1  # increase according to the screenshots above

# Pre-existing raw recording on disk (~2.75 GB of int16); adjust the path as needed
dat_file_path = "/home/jovyan/performance_tests/example_data.dat"
n_frames = 30000 * 60 * 2
n_channels = 384
data_shape = (n_frames, n_channels)
dtype = "int16"
memory_map = np.memmap(filename=dat_file_path, dtype=dtype, mode="r", shape=data_shape)

# Wrap the memmap in an iterator; shrink the per-job buffer as the number of
# jobs grows so total memory usage stays roughly constant
nwbfile = mock_NWBFile()
time_series = mock_TimeSeries(data=SliceableDataChunkIterator(data=memory_map, buffer_gb=0.1 / number_of_jobs))
nwbfile.add_acquisition(time_series)

# Parallel write is enabled by the single `number_of_jobs` argument
zarr_top_level_path = f"/home/jovyan/Downloads/example_parallel_zarr_{number_of_jobs}.nwb"
with NWBZarrIO(path=zarr_top_level_path, mode="w") as io:
    io.write(nwbfile, number_of_jobs=number_of_jobs)


oruebel commented 1 year ago

I suppose just adding some that use 2 jobs and making sure it works should be enough

Focusing on "functions correctly" in the unit tests is fine. Doing tests to check for performance is tricky.
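For reference, a minimal functional round-trip test along those lines might look like the sketch below. It assumes the pytest tmp_path fixture and the number_of_jobs argument this PR introduces; a plain in-memory array may not exercise the parallel path the way an iterator-wrapped dataset does, but correctness of the written data can still be checked:

import numpy as np
from hdmf_zarr import NWBZarrIO
from pynwb.testing.mock.file import mock_NWBFile
from pynwb.testing.mock.base import mock_TimeSeries

def test_parallel_write_two_jobs(tmp_path):
    # Small array keeps the test fast; we only check "functions correctly"
    data = np.arange(2000, dtype="int16").reshape(500, 4)
    nwbfile = mock_NWBFile()
    time_series = mock_TimeSeries(data=data)
    nwbfile.add_acquisition(time_series)

    path = str(tmp_path / "parallel_write.nwb")
    with NWBZarrIO(path=path, mode="w") as io:
        io.write(nwbfile, number_of_jobs=2)

    # Round-trip: data written with 2 jobs must match the source exactly
    with NWBZarrIO(path=path, mode="r") as io:
        read_nwbfile = io.read()
        read_data = read_nwbfile.acquisition[time_series.name].data[:]
        np.testing.assert_array_equal(read_data, data)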

enabling it seamlessly with only a single argument (number_of_jobs at io.write) took a bit of effort.

Thanks for the hard work to make this easy for users.

It would be great to also add a short gallery tutorial for parallel write here https://github.com/hdmf-dev/hdmf-zarr/tree/dev/docs/gallery

A nice-to-have feature would be to enhance ZarrIO.export (https://github.com/hdmf-dev/hdmf-zarr/blob/d47fc8249e76a371fff990b07218a088e59ea590/src/hdmf_zarr/backend.py#L231) to use parallel write when converting from HDF5 to Zarr. One way to achieve this would be to automatically wrap datasets in DataChunkIterator in ZarrIO.write_dataset. This is probably beyond this PR and can be addressed later in a separate issue. Just wanted to mention it.
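As a rough sketch of that idea (the wrap_if_hdf5 helper and the place it would be called from are hypothetical, not existing hdmf-zarr API):

import h5py
from hdmf.data_utils import DataChunkIterator

def wrap_if_hdf5(data, buffer_size=1000):
    # Hypothetical helper: stream HDF5-backed datasets through a
    # DataChunkIterator so a parallel write path could pick them up,
    # instead of loading the full dataset into memory during export
    if isinstance(data, h5py.Dataset):
        return DataChunkIterator(data=data, buffer_size=buffer_size)
    return data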

bendichter commented 1 year ago

Good idea, @oruebel I'll create a follow-up issue for that

oruebel commented 1 year ago

BTW, if pickling is an issue https://github.com/cloudpipe/cloudpickle may be useful

CodyCBakerPhD commented 1 year ago

BTW, if pickling is an issue https://github.com/cloudpipe/cloudpickle may be useful

Yeah, I looked into this after the dev days; cloudpickle definitely seems useful in general. I've seen cases before of custom .pkl files that can't be loaded in environments different from the ones they were saved in, which is what that package seems to emphasize. And the general approach of pickling by value definitely seems preferable to pickling by reference.
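A minimal illustration of the pickling-by-value behavior (assuming only that cloudpickle is installed):

import pickle
import cloudpickle

scale = lambda x: 2 * x  # standard pickle cannot serialize a lambda by reference

payload = cloudpickle.dumps(scale)  # cloudpickle serializes it by value
restored = pickle.loads(payload)    # plain pickle can load the resulting bytes
assert restored(4) == 8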

However, the 'problem' that is more relevant to us in this context is dealing with class types that do have pickling methods defined (strictly speaking), but ones that are less efficient than they could be. In particular, numpy arrays/memmaps and zarr datasets serialize their data contents by default instead of passing the minimal information (store or file paths) needed to reinstantiate the object in another worker's memory.
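A sketch of the cheaper alternative for a memmap (the PickleableMemmap wrapper is hypothetical; the point is that __reduce__ ships the file path and layout rather than the array contents, assuming the file is reachable from every worker):

import numpy as np

class PickleableMemmap:
    """Hypothetical wrapper that pickles by reference to the file on disk."""

    def __init__(self, path, dtype, shape):
        self._path, self._dtype, self._shape = path, dtype, shape
        self._memmap = np.memmap(path, dtype=dtype, mode="r", shape=shape)

    def __getitem__(self, selection):
        return self._memmap[selection]

    def __reduce__(self):
        # Serialize only the recipe to reopen the memmap in another worker,
        # not the ~GBs of data it points to
        return (self.__class__, (self._path, self._dtype, self._shape))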

It's also not really even a 'problem' here, strictly speaking, since this PR only assumes the iterator has such methods defined and builds around that; it's really the burden of the creator of such iterators (mainly NeuroConv, https://github.com/catalystneuro/neuroconv/pull/536) to make them efficient.

CodyCBakerPhD commented 1 year ago

OK, the tests have more or less taken shape now.

Let me know if there's more to add

oruebel commented 1 year ago

@CodyCBakerPhD sorry for the delay in responding to your questions. I hope the comments I added help clarify the points you raised. Please ping me in case I missed anything.

oruebel commented 1 year ago

@CodyCBakerPhD also, could you please add an entry for these changes to CHANGELOG.md?

CodyCBakerPhD commented 1 year ago

Replaced by #118.