hdmf-dev / hdmf-zarr

Zarr I/O backend for HDMF
https://hdmf-zarr.readthedocs.io/

Parallel zarr with Queue instance attributes #118

Closed CodyCBakerPhD closed 1 year ago

CodyCBakerPhD commented 1 year ago

fix #101

replace #111

Motivation

Zarr supports efficient parallelization, but enabling it seamlessly with only a single argument (number_of_jobs at io.write) took a bit of effort.

Currently seeing progressive speedups with the attached dummy script as the number of jobs increases; on the DANDI Hub, ~160 s for 1 CPU (timings for higher job counts are in the screenshots below).

[Three screenshots (2023-07-30) showing write timings for increasing numbers of jobs]

Will make a full averaged plot over the number of jobs to use for reference.

Opening in draft while I assess what is still necessary and what can still be optimized in terms of worker/job initialization.

Also will have to think about how to add tests; I suppose adding some that use 2 jobs and making sure the write works should be enough.
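For instance, a minimal sketch of such a 2-job test (hypothetical: the test name, the pytest tmp_path fixture, and the round-trip assertion are my assumptions, not the PR's actual tests):

import numpy as np

from hdmf_zarr import NWBZarrIO
from pynwb.testing.mock.file import mock_NWBFile
from pynwb.testing.mock.base import mock_TimeSeries
from neuroconv.tools.hdmf import SliceableDataChunkIterator

def test_roundtrip_with_two_jobs(tmp_path):
    # Wrap a small in-memory array in an iterator so the parallel write path is exercised
    data = np.arange(100 * 384, dtype="int16").reshape(100, 384)
    nwbfile = mock_NWBFile()
    time_series = mock_TimeSeries(name="TimeSeries", data=SliceableDataChunkIterator(data=data))
    nwbfile.add_acquisition(time_series)

    path = str(tmp_path / "example_parallel_write.nwb")
    with NWBZarrIO(path=path, mode="w") as io:
        io.write(nwbfile, number_of_jobs=2)

    # Confirm the data written with 2 jobs reads back intact
    with NWBZarrIO(path=path, mode="r") as io:
        read_nwbfile = io.read()
        np.testing.assert_array_equal(read_nwbfile.acquisition["TimeSeries"].data[:], data)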

How to test the behavior?

import numpy as np

from hdmf_zarr import NWBZarrIO
from pynwb.testing.mock.file import mock_NWBFile
from pynwb.testing.mock.base import mock_TimeSeries
from neuroconv.tools.hdmf import SliceableDataChunkIterator

number_of_jobs = 1  # increase to compare timings across job counts

# Assumes a pre-existing raw binary file at this path (e.g., on the DANDI Hub)
dat_file_path = "/home/jovyan/performance_tests/example_data.dat"
n_frames = 30000 * 60 * 2  # 2 minutes at 30 kHz
n_channels = 384
data_shape = (n_frames, n_channels)
dtype = "int16"
memory_map = np.memmap(filename=dat_file_path, dtype=dtype, mode="r", shape=data_shape)  # ~2.75 GB of data

nwbfile = mock_NWBFile()
# Split the fixed buffer budget across jobs so total memory use stays roughly constant
time_series = mock_TimeSeries(data=SliceableDataChunkIterator(data=memory_map, buffer_gb=0.1 / number_of_jobs))
nwbfile.add_acquisition(time_series)

zarr_top_level_path = f"/home/jovyan/Downloads/example_parallel_zarr_{number_of_jobs}.nwb"
with NWBZarrIO(path=zarr_top_level_path, mode="w") as io:
    io.write(nwbfile, number_of_jobs=number_of_jobs)
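To collect the timings quoted above, the final write can be wrapped in a simple timer (this wrapper is my addition for illustration, not part of the original script):

import time

start = time.time()
with NWBZarrIO(path=zarr_top_level_path, mode="w") as io:
    io.write(nwbfile, number_of_jobs=number_of_jobs)
print(f"number_of_jobs={number_of_jobs}: {time.time() - start:.1f} s")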


oruebel commented 1 year ago

ReadTheDocs shows the following error due to the added threadpoolctl requirement.

ERROR: Could not find a version that satisfies the requirement threadpoolctl==3.2.0 (from versions: 1.0.0, 1.1.0, 2.0.0, 2.1.0, 2.2.0, 3.0.0, 3.1.0)
ERROR: No matching distribution found for threadpoolctl==3.2.0

Maybe the readthedocs.yaml or the requirements.txt file need some adjustment.

CodyCBakerPhD commented 1 year ago

Maybe the readthedocs.yaml or the requirements.txt file need some adjustment.

I do believe this is an issue with the version of Python being used to build the docs - do you know which version that is?

You can also specify it explicitly in the config file like here: https://github.com/catalystneuro/neuroconv/blob/main/.readthedocs.yaml#L10-L11
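For context, threadpoolctl 3.2.0 dropped Python 3.7 support (the available-versions list in the error above is consistent with a 3.7 build). A minimal sketch of that kind of pin in .readthedocs.yaml (the build image and Python version here are illustrative, not hdmf-zarr's actual config):

version: 2
build:
  os: ubuntu-22.04
  tools:
    python: "3.11"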

An alternative, I suppose, would be to lower the exact version pin for the CI, but I defer to however y'all prefer to have that set up.

rly commented 1 year ago

Some tests define a new ZarrIO and call write_dataset directly. Because self.__dci_queue is initialized only in write (and now export), these tests fail because self.__dci_queue is None. If these tests are meant to exercise write_dataset in a unit-test fashion, they need to be adjusted so that write is called first or the __dci_queue variable is otherwise set. If they are meant to be integration tests, they need to call write instead of write_dataset, which users would not be calling directly.

CodyCBakerPhD commented 1 year ago

Ahh good catch: https://github.com/hdmf-dev/hdmf-zarr/blob/6c13e14927eea985d53174d8580224c97d65707a/src/hdmf_zarr/backend.py#L839-L840

Since the method is not marked as private, I'll just instantiate a standard non-parallel Queue at that point then.

oruebel commented 1 year ago

If these tests are meant to test write_dataset in a unit test fashion,

Those should be unit tests, since write_dataset is not a function that a user should ever call but one that is used internally. The tests should be adjusted to manually set the __dci_queue variable before calling write_dataset. Alternatively, we could add an if self.__dci_queue is None check at the beginning of write_dataset and set it there if it is not initialized (which may be safer).
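A minimal sketch of that guard, assuming ZarrIODataChunkIteratorQueue (the queue class in hdmf_zarr.utils) can be constructed with default arguments; the surrounding class is an illustrative stand-in, not the real backend code:

from hdmf_zarr.utils import ZarrIODataChunkIteratorQueue

class ZarrIO:  # stand-in for the real backend class
    def __init__(self):
        self.__dci_queue = None  # normally initialized in write()/export()

    def write_dataset(self, **kwargs):
        if self.__dci_queue is None:
            # Fall back to a standard, non-parallel queue when write() was not called first
            self.__dci_queue = ZarrIODataChunkIteratorQueue()
        # ... continue with the usual dataset-writing logic ...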

oruebel commented 1 year ago

Since the method is not marked as private I'll just instantiate a standard non-parallel Queue at that point then

Sorry, I didn't see your comment until after I posted my other response. I agree, "instantiate a standard non-parallel Queue at that point then" is the way to go.

CodyCBakerPhD commented 1 year ago

Surprised that 3.7 is still supported here - is there a timeline for when that will be dropped?

CodyCBakerPhD commented 1 year ago

Otherwise, the currently failing CI is, I believe, due to the version pin hdmf==3.5.4; this feature requires changes that are only available in hdmf>=3.9.0.

oruebel commented 1 year ago

Surprised that 3.7 is still supported here - is there a timeline for when that will be dropped?

This is due to the pin on the HDMF version. Once we have the issue with references on export resolved, we'll update the HDMF version, and then we can also update the tests. @mavaylon1 is working on the issue.

oruebel commented 1 year ago

Otherwise, the currently failing CI is, I believe, due to the version pin hdmf==3.5.4; this feature requires changes that are only available in hdmf>=3.9.0.

To get the tests to run for now, you could just increase the hdmf version on this PR so we can see the CI running. There will be a couple of tests that fail on export, but at least we can then see in the CI that everything else is working. Aside from the bug on export, I believe you can safely use the current version of HDMF without having to change anything else in the code.
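For example, the pin would change along these lines (the file name and exact pin style are assumptions about the repo layout):

# requirements.txt (sketch): raise the pin so the parallel-write changes in HDMF are available
hdmf>=3.9.0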

oruebel commented 1 year ago

@CodyCBakerPhD with #120 merged, the dev branch now supports the latest HDMF. Could you sync this PR with the dev branch to see if that fixes the failing tests, so that we can move forward with merging this PR as well?

codecov-commenter commented 1 year ago

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (c262481) 84.73% compared to head (9d84044) 85.66%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##              dev     #118      +/-   ##
==========================================
+ Coverage   84.73%   85.66%   +0.92%
==========================================
  Files          12       13       +1
  Lines        2903     3139     +236
==========================================
+ Hits         2460     2689     +229
- Misses        443      450       +7
```

| Files | Coverage Δ | |
|---|---|---|
| [src/hdmf\_zarr/backend.py](https://app.codecov.io/gh/hdmf-dev/hdmf-zarr/pull/118?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=hdmf-dev#diff-c3JjL2hkbWZfemFyci9iYWNrZW5kLnB5) | `90.41% <100.00%> (+0.07%)` | :arrow_up: |
| [tests/unit/test\_parallel\_write.py](https://app.codecov.io/gh/hdmf-dev/hdmf-zarr/pull/118?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=hdmf-dev#diff-dGVzdHMvdW5pdC90ZXN0X3BhcmFsbGVsX3dyaXRlLnB5) | `98.57% <98.57%> (ø)` | |
| [src/hdmf\_zarr/utils.py](https://app.codecov.io/gh/hdmf-dev/hdmf-zarr/pull/118?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=hdmf-dev#diff-c3JjL2hkbWZfemFyci91dGlscy5weQ==) | `95.79% <95.32%> (-0.96%)` | :arrow_down: |


CodyCBakerPhD commented 1 year ago

@oruebel Done, not sure what's up with the coverage workflows though