hdmf-dev / hdmf-zarr

Zarr I/O backend for HDMF
https://hdmf-zarr.readthedocs.io/

[Bug]: Iterative write for compound reference datasets takes 5ever #144

Closed sneakers-the-rat closed 7 months ago

sneakers-the-rat commented 9 months ago

What happened?

hello hello! I'm checking out how you're structuring the zarr export so I can mimic it, and on my first try I wasn't able to complete an export: after ~10 minutes I had only written ~6MB of a ~3GB file, so I got to profiling. I searched for an existing issue but didn't find one; sorry if this has been raised already.

The problem is here: https://github.com/hdmf-dev/hdmf-zarr/blob/9c3b4bee013c825e2cdaccbe443009f88302803e/src/hdmf_zarr/backend.py#L1021-L1026

where each row has to be separately serialized on write.

I see this behavior elsewhere, like in __list_fill__, so it seems to be a general problem with writing by index into object-type zarr arrays. When I tried to collect all the new items in a list and write them in one call like dset[...] = new_items, I got a shape error (and couldn't fix it by setting the shape correctly in the require_dataset call).

What did work, though, is doing this:

if len(refs) > 0:
    # create the dataset with the full 2-D shape (n_rows, n_cols) up front
    dset = parent.require_dataset(name,
                                  shape=(len(data), len(data[0])),
                                  dtype=object,
                                  object_codec=self.__codec_cls(),
                                  **options['io_settings'])
    self._written_builders.set_written(builder)  # record that the builder has been written
    dset.attrs['zarr_dtype'] = type_str

    # resolve all references in memory first ...
    new_items = []
    for item in data:
        new_item = list(item)
        for i in refs:
            new_item[i] = self.__get_ref(item[i], export_source=export_source)
        new_items.append(new_item)

    # ... then write the whole object array in a single call instead of once per row
    dset[...] = np.array(new_items)

The only difference I could see is that, when accessing the data again, it returns an (object-dtype) numpy array rather than a Python list.
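
For reference, here's a minimal standalone sketch of that bulk-write pattern outside of hdmf-zarr (assuming zarr v2 with the numcodecs JSON object codec; the group/dataset names and row contents are made up for illustration):

import numpy as np
import zarr
from numcodecs import JSON

group = zarr.group()  # in-memory store
rows = [[0, 5, {"source": ".", "path": "/acquisition/ts0"}],
        [5, 5, {"source": ".", "path": "/acquisition/ts1"}]]

dset = group.require_dataset("refs", shape=(len(rows), len(rows[0])),
                             dtype=object, object_codec=JSON())

# build the full (n_rows, n_cols) object array in memory, then write once
arr = np.empty((len(rows), len(rows[0])), dtype=object)
arr[...] = rows
dset[...] = arr

print(type(dset[...]), dset[...].shape)  # numpy object array, shape (2, 3)
print(dset[:, 2])                        # column indexing works on the 2-D array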

The difference in timing is considerable (in my case, tool-breaking, which is why I'm raising this as a bug rather than a feature request). When converting this file, which has 9 compound datasets with references, putting a timer around the above block gives:

Array write (s)          Iterative write (s)
0.03479409217834473      1.675
0.010181903839111328     0.205
0.19667983055114746      62.746
1.0213229656219482       1808.36
2.000258207321167        (dnf, it's been 30m and I need to go home)
0.31616997718811035      ...
0.0024509429931640625    ...
0.3125300407409668       ...
0.01106715202331543      ...

I only spent ~20m or so profiling and debugging, so it doesn't replicate the current top-level behavior exactly (it creates an object array rather than a list of lists), but imo it might be better: the result is the same shape as the source dataset (so, in this case, an n x 3 array) and can be indexed as such, whereas the current implementation can't index by column since the array is just a one-dimensional array of serialized Python lists. It also looks like it could be improved further still by using a numpy structured dtype (zarr seems to be able to handle that, at least from the docs, but the require_dataset method doesn't like it), etc., but hopefully this is useful as a first pass.
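
For what it's worth, a rough sketch of what the structured-dtype variant could look like against plain zarr (assuming zarr v2; the field names and the choice to serialize the reference as a JSON string are illustrative only, not how hdmf-zarr represents compound data):

import json
import numpy as np
import zarr

# three columns: two integers plus a reference serialized as a fixed-width JSON string
row_dtype = np.dtype([("idx_start", "i4"), ("count", "i4"), ("timeseries", "U128")])
rows = np.array(
    [(0, 5, json.dumps({"source": ".", "path": "/acquisition/ts0"})),
     (5, 5, json.dumps({"source": ".", "path": "/acquisition/ts1"}))],
    dtype=row_dtype,
)

z = zarr.array(rows)                      # zarr.array/create accept numpy structured dtypes
print(z["idx_start"])                     # per-field (column) access
print(json.loads(z["timeseries"][0]))     # recover the reference dict for one row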

The "steps to reproduce" below is just what I put in a test script and ran cProfile on to diagnose:

Steps to Reproduce

from pynwb import NWBHDF5IO
from hdmf_zarr.nwb import NWBZarrIO

in_file = "/location/of/data/nwb/sub-738651046_ses-760693773.nwb"
out_file = "./test2.zarr"

with NWBHDF5IO(in_file, 'r', load_namespaces=True) as read_io:
    with NWBZarrIO(out_file, mode="w") as export_io:
        export_io.export(src_io=read_io, write_args={'link_data':False})

Traceback

(not really a traceback kinda problem)

Operating System

macOS

Python Executable

Python

Python Version

3.11

Package Versions

environment_for_issue.txt

oruebel commented 9 months ago

Thanks for the detailed issue.

Just for background, the reason the Zarr backend needs to inspect all the values here is that Zarr does not natively support object references. To work around this, we represent references as JSON objects that describe the reference so it can be resolved later.
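
As a rough illustration of that workaround with plain numcodecs (the "source"/"path" keys below are illustrative, not necessarily the exact reference schema):

import numpy as np
from numcodecs import JSON

codec = JSON()

# one compound row: two scalar columns plus a reference described as a JSON-able dict
row = np.array([0, 5, {"source": ".", "path": "/acquisition/ts0"}], dtype=object)

encoded = codec.encode(row)    # bytes of a JSON document
decoded = codec.decode(encoded)
print(decoded)                 # round-trips back to an object array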

Reducing the number of write operations by generating the whole dataset in memory first and then writing it in a single call (rather than writing each element one at a time) is definitely good practice. In most files the number of references is relatively small, but in this case it seems there are a lot of references, and if they are in compound datasets (e.g., a TimeSeriesReferenceVectorData) the problem is probably further exacerbated. We'll take a look, or if you want, feel free to create a PR with your fix that we can work on together.