hdmf-dev / hdmf-zarr

Zarr I/O backend for HDMF
https://hdmf-zarr.readthedocs.io/

[Bug]: Iterative write for compound reference datasets takes 5ever #144

Closed sneakers-the-rat closed 7 months ago

sneakers-the-rat commented 9 months ago

What happened?

hello hello! I'm checking out how you're structuring the zarr export so I can mimic it, and on my first try I wasn't able to complete an export: after ~10 minutes I had only written ~6MB of a ~3GB file, so I got to profiling. I searched for an existing issue but didn't find one; sorry if this has been raised already.

The problem is here: https://github.com/hdmf-dev/hdmf-zarr/blob/9c3b4bee013c825e2cdaccbe443009f88302803e/src/hdmf_zarr/backend.py#L1021-L1026

where each row has to be separately serialized on write.

I see this behavior elsewhere, like in __list_fill__, so it seems to be a general problem with writing by index into object-type zarr arrays. When I tried to collect all the new items in a list and write them in one call like dset[...] = new_items, I got a shape error (and couldn't fix it by setting the shape correctly in the require_dataset call).

What did work, though, is doing this:

if len(refs) > 0:
    # create the dataset with the full 2-D shape (n_rows, n_cols) up front
    dset = parent.require_dataset(name,
                                  shape=(len(data), len(data[0])),
                                  dtype=object,
                                  object_codec=self.__codec_cls(),
                                  **options['io_settings'])
    self._written_builders.set_written(builder)  # record that the builder has been written
    dset.attrs['zarr_dtype'] = type_str

    # resolve all references in memory first ...
    new_items = []
    for item in data:
        new_item = list(item)
        for i in refs:
            new_item[i] = self.__get_ref(item[i], export_source=export_source)
        new_items.append(new_item)

    # ... then write the whole object array in a single call instead of once per row
    dset[...] = np.array(new_items)

The only difference I could see is that, when accessing the data again, it returns an (object-dtype) numpy array rather than a Python list.
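
For reference, here's a minimal standalone sketch of that bulk-write pattern outside of hdmf-zarr (assuming zarr v2 with the numcodecs JSON object codec; the group/dataset names and row contents are made up for illustration):

import numpy as np
import zarr
from numcodecs import JSON

group = zarr.group()  # in-memory store
rows = [[0, 5, {"source": ".", "path": "/acquisition/ts0"}],
        [5, 5, {"source": ".", "path": "/acquisition/ts1"}]]

dset = group.require_dataset("refs", shape=(len(rows), len(rows[0])),
                             dtype=object, object_codec=JSON())

# build the full (n_rows, n_cols) object array in memory, then write once
arr = np.empty((len(rows), len(rows[0])), dtype=object)
arr[...] = rows
dset[...] = arr

print(type(dset[...]), dset[...].shape)  # numpy object array, shape (2, 3)
print(dset[:, 2])                        # column indexing works on the 2-D array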

The difference in timing is considerable (in my case, tool-breaking, which is why I'm raising this as a bug rather than a feature request). When converting this file, which has 9 compound datasets with references, putting a timer around the above block gives:

Array write (s)          Iterative write (s)
0.03479409217834473      1.675
0.010181903839111328     0.205
0.19667983055114746      62.746
1.0213229656219482       1808.36
2.000258207321167        (dnf, it's been 30m and I need to go home)
0.31616997718811035      ...
0.0024509429931640625    ...
0.3125300407409668       ...
0.01106715202331543      ...

I only spent ~20m or so profiling and debugging, so it doesn't replicate the current top-level behavior exactly (it creates an object array rather than a list of lists), but imo it might be better: the result is the same shape as the source dataset (so, in this case, an n x 3 array) and can be indexed as such, whereas the current implementation can't index by column since the array is just a one-dimensional array of serialized Python lists. It also looks like it could be improved further still by using a numpy structured dtype (zarr seems to be able to handle that, at least from the docs, but the require_dataset method doesn't like it), etc., but hopefully this is useful as a first pass.
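
For what it's worth, a rough sketch of what the structured-dtype variant could look like against plain zarr (assuming zarr v2; the field names and the choice to serialize the reference as a JSON string are illustrative only, not how hdmf-zarr represents compound data):

import json
import numpy as np
import zarr

# three columns: two integers plus a reference serialized as a fixed-width JSON string
row_dtype = np.dtype([("idx_start", "i4"), ("count", "i4"), ("timeseries", "U128")])
rows = np.array(
    [(0, 5, json.dumps({"source": ".", "path": "/acquisition/ts0"})),
     (5, 5, json.dumps({"source": ".", "path": "/acquisition/ts1"}))],
    dtype=row_dtype,
)

z = zarr.array(rows)                      # zarr.array/create accept numpy structured dtypes
print(z["idx_start"])                     # per-field (column) access
print(json.loads(z["timeseries"][0]))     # recover the reference dict for one row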

The "steps to reproduce" below is just what I put in a test script and ran cProfile on to diagnose:

Steps to Reproduce

from pynwb import NWBHDF5IO
from hdmf_zarr.nwb import NWBZarrIO

in_file = "/location/of/data/nwb/sub-738651046_ses-760693773.nwb"
out_file = "./test2.zarr"

with NWBHDF5IO(in_file, 'r', load_namespaces=True) as read_io:
    with NWBZarrIO(out_file, mode="w") as export_io:
        export_io.export(src_io=read_io, write_args={'link_data':False})

Traceback

(not really a traceback kinda problem)

Operating System

macOS

Python Executable

Python

Python Version

3.11

Package Versions

environment_for_issue.txt

oruebel commented 9 months ago

Thanks for the detailed issue.

Just for background, the reason the Zarr backend needs to inspect all the values here is that Zarr does not natively support object references. To work around this, we represent references as JSON objects that describe the reference so it can be resolved later.
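
As a rough illustration of that workaround with plain numcodecs (the "source"/"path" keys below are illustrative, not necessarily the exact reference schema):

import numpy as np
from numcodecs import JSON

codec = JSON()

# one compound row: two scalar columns plus a reference described as a JSON-able dict
row = np.array([0, 5, {"source": ".", "path": "/acquisition/ts0"}], dtype=object)

encoded = codec.encode(row)    # bytes of a JSON document
decoded = codec.decode(encoded)
print(decoded)                 # round-trips back to an object array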

Reducing the number of write operations by generating the whole dataset in memory first and then writing it in a single call (rather than writing each element one at a time) is definitely good practice. In most files the number of references is relatively small, but in this case it seems there are a lot of references, and if they are in compound datasets (e.g., a TimeSeriesReferenceVectorData) the problem is probably further exacerbated. We'll take a look, or if you want, feel free to create a PR with your fix that we can work on together.