hdmf-dev / hdmf-zarr

Zarr I/O backend for HDMF
https://hdmf-zarr.readthedocs.io/

[Bug]: reading unit table arrays via string keys much slower than reading as properties #141

Open dyf opened 8 months ago

dyf commented 8 months ago

What happened?

I am trying to read information from the units table in this file: s3://aind-open-data/ecephys_625098_2022-08-15_08-51-36_nwb_2023-05-16_16-28-23/ecephys_625098_2022-08-15_08-51-36_nwb/ecephys_625098_2022-08-15_08-51-36_experiment1_recording1.nwb.zarr.

I notice a large difference in read times when accessing units.spike_times vs units['spike_times']:

[Screenshot: timing output showing units['spike_times'][:] taking much longer to read than units.spike_times[:]]

This is surprising behavior.

Steps to Reproduce

from hdmf_zarr import NWBZarrIO
import time

nwb_file = 's3://aind-open-data/ecephys_625098_2022-08-15_08-51-36_nwb_2023-05-16_16-28-23/ecephys_625098_2022-08-15_08-51-36_nwb/ecephys_625098_2022-08-15_08-51-36_experiment1_recording1.nwb.zarr'

io = NWBZarrIO(nwb_file, "r")
nwbfile_read = io.read()

# Property access: units.spike_times
t0 = time.time()
spike_times = nwbfile_read.units.spike_times[:]
print(time.time() - t0)

# String-key access: units['spike_times']
t0 = time.time()
spike_times = nwbfile_read.units['spike_times'][:]
print(time.time() - t0)

Traceback

none

Operating System

Linux

Python Executable

Conda

Python Version

3.9

Package Versions

I am using @alejoe91's branch for reading from S3:

https://github.com/alejoe91/hdmf-zarr/tree/fix-linking-in-remote-zarr


oruebel commented 8 months ago

Thanks for the helpful issue. I had to modify your script slightly to add anonymous access, io = NWBZarrIO(nwb_file, "r", storage_options=dict(anon=True)), but am able to run it now. Just for reference, the script I'm using for profiling is below.

These calls are actually taking different code paths. If you print the type of the objects that are being sliced into via:

print(type(nwbfile_read.units.spike_times))
print(type(nwbfile_read.units['spike_times']))

you should see:

<class 'hdmf.common.table.VectorData'>
<class 'hdmf.common.table.VectorIndex'> 

I.e., units.spike_times and units['spike_times'] are different. This is because the "spike_times" column is indexed, i.e., the column is a ragged array that is represented by two datasets: 1) spike_times with the actual data values and 2) spike_times_index that segments the values to create the ragged array (i.e., a list of lists). Both of these are exposed as properties on the Units table, i.e., there are properties units.spike_times and units.spike_times_index. These properties exist to allow "low-level" access to the underlying data structures.

Selection via slicing (e.g., units['spike_times']), on the other hand, is meant to insulate the user from these low-level details and help with accessing elements in the table. I.e., when using the slicing syntax units['spike_times'], the class knows that this is a ragged column and uses the index column instead. I.e., the equivalent of units['spike_times'] would be units.spike_times_index.

As a result, the output of these operations is also different. When slicing into units.spike_times[:] (which is a VectorData with just the values), you get one large 1D numpy array of all the spike times of all the units. In contrast, when slicing into units['spike_times'] (or units.spike_times_index, which is a VectorIndex), you get a list of numpy arrays, where each array in the list holds the spike times of a single unit.
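
To make the shape difference concrete, here is a small standalone sketch (with made-up numbers, not values from the file above) of how the index column segments the flat data array into per-unit arrays:

import numpy as np

# Hypothetical flat spike times for 3 units (illustration only).
spike_times_data = np.array([0.1, 0.5, 0.9, 1.2, 2.0, 2.3, 2.7])
# The index column stores the cumulative end offset of each unit's spikes.
spike_times_index = np.array([3, 5, 7])

# units.spike_times[:] corresponds to the flat 1D array of all spike times.
flat = spike_times_data

# units['spike_times'][:] corresponds to a list of per-unit arrays,
# built by cutting the flat array at the index offsets.
starts = np.r_[0, spike_times_index[:-1]]
per_unit = [spike_times_data[s:e] for s, e in zip(starts, spike_times_index)]
# per_unit == [array([0.1, 0.5, 0.9]), array([1.2, 2.0]), array([2.3, 2.7])]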

I'll need to do some more in-depth profiling, but my guess is that the difference in timing between these operations is likely due to: 1) units['spike_times'] having to read the spike_times_index dataset in addition to the spike_times data, and 2) the flat data array being sliced once per unit in order to build the list of arrays.

The latter part may be something that could potentially be improved by reading all the data into memory first and then segmenting the array into its parts, rather than reading the array for each unit separately.
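
As a rough sketch of that idea (an illustration of the workaround, not how hdmf-zarr currently performs the read), assuming the VectorIndex exposes its raw cumulative end offsets via .data:

import numpy as np

# Two bulk reads instead of one small read per unit.
# nwbfile_read comes from the script above.
flat = nwbfile_read.units.spike_times[:]                 # all spike times, flat 1D array
offsets = nwbfile_read.units.spike_times_index.data[:]   # cumulative end offset per unit

# Segment in memory; np.split cuts the flat array at the given positions,
# which reproduces the list of per-unit arrays.
per_unit = np.split(flat, offsets[:-1])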

Profiling script

from hdmf_zarr import NWBZarrIO
from line_profiler import profile

@profile
def main():
    nwb_file = 's3://aind-open-data/ecephys_625098_2022-08-15_08-51-36_nwb_2023-05-16_16-28-23/ecephys_625098_2022-08-15_08-51-36_nwb/ecephys_625098_2022-08-15_08-51-36_experiment1_recording1.nwb.zarr'
    # Anonymous access to the public S3 bucket
    io = NWBZarrIO(nwb_file, "r", storage_options=dict(anon=True))
    nwbfile_read = io.read()
    # Property access (VectorData): one flat array of all spike times
    spike_times = nwbfile_read.units.spike_times[:]
    # String-key access (resolves to the VectorIndex): list of per-unit arrays
    spike_times = nwbfile_read.units['spike_times'][:]

if __name__ == '__main__':
    main()
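
(With line_profiler >= 4.1, which is what provides the importable profile decorator, the per-line timings can be collected by running the script under kernprof, e.g. kernprof -l -v profile_script.py, or with the LINE_PROFILE=1 environment variable set; the filename here is just an example.)
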
dyf commented 8 months ago

Thanks @oruebel. This makes sense but it's pretty surprising behavior given the shape of the API. If there are opportunities to improve and standardize performance that would be great.

oruebel commented 6 months ago

@dyf Just FYI, I've created a separate issue for this here: https://github.com/NeurodataWithoutBorders/nwb_benchmarks/issues/13. We are working on setting up the nwb_benchmarks suite for performance tests, with an initial focus on cloud read. This is still early stage, but I just wanted to let you know that we are working on evaluating performance.