oruebel / ndx-icephys-meta

NWB extensions for organizing intracellular electrophysiology metadata

performance issue when indexing table #17

Open bendichter opened 4 years ago

bendichter commented 4 years ago

It seems the auto-dereferencing dynamic tables code is struggling with these nested tables. Test data is here. On my computer, after setting up the file read

from pynwb import NWBHDF5IO

fpath = '001_140327B01_EXP_ndx.nwb'
io = NWBHDF5IO(fpath, 'r')
nwbfile = io.read()

this line:

nwbfile.ic_conditions[0]

took 57 seconds to create a 1x1 pandas DataFrame. Is this how the API is intended to be used? I think this is because indexing now reads all of the data from the downstream tables into memory while creating the pandas DataFrame.
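For reference, a minimal sketch of how this timing can be reproduced (file name taken from the test data above; exact numbers will of course vary by machine):

import time

from pynwb import NWBHDF5IO

# Time the slow table access; ic_conditions comes from the ndx-icephys-meta
# extension, the file is the test file linked above.
io = NWBHDF5IO('001_140327B01_EXP_ndx.nwb', 'r')
nwbfile = io.read()

t0 = time.time()
df = nwbfile.ic_conditions[0]   # builds a 1x1 pandas DataFrame
print('ic_conditions[0] took %.1f s' % (time.time() - t0))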

oruebel commented 4 years ago

@bendichter thanks for creating the issue. I'll take a look. I'm not sure if the data read is the culprit here or maybe some slow loops in Python. We'll see. Either way, this should not take this long.

oruebel commented 4 years ago

Still working on this one. The creation of the table for the intracellular_recordings is already slow, i.e., nwbfile.intracellular_recordings.to_dataframe(). I'm wondering whether the resolution of links is what is slow here.

oruebel commented 4 years ago

I profiled the call nwbfile.intracellular_recordings[:] and it looks like the main problem is the time spent resolving references (see the profile trace below). This points to two issues:

  1. Since we are not loading any data from the references in this case, we should be able to delay their resolution until we actually need the container. This is a feature that would need to be added in HDMF.
  2. The number of TimeSeries used to store this data is very large. There are 1124 intracellular recordings, and for each of them we have to resolve 2 unique TimeSeries and 1 Electrode reference. The fact that there are 1124 stimuli and 1124 acquisitions is problematic in general (not just for PyNWB): having to iterate over and load data from thousands of objects is expensive. Since the IntracellularRecordings table allows us to select ranges in time of a TimeSeries as a response/stimulus, I think we should ideally "concatenate" sequential recordings from the same electrode (i.e., from a Sweep or, if feasible, maybe even a SweepSequence) into a single PatchClampSeries. This would significantly decrease the number of objects in a file and lead to more reasonably sized TimeSeries; a sketch of this regrouping is shown below. @lvsltz do you think it would be possible to change the example file to organize the data in this way, so we could test how this organization might help to improve performance?
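Roughly what the regrouping in (2) could look like in PyNWB; this is only a sketch with made-up data and names (the electrode, traces, and sampling rate are placeholders), not the extension's API:

from datetime import datetime

import numpy as np
from pynwb import NWBFile
from pynwb.icephys import PatchClampSeries

# Minimal self-contained setup; all names and values are made up.
nwbfile = NWBFile(session_description='demo', identifier='demo',
                  session_start_time=datetime.now().astimezone())
device = nwbfile.create_device(name='amplifier')
electrode = nwbfile.create_icephys_electrode(name='elec0',
                                             description='demo electrode',
                                             device=device)

# Instead of one PatchClampSeries per recording, concatenate the traces of a
# whole sweep into a single series ...
sweep_traces = [np.random.randn(2000) for _ in range(10)]  # placeholder traces
response = PatchClampSeries(name='sweep_000_response',
                            data=np.concatenate(sweep_traces),
                            unit='volts', electrode=electrode, gain=1.0,
                            rate=20000.0)
nwbfile.add_acquisition(response)

# ... and keep track of where each original recording starts/ends, so the
# IntracellularRecordings table can reference (start_index, index_count) ranges.
counts = [len(t) for t in sweep_traces]
starts = np.cumsum([0] + counts[:-1])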

1677224 function calls (1525320 primitive calls) in 87.359 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   87.359   87.359 {built-in method builtins.exec}
        1    0.001    0.001   87.359   87.359 <string>:1(<module>)
        1    0.000    0.000   87.359   87.359 ../hdmf/common/table.py:464(__getitem__)
        4    0.000    0.000   87.347   21.837 ../hdmf/container.py:417(__getitem__)
     3372    0.013    0.000   86.787    0.026 ../hdmf/backends/hdf5/h5_utils.py:92(get_object)
13488/3372    0.078    0.000   86.771    0.026 ../hdmf/utils.py:444(func_call)
     3372    0.027    0.000   86.558    0.026 ../hdmf/backends/hdf5/h5tools.py:337(get_container)
     3372    0.048    0.000   85.780    0.025 ../hdmf/backends/hdf5/h5tools.py:325(get_builder)
     3372   85.556    0.025   85.576    0.025 ../python3.7/site-packages/h5py/_hl/base.py:223(name)
        2    0.007    0.004   60.691   30.345 ../hdmf/backends/hdf5/h5_utils.py:143(__getitem__)
     2248    0.038    0.000   60.680    0.027 ../hdmf/backends/hdf5/h5_utils.py:152(__swap_refs)
     2248    0.081    0.000   60.641    0.027 ../hdmf/backends/hdf5/h5_utils.py:157(__get_ref)
        1    0.000    0.000   26.649   26.649 ../hdmf/backends/hdf5/h5_utils.py:170(__getitem__)
        1    0.040    0.040   26.649   26.649 ../hdmf/backends/hdf5/h5_utils.py:173(<listcomp>)
    13488    0.162    0.000    0.426    0.000 ../hdmf/utils.py:128(__parse_args)
134880/13488    0.167    0.000    0.396    0.000 ../python3.7/copy.py:132(deepcopy)
    13488    0.026    0.000    0.339    0.000 ../python3.7/copy.py:210(_deepcopy_list)
    13488    0.048    0.000    0.260    0.000 ../python3.7/copy.py:236(_deepcopy_dict)
     6744    0.054    0.000    0.242    0.000 ../python3.7/site-packages/h5py/_hl/base.py:216(file)
     3372    0.198    0.000    0.218    0.000 ../python3.7/site-packages/h5py/_hl/group.py:253(__getitem__)
...
...
...
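For reference, a trace like the one above can be generated with Python's cProfile module; a minimal sketch (file name taken from the first comment):

import cProfile
import pstats

from pynwb import NWBHDF5IO

io = NWBHDF5IO('001_140327B01_EXP_ndx.nwb', 'r')
nwbfile = io.read()

# Profile the slow table access and print the 20 most expensive calls.
cProfile.runctx('nwbfile.intracellular_recordings[:]',
                globals(), locals(), filename='icephys_profile')
pstats.Stats('icephys_profile').sort_stats('cumulative').print_stats(20)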
bendichter commented 4 years ago

We are working on a library that may help with (1), lazy_ops, which offers lazy views into h5py datasets that can be transposed and sliced without reading data. We could use these views to create SlicedDynamicTable objects that point to regions of DynamicTables by providing views into portions of a dataset without resolving the underlying links.
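For context, a minimal sketch of the kind of lazy view lazy_ops provides (assuming its DatasetView interface; the dataset path below is only illustrative):

import h5py
from lazy_ops import DatasetView   # pip install lazy_ops

f = h5py.File('001_140327B01_EXP_ndx.nwb', 'r')
# Wrap an HDF5 dataset in a lazy view; slicing or transposing the view does
# not read any data until dsetread() is called.
view = DatasetView(f['/general/intracellular_ephys/intracellular_recordings/id'])
sub = view.lazy_slice[0:10]   # still lazy, nothing read yet
data = sub.dsetread()         # only now is data read from disk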

oruebel commented 4 years ago

@bendichter thanks for the suggestion, I'll take a look. Lazy h5py views may be useful under the hood, but I'll need to take a closer look into how to best optimize this particular problem in HDMF. I think what we actually need here is a view into an already wrapped HDF5 dataset of object references.

However, I still think the more fundamental issue is that creating 1000s of small timeseries is inefficient. This issue is just one place where this shows up as a problem, but it is fundamentally an issue any time we have to iterate over and/or load data from all these different objects. I.e., while I think it is important that we make the resolution of object references in HDMF more efficient, we also need to think about options for storing the TimeSeries data more efficiently.

lvsltz commented 4 years ago

Having to iterate over and load data from thousands of objects is problematic

Indeed, that's one of our reasons for transitioning from Igor .ibw files to NWB, although loading the .ibw files for this particular experiment into memory takes ~7 s.

Since the IntracellularRecordings table allows us to select ranges in time in a TimeSeries as a response/stimulus, I think we should ideally "concatenate" sequential recordings from the same electrode (i.e., from a Sweep or if feasible maybe even a a SweepSequence) into a single PatchClampSeries [...] change the example file to organize the data in this way, so we could test how this organization might help to improve performance?

Isn't it easy for you guys to just read and rewrite one of the example files in the way you think is efficient? I'll give it a try nevertheless.

Also, does that mean we'll lose the original granularity of the data? We'd rather not concatenate for performance reasons just for the user to split after loading the data. Besides, 1000 traces is quite common: in one of the datasets, the median number of traces per experiment is 832 (min: 6, max: 8754). What NWB does under the hood doesn't matter as long as individual traces can easily be retrieved by the user.

oruebel commented 4 years ago

Isn't it easy for you guys to just read and rewrite one of the example files in the way you think it's efficient ? I'll give it a try nevertheless.

The main issue is that I'm not sure which traces should logically be combined. I assume it would probably be one TimeSeries per Sweep, or would it be one per SweepSequence? I think this depends in part on which parameters change between the recordings.

Also, does that mean we'll lose the original granularity of the data?

Using the extension, I think the main way that users would interact with the data is through the tables. The granularity of the data is encoded in the IntracellularRecordings table. Let's take the example of refactoring one of your existing files. The Conditions, Runs, SweepSequences, and Sweeps tables would not change at all. The IntracellularRecordings table would also effectively remain the same (i.e., it would still have the same rows); the only things that would change are the start indices, counts, and references in the stimulus and response columns. The main change would be to the actual TimeSeries, where one would group consecutive recordings. I think it will be very interesting to see what this does for us in terms of performance and number of objects in a file.
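As a purely hypothetical sketch of what adding one row could then look like (the method and keyword names below are assumptions about the extension's API, not confirmed signatures; electrode, response, starts, and counts continue the concatenation sketch earlier in this thread):

# Hypothetical sketch; keyword names are assumptions, not confirmed API.
# Each row keeps its own start index and count into the concatenated series,
# so the per-recording granularity of the table is preserved.
for i in range(len(counts)):
    nwbfile.add_intracellular_recording(
        electrode=electrode,
        response=response,                   # concatenated response series
        response_start_index=int(starts[i]),
        response_index_count=counts[i],
    )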

As I mentioned, I'll look into the performance issues we are seeing with the resolution of references, and I'm confident we can address them at least for access to the tables. But the issue of requiring large numbers of data objects and references is something that we should try to address on the schema side if possible. In general, with the current structure, I think both approaches of one-series-per-recording and one-series-per-sweep should be feasible, so whatever conclusion we come to would ultimately be more of a best practice than a strict requirement. However, having clear evidence for why/when one approach or the other is preferable would, I think, be very informative for us and more broadly useful for users.

lvsltz commented 4 years ago

I understand. By default, the signal is retrieved and stored in small chunks: there are no long timeseries in the first place, and indeed many parameters may and do change between two traces. I'll see what I can do for this test.

oruebel commented 4 years ago

@lvsltz thanks for checking on this! At this point, I'm just trying to see what the potential impact of this would be. For this evaluation, what I would like to figure out is:

  1. what is the impact on performance if we create larger time series
  2. what is the impact on acquisition/generation of the files
  3. what are the implications on the schema

From what it sounds like, the impact on the data generation side is that these are actually acquired as separate series, so we would need to provide users with convenience methods to add new series to a recording that would concatenate the data and add it to the tables. And the implication for the schema may be that parameters vary between recordings, so we need a mechanism to describe this, e.g., by moving parameters from PatchClampSeries to the IntracellularRecordings table. The answer here may very well be that this is too complicated to do or too hard to use, but I still think that this is a useful exercise because it will allow us to give a concrete reasoning for why we chose the solution we chose.

@lvsltz keep me posted on how things are going with generating an example file. In the meantime, I'll work on HDMF (most likely next week) to work out the performance issues on read with the references. Once we have an example file, I can run some performance tests.

oruebel commented 4 years ago

I think for the test you could simply add custom columns to the IntracellularRecordings table for all parameters that vary between recordings. The parameters stored in, e.g., a PatchClampSeries may be wrong in this case (since they may not apply to the whole timeseries), but for this test that should be fine.
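A minimal sketch of what this could look like at file-building time, using HDMF's DynamicTable.add_column (the column name and values are made up):

# Store a per-recording parameter as a custom column of the table instead of
# on each PatchClampSeries; name and values here are made up.
recordings = nwbfile.intracellular_recordings
recordings.add_column(
    name='stimulus_amplitude',
    description='amplitude of the stimulus applied in this recording (pA)',
    data=[10.0] * len(recordings),   # one value per existing row
)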

lvsltz commented 4 years ago

I didn't manage to get this done. Just to make sure before I resume: https://github.com/oruebel/ndx-icephys-meta/issues/25 does not interfere, right? Also, how do I slice a TimeSeries or PatchClampSeries? I've never used epochs before, if that's the answer.

oruebel commented 4 years ago

@lvsltz thanks for keeping me posted. #25 should not interfere, as all that would change is to allow TimeSeries more generally in the table. If anything, I think this might make creating the test file for this issue simpler, because we could just put the recording into a plain TimeSeries and put all other metadata into the IntracellularRecordings table.

lvsltz commented 4 years ago

I sent a link on Slack with an attempt; some metadata were lost, but now there are ~8 PatchClampSeries in the file.

oruebel commented 4 years ago

Looking at the profiling results again (and after doing some more tests), it seems that the majority of the time is spent in:

     3372   85.556    0.025   85.576    0.025 ../python3.7/site-packages/h5py/_hl/base.py:223(name)

I.e., we spend most of the time looking up the name of HDF5 objects. The name (i.e., path) is used in HDMF as the primary key to cache the builders in the HDF5IO backend. I did some simple tests, and it looks like looking up the id of HDF5 objects is much faster than looking up their name. Since we just need a unique identifier for the object, using the id instead of the name should be fine. The following PR is an attempt at using the id instead of the name for caching: https://github.com/hdmf-dev/hdmf/pull/235. With this change, loading the intracellular recordings table via nwbfile.intracellular_recordings[:] takes ~1 s compared to ~80 s with the current dev branch of HDMF.
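A rough illustration of the difference mentioned above, timing .name (which h5py has to reconstruct as a path) against .id for every object in the test file:

import time

import h5py

with h5py.File('001_140327B01_EXP_ndx.nwb', 'r') as f:
    objs = []
    f.visit(lambda path: objs.append(f[path]))   # collect all groups/datasets

    t0 = time.time()
    names = [o.name for o in objs]   # path lookup, expensive
    t1 = time.time()
    ids = [o.id for o in objs]       # low-level identifier, cheap
    t2 = time.time()
    print('.name: %.3f s, .id: %.3f s' % (t1 - t0, t2 - t1))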

Profile with https://github.com/hdmf-dev/hdmf/pull/235

1667108 function calls (1515204 primitive calls) in 0.973 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
134880/13488    0.123    0.000    0.268    0.000 /Users/oruebel/anaconda3/envs/py4nwb/lib/python3.7/copy.py:132(deepcopy)
    13488    0.093    0.000    0.237    0.000 /Users/oruebel/Devel/nwb/hdmf/src/hdmf/utils.py:120(__parse_args)
     3372    0.072    0.000    0.086    0.000 /Users/oruebel/anaconda3/envs/py4nwb/lib/python3.7/site-packages/h5py/_hl/group.py:253(__getitem__)
   320350    0.066    0.000    0.066    0.000 {method 'get' of 'dict' objects}
     6744    0.057    0.000    0.080    0.000 /Users/oruebel/anaconda3/envs/py4nwb/lib/python3.7/site-packages/h5py/_hl/files.py:307(__init__)
     2248    0.037    0.000    0.644    0.000 /Users/oruebel/Devel/nwb/hdmf/src/hdmf/backends/hdf5/h5_utils.py:155(__get_ref)
....

Profile with HDMF dev

 1677009 function calls (1525241 primitive calls) in 79.471 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     3372   77.897    0.023   77.916    0.023 /Users/oruebel/anaconda3/envs/py4nwb/lib/python3.7/site-packages/h5py/_hl/base.py:223(name)
134880/13488    0.157    0.000    0.373    0.000 /Users/oruebel/anaconda3/envs/py4nwb/lib/python3.7/copy.py:132(deepcopy)
    13488    0.147    0.000    0.381    0.000 /Users/oruebel/Devel/nwb/hdmf/src/hdmf/utils.py:120(__parse_args)
     3372    0.127    0.000    0.145    0.000 /Users/oruebel/anaconda3/envs/py4nwb/lib/python3.7/site-packages/h5py/_hl/group.py:253(__getitem__)
     6744    0.106    0.000    0.141    0.000 /Users/oruebel/anaconda3/envs/py4nwb/lib/python3.7/site-packages/h5py/_hl/files.py:307(__init__)
13488/3372    0.072    0.000   79.010    0.023 /Users/oruebel/Devel/nwb/hdmf/src/hdmf/utils.py:436(func_call)
     2248    0.069    0.000   54.663    0.024 /Users/oruebel/Devel/nwb/hdmf/src/hdmf/backends/hdf5/h5_utils.py:155(__get_ref)
     6744    0.048    0.000    0.218    0.000 /Users/oruebel/anaconda3/envs/py4nwb/lib/python3.7/site-packages/h5py/_hl/base.py:216(file)
33720/13488    0.047    0.000    0.094    0.000 /Users/oruebel/Devel/nwb/hdmf/src/hdmf/utils.py:25(__type_okay)
....
oruebel commented 4 years ago

With https://github.com/hdmf-dev/hdmf/pull/235 merged, this gets us much closer to usable performance. I'm leaving this issue open because: (1) I think there are still some other performance issues that, once addressed, will hopefully get us closer to acceptable performance, and (2) I still need to compare performance on the restructured file between the one-timeseries-per-recording and one-timeseries-per-sweep(sequence) organizations.

oruebel commented 4 years ago

While https://github.com/hdmf-dev/hdmf/pull/238 is not directly related to this specific issue, it helps make the file read for the BlueBrain test files quite a bit faster.

lvsltz commented 4 years ago

Benchmarking with some example files that also contain metadata columns:

Do you think we can optimize these?

| File name | Size | # Rows | # Columns | Wall time |
| --- | --- | --- | --- | --- |
| 001_140327B01_EXP_ndx.nwb | 86 MB | 1124 | 156 | 5min 21s |
| 001_140718EXP_B2_ndx.nwb | 96 MB | 1134 | 156 | 5min 11s |
| 001_140627EXP_2_ndx.nwb | 38 MB | 566 | 114 | 2min 3s |
| 001_140709EXP_A1_ndx.nwb | 15 MB | 191 | 114 | 0min 41s |

Files here: https://drive.google.com/open?id=1HQPYqkV6zsyzFoG6qSnHXdkzV4a7HuV2

oruebel commented 4 years ago

@lvsltz thanks for the files and benchmarks.

Do you think we can optimize these?

I do believe that there is ample room for optimizing the performance of this. My guess is that the majority of the time is spent on assembling the data, and it's probably a combination of recursion and nested Python loops that leads to the poor performance. I would suggest the following approach:

  1. As a baseline, I think we should start by looking at the time to load all the data required to construct the table, i.e., just loop through all the tables and load the data for all columns into memory (see the sketch after this list). This should give us a baseline for the best performance we can expect.
  2. We should run the Python profiler again to see where the time is spent in to_denormalized_dataframe(). I think part of the problem may be that the DynamicTables (and potentially individual rows of the tables) are converted to pandas DataFrames and then the frames are assembled, rather than assembling the data directly into a single frame.
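A minimal sketch of the baseline in step 1, restricted to the top-level IntracellularRecordings table (nested tables would be walked the same way); this only uses HDMF's DynamicTable.columns and the columns' .data:

import time

from pynwb import NWBHDF5IO

with NWBHDF5IO('001_140327B01_EXP_ndx.nwb', 'r') as io:
    nwbfile = io.read()
    recordings = nwbfile.intracellular_recordings

    # Baseline: pull the raw data of every column into memory without
    # constructing any pandas DataFrames along the way.
    t0 = time.time()
    raw = {col.name: col.data[:] for col in recordings.columns}
    print('raw read of %d columns took %.1f s' % (len(raw), time.time() - t0))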