bendichter opened this issue 4 years ago
@bendichter thanks for creating the issue. I'll take a look. I'm not sure whether the data read is the culprit here or some slow loops in Python. We'll see. Either way, this should not take this long.
Still working on this one. Even creating the DataFrame for the intracellular_recordings table is slow, i.e., `nwbfile.intracellular_recordings.to_dataframe()`. I'm wondering whether something in the resolution of links is slow.
I profiled the call `nwbfile.intracellular_recordings[:]`, and the main problem seems to be the time spent resolving references (see profile trace below). This points to two issues:
1. Since we are not loading any data from the references in this case, we should be able to delay resolution until we actually need the container. This is a feature that would need to be added in HDMF.
2. The number of TimeSeries used to store this data is very large. There are 1124 intracellular recordings, and for each of them we have to resolve 2 unique TimeSeries and 1 electrode reference. Having 1124 stimulus and 1124 acquisition objects is problematic in general (not just for PyNWB): iterating over and loading data from thousands of objects is expensive. Since the IntracellularRecordings table allows us to select time ranges in a TimeSeries as a response/stimulus, I think we should ideally "concatenate" sequential recordings from the same electrode (i.e., from a Sweep, or if feasible maybe even a SweepSequence) into a single PatchClampSeries. This would significantly decrease the number of objects in a file and lead to more reasonably sized TimeSeries.

@lvsltz do you think it would be possible to change the example file to organize the data in this way, so we could test how this organization might improve performance?
1677224 function calls (1525320 primitive calls) in 87.359 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 87.359 87.359 {built-in method builtins.exec}
1 0.001 0.001 87.359 87.359 <string>:1(<module>)
1 0.000 0.000 87.359 87.359 ../hdmf/common/table.py:464(__getitem__)
4 0.000 0.000 87.347 21.837 ../hdmf/container.py:417(__getitem__)
3372 0.013 0.000 86.787 0.026 ../hdmf/backends/hdf5/h5_utils.py:92(get_object)
13488/3372 0.078 0.000 86.771 0.026 ../hdmf/utils.py:444(func_call)
3372 0.027 0.000 86.558 0.026 ../hdmf/backends/hdf5/h5tools.py:337(get_container)
3372 0.048 0.000 85.780 0.025 ../hdmf/backends/hdf5/h5tools.py:325(get_builder)
3372 85.556 0.025 85.576 0.025 ../python3.7/site-packages/h5py/_hl/base.py:223(name)
2 0.007 0.004 60.691 30.345 ../hdmf/backends/hdf5/h5_utils.py:143(__getitem__)
2248 0.038 0.000 60.680 0.027 ../hdmf/backends/hdf5/h5_utils.py:152(__swap_refs)
2248 0.081 0.000 60.641 0.027 ../hdmf/backends/hdf5/h5_utils.py:157(__get_ref)
1 0.000 0.000 26.649 26.649 ../hdmf/backends/hdf5/h5_utils.py:170(__getitem__)
1 0.040 0.040 26.649 26.649 ../hdmf/backends/hdf5/h5_utils.py:173(<listcomp>)
13488 0.162 0.000 0.426 0.000 ../hdmf/utils.py:128(__parse_args)
134880/13488 0.167 0.000 0.396 0.000 ../python3.7/copy.py:132(deepcopy)
13488 0.026 0.000 0.339 0.000 ../python3.7/copy.py:210(_deepcopy_list)
13488 0.048 0.000 0.260 0.000 ../python3.7/copy.py:236(_deepcopy_dict)
6744 0.054 0.000 0.242 0.000 ../python3.7/site-packages/h5py/_hl/base.py:216(file)
3372 0.198 0.000 0.218 0.000 ../python3.7/site-packages/h5py/_hl/group.py:253(__getitem__)
...
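For reference, a minimal sketch of how a trace like the one above can be produced (the file name is a placeholder, and the ndx-icephys-meta extension is assumed to be installed and registered):

```python
import cProfile
import pstats

from pynwb import NWBHDF5IO

# "icephys_example.nwb" is a placeholder for the test file
io = NWBHDF5IO("icephys_example.nwb", mode="r", load_namespaces=True)
nwbfile = io.read()

# profile the full read of the intracellular_recordings table
cProfile.run("nwbfile.intracellular_recordings[:]", "read_table.prof")
pstats.Stats("read_table.prof").sort_stats("cumulative").print_stats(30)
io.close()
```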
We are working on a library that may help with (1), lazy_ops, which offers lazy views into h5py datasets that can be transposed and sliced without reading the data. We could use these views to create `SlicedDynamicTable` objects that point to regions of `DynamicTable`s by providing views into portions of a dataset without resolving the underlying links.
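For illustration, a sketch of the kind of lazy view lazy_ops provides (file name and dataset path are placeholders; the API is from the lazy_ops README and may differ across versions):

```python
import h5py
from lazy_ops import DatasetView

f = h5py.File("example.nwb", "r")  # placeholder file
dset = f["/acquisition/series_000/data"]  # placeholder dataset path

view = DatasetView(dset)       # wraps the dataset; nothing is read yet
sub = view.lazy_slice[0:1000]  # records the slice lazily
arr = sub.dsetread()           # only now is data read from disk
```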
@bendichter thanks for the suggestion, I'll take a look. Lazy h5py views may be useful under the hood, but I'll need to take a closer look at how to best optimize this particular problem in HDMF. I think what we actually need here is a view into an already wrapped HDF5 dataset of object references.

However, I still think the more fundamental issue is that creating thousands of small time series is inefficient. This issue is just one place where that shows up as a problem; it is fundamentally an issue any time we have to iterate over and/or load data from all these different objects. I.e., while it is important that we make the resolution of object references in HDMF more efficient, I think we also need to think about options for storing the TimeSeries data more efficiently.
> Having to iterate over and load data from thousands of objects is problematic
Indeed, that's one of our reasons for transitioning from Igor .ibw files to NWB. Although loading the .ibw files for this particular experiment into memory takes ~7 s.
> Since the IntracellularRecordings table allows us to select time ranges in a TimeSeries as a response/stimulus, I think we should ideally "concatenate" sequential recordings from the same electrode (i.e., from a Sweep, or if feasible maybe even a SweepSequence) into a single PatchClampSeries [...] change the example file to organize the data in this way, so we could test how this organization might improve performance?
Isn't it easy for you guys to just read and rewrite one of the example files in the way you think is efficient? I'll give it a try nevertheless.

Also, does that mean we'll lose the original granularity of the data? We'd rather not concatenate for performance reasons just for the user to have to split the data again after loading. Besides, 1000 traces is quite common: in one of the datasets, the median number of traces per experiment is 832 (min: 6, max: 8754). What NWB does under the hood doesn't matter as long as individual traces can easily be retrieved by the user.
> Isn't it easy for you guys to just read and rewrite one of the example files in the way you think is efficient? I'll give it a try nevertheless.
The main issue is that I'm not sure which traces should logically be combined. I assume it would be one TimeSeries per Sweep, or would it be one per SweepSequence? I think this depends in part on which parameters change between the recordings.
> Also, does that mean we'll lose the original granularity of the data?
Using the extension, I think the main way users would interact with the data is through the tables, and the granularity of the data is encoded in the IntracellularRecordings table. Let's take the example of refactoring one of your existing files. The Conditions, Runs, SweepSequences, and Sweeps tables would not change at all. The IntracellularRecordings table would also effectively remain the same (i.e., it would still have the same rows); the only things that would change are the start indices, counts, and references in the stimulus and response columns. The main change would be to the actual TimeSeries, where one would group consecutive recordings. I think it will be very interesting to see what this does for us in terms of performance and the number of objects in a file. A rough sketch of this layout is shown below.
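The sketch below is written against the icephys-table API that this extension later contributed to PyNWB 2.x (method and argument names may differ slightly in the ndx-icephys-meta version); `nwbfile`, `electrode`, and `response` are assumed to have been set up already:

```python
import numpy as np
from pynwb.icephys import VoltageClampStimulusSeries

# hypothetical: the individual stimulus traces of one sweep, concatenated
trace_list = [np.random.randn(1000) for _ in range(5)]
stim = VoltageClampStimulusSeries(
    name="stim_sweep_000",
    data=np.concatenate(trace_list),
    starting_time=0.0,
    rate=20000.0,
    electrode=electrode,  # assumed: an existing IntracellularElectrode
    gain=0.02,
)
nwbfile.add_stimulus(stim)

# one table row per original recording: the rows stay the same,
# only the (start index, count) windows into the series change
start = 0
for trace in trace_list:
    nwbfile.add_intracellular_recording(
        electrode=electrode,
        stimulus=stim,
        stimulus_start_index=start,
        stimulus_index_count=len(trace),
        response=response,  # assumed: a response series built the same way
        response_start_index=start,
        response_index_count=len(trace),
    )
    start += len(trace)
```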
As I mentioned, I'll look into the performance issues we are seeing with the resolution of references, and I'm confident we can address them, at least on access to the tables. But the issue of requiring large numbers of data objects and references is something we should try to address on the schema side if possible. In general, with the current structure, both approaches (one series per recording or one series per sweep) should be feasible, so whatever conclusion we come to would ultimately be more of a best practice than a strict requirement. However, having clear evidence for why/when one approach or the other is preferable would, I think, be very informative for us and broadly useful for users.
I understand. By default, the signal is retrieved and stored in small chunks: there is no such thing as a long time series in the first place, and indeed many parameters may and do change between two traces. I'll see what I can do for this test.
@lvsltz thanks for checking on this! At this point, I'm just trying to see what the potential impact of this would be. For this evaluation, I would like to figure out both the impact on data generation and the implications for the schema. From what it sounds like, the impact on the data-generation side is that these are actually acquired as separate series, so we would need to provide users with convenience methods that concatenate new series onto a recording and add them to the tables. The implication for the schema may be that, since parameters vary between recordings, we need a mechanism to describe this, e.g., by moving parameters from PatchClampSeries to the IntracellularRecordings table. The answer may well be that this is too complicated to do or too hard to use, but I still think this is a useful exercise because it will allow us to give concrete reasoning for why we chose the solution we chose.
@lvsltz keep me posted on how things are going with generating an example file. In the meantime, I'll work on HDMF (most likely next week) to sort out the performance issues on read with the references. Once we have an example file, I can run some performance tests.

I think for the test you could simply add custom columns to the IntracellularRecordings table for all parameters that vary between recordings. The parameters stored in, e.g., a PatchClampSeries may be wrong in this case (since they may not apply to the whole time series), but for this test that should be fine. Something like the sketch below.
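A sketch of what that might look like (the column name and data are made up; `add_column` is the standard HDMF DynamicTable method):

```python
# hypothetical per-recording parameter that varies between traces;
# one value per row of the IntracellularRecordings table
nwbfile.intracellular_recordings.add_column(
    name="holding_potential",
    description="holding potential for this recording, in volts",
    data=holding_potentials,  # assumed: list with one float per recording
)
```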
I didn't manage to get this done. Just to make sure before I resume: https://github.com/oruebel/ndx-icephys-meta/issues/25 does not interfere, right? Also, how do I slice a TimeSeries or PatchClampSeries? I've never used epochs before, if that's the answer.
@lvsltz thanks for keeping me posted. #25 should not interfere, as all that would change is to allow TimeSeries more generally in the table. If anything, I think this might make creating the test file for this issue simpler, because we could just put the recording into a plain TimeSeries and put all other metadata into the IntracellularRecordings table.
I sent a link on Slack with an attempt where some metadata were lost, but now there are ~8 PatchClampSeries in the file.
Looking at the profiling results again (and after doing some more tests), it seems that the majority of the time is spent in:

3372 85.556 0.025 85.576 0.025 ../python3.7/site-packages/h5py/_hl/base.py:223(name)

I.e., we spend most of the time looking up the `name` of HDF5 objects. The `name` (i.e., path) is used in HDMF as the primary key for caching builders in the HDF5IO backend. I did some simple tests, and it looks like looking up the `id` of HDF5 objects is much faster than looking up their `name`. Since we just need a unique identifier for the object, using the `id` instead of the `name` should be fine. The following PR is an attempt at using the `id` instead of the `name` for caching: https://github.com/hdmf-dev/hdmf/pull/235. With this change, loading the intracellular recordings table via `nwbfile.intracellular_recordings[:]` takes ~1 s, compared to ~80 s with the current dev branch of HDMF.
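The asymmetry is easy to reproduce (a sketch; absolute numbers will vary): h5py reconstructs the path for `.name` by querying the HDF5 library on every access, whereas `.id` simply returns the low-level identifier the object already holds.

```python
import timeit

import h5py
import numpy as np

with h5py.File("tmp.h5", "w") as fw:
    fw.create_dataset("a/b/c", data=np.arange(10))

f = h5py.File("tmp.h5", "r")
dset = f["a/b/c"]

# .name asks the HDF5 library to look up the object's path each time
print("name:", timeit.timeit(lambda: dset.name, number=10_000))
# .id is the low-level identifier h5py already holds
print("id:  ", timeit.timeit(lambda: dset.id, number=10_000))
f.close()
```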
Profile with https://github.com/hdmf-dev/hdmf/pull/235
1667108 function calls (1515204 primitive calls) in 0.973 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
134880/13488 0.123 0.000 0.268 0.000 /Users/oruebel/anaconda3/envs/py4nwb/lib/python3.7/copy.py:132(deepcopy)
13488 0.093 0.000 0.237 0.000 /Users/oruebel/Devel/nwb/hdmf/src/hdmf/utils.py:120(__parse_args)
3372 0.072 0.000 0.086 0.000 /Users/oruebel/anaconda3/envs/py4nwb/lib/python3.7/site-packages/h5py/_hl/group.py:253(__getitem__)
320350 0.066 0.000 0.066 0.000 {method 'get' of 'dict' objects}
6744 0.057 0.000 0.080 0.000 /Users/oruebel/anaconda3/envs/py4nwb/lib/python3.7/site-packages/h5py/_hl/files.py:307(__init__)
2248 0.037 0.000 0.644 0.000 /Users/oruebel/Devel/nwb/hdmf/src/hdmf/backends/hdf5/h5_utils.py:155(__get_ref)
....
Profile with HDMF dev
1677009 function calls (1525241 primitive calls) in 79.471 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
3372 77.897 0.023 77.916 0.023 /Users/oruebel/anaconda3/envs/py4nwb/lib/python3.7/site-packages/h5py/_hl/base.py:223(name)
134880/13488 0.157 0.000 0.373 0.000 /Users/oruebel/anaconda3/envs/py4nwb/lib/python3.7/copy.py:132(deepcopy)
13488 0.147 0.000 0.381 0.000 /Users/oruebel/Devel/nwb/hdmf/src/hdmf/utils.py:120(__parse_args)
3372 0.127 0.000 0.145 0.000 /Users/oruebel/anaconda3/envs/py4nwb/lib/python3.7/site-packages/h5py/_hl/group.py:253(__getitem__)
6744 0.106 0.000 0.141 0.000 /Users/oruebel/anaconda3/envs/py4nwb/lib/python3.7/site-packages/h5py/_hl/files.py:307(__init__)
13488/3372 0.072 0.000 79.010 0.023 /Users/oruebel/Devel/nwb/hdmf/src/hdmf/utils.py:436(func_call)
2248 0.069 0.000 54.663 0.024 /Users/oruebel/Devel/nwb/hdmf/src/hdmf/backends/hdf5/h5_utils.py:155(__get_ref)
6744 0.048 0.000 0.218 0.000 /Users/oruebel/anaconda3/envs/py4nwb/lib/python3.7/site-packages/h5py/_hl/base.py:216(file)
33720/13488 0.047 0.000 0.094 0.000 /Users/oruebel/Devel/nwb/hdmf/src/hdmf/utils.py:25(__type_okay)
....
With https://github.com/hdmf-dev/hdmf/pull/235 merged, this gets us much closer to usable performance. I'm leaving this issue open because: (1) I think there are still some other performance issues whose resolution will hopefully get us to acceptable performance, and (2) I still need to compare performance on the restructured file, i.e., one-timeseries-per-recording vs. one-timeseries-per-sweep(sequence).
While https://github.com/hdmf-dev/hdmf/pull/238 is not directly related to this specific issue, it helps make the file read for the BlueBrain test files quite a bit faster.
Benchmarking with some example files that also contain metadata columns, using `df = nwbfile.get_icephys_meta_parent_table().to_denormalized_dataframe()` (# Rows is the number of intracellular recordings; # Columns is the total number of columns across all icephys tables). Do you think we can optimize these?
File name | Size | # Rows | # Columns | Wall time |
---|---|---|---|---|
001_140327B01_EXP_ndx.nwb | 86 MB | 1124 | 156 | 5min 21s |
001_140718EXP_B2_ndx.nwb | 96 MB | 1134 | 156 | 5min 11s |
001_140627EXP_2_ndx.nwb | 38 MB | 566 | 114 | 2min 3s |
001_140709EXP_A1_ndx.nwb | 15 MB | 191 | 114 | 0min 41s |
Files here: https://drive.google.com/open?id=1HQPYqkV6zsyzFoG6qSnHXdkzV4a7HuV2
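For reference, a sketch of how these timings can be reproduced (assuming a PyNWB/extension version that provides `get_icephys_meta_parent_table`):

```python
import time

from pynwb import NWBHDF5IO

with NWBHDF5IO("001_140327B01_EXP_ndx.nwb", mode="r", load_namespaces=True) as io:
    nwbfile = io.read()
    t0 = time.perf_counter()
    df = nwbfile.get_icephys_meta_parent_table().to_denormalized_dataframe()
    print(f"{df.shape[0]} rows x {df.shape[1]} cols in {time.perf_counter() - t0:.1f} s")
```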
@lvsltz thanks for the files and benchmarks.
> Do you think we can optimize these?
I do believe there is ample room for optimizing the performance of this. My guess is that the majority of the time is spent on assembling the data, and it's probably a problem of both recursion and nested Python loops that leads to the poor performance. I would suggest the following approach:
It seems the auto-dereferencing dynamic tables code is struggling with these nested tables. Test data is here. On my computer, after setting up the file read
this line:
took 57 seconds to create a 1x1 pandas DataFrame. Is this how the API is intended to be used? I think this is due to the fact that indexing the downstream tables now reads all of that data into memory while creating a pandas DataFrame.