Open cboulay opened 2 years ago
Hey I like the overall idea. It requires some work to implement so I will keep the issue open.
Still, the main challenge here is to keep pynapple simple and easy for new users while still providing a performance boost for advanced users.
I can see that pynapple's overall performance will become an issue in the near future, so I will try to work on this as soon as possible.
I don't have a good understanding of how independent these objects are supposed to be.
After looking at the code a little more, I think that my original proposal is incorrect. An IntervalSet could be used to restrict multiple TimeStamp or TimeSeries objects (e.g., simultaneous spike trains, task events, and ECoG), so it's not so simple to store references to only the spike train.
I suppose the references could be encapsulated in some dict or other structure that allows for arbitrary reference groups. However, now we're inflating the complexity of IntervalSet.
Another solution, and one that might be possible via plugin, could be to borrow concepts from relational databases. For context, here's a DB that I use for different purposes: https://github.com/cboulay/SERF . Take a look at DatumFeatureValue, for example; its whole purpose is to store pairs of keys pointing to 2 independent objects and a value associated with that relation.
In pynapple, there could be a new object, also inheriting from pandas dataframe, that stores pairs of references + some data. In this example, the references would point to an IntervalSet and to a spike train, and the data would be the n_channels x 2 int array of offsets and counts.
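To make the relation idea concrete, here is a rough sketch. It uses plain dataclasses rather than a pandas subclass just to stay dependency-free, and every name in it (`SliceRelation`, `RelationTable`, the string keys) is hypothetical and not part of pynapple or SERF.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SliceRelation:
    """Hypothetical row linking one interval set to one time-series object."""
    interval_key: str      # identifies the IntervalSet (assumed naming scheme)
    train_key: str         # identifies the spike train or other series
    offsets_counts: tuple  # one (offset, count) pair per channel


@dataclass
class RelationTable:
    """Hypothetical container keyed by the (interval, series) pair."""
    rows: dict = field(default_factory=dict)

    def add(self, relation):
        self.rows[(relation.interval_key, relation.train_key)] = relation

    def lookup(self, interval_key, train_key):
        return self.rows[(interval_key, train_key)].offsets_counts


table = RelationTable()
# The same interval set can be related to several independent objects.
table.add(SliceRelation("trial_1", "probe_A", ((2, 2), (1, 2))))
table.add(SliceRelation("trial_1", "ecog", ((0, 5),)))
```

The point of keying on the pair is that the IntervalSet itself stays unchanged; the per-channel offsets and counts live in the relation, not in either object.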
As discussed in the webinar: When working with very large recordings, it is impractical to load the entire multi-channel spike train (hundreds of channels, multiple hours of recordings) into memory on each processing node in a cluster, especially if each node is only interested in a small segment of data like a single trial.
However, if each node in the cluster had a handle to the NWB file* and a reference to its IntervalSet, and the IntervalSet knew its per-channel offsets and the number of spikes that belonged to that set, it would be possible to load only the relevant slice of spike times into memory. (This assumes that the IntervalSet is a first-class object in the h5 hierarchy and can be accessed directly.)
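A minimal sketch of what I mean by offsets and counts, assuming each channel's spike times are stored sorted. Plain lists stand in for the on-disk arrays here, and the helper names `offsets_and_counts` and `load_slice` are made up for illustration; an h5py dataset accepts the same slice syntax, so only the requested range would be read.

```python
from bisect import bisect_left, bisect_right


def offsets_and_counts(spike_times_per_channel, start, end):
    """Return one (offset, count) pair per channel for spikes in [start, end]."""
    table = []
    for times in spike_times_per_channel:  # each channel must be sorted ascending
        offset = bisect_left(times, start)          # index of first spike >= start
        count = bisect_right(times, end) - offset   # number of spikes <= end
        table.append((offset, count))
    return table


def load_slice(channel_store, offset, count):
    """Read only the spikes described by (offset, count)."""
    return channel_store[offset:offset + count]


# Two toy channels standing in for a large multi-channel recording.
spikes = [
    [0.1, 0.5, 1.2, 2.0, 3.7],
    [0.3, 1.1, 1.9, 4.2],
]
trial = (1.0, 2.5)  # a single-interval "IntervalSet", in seconds

# Computed once (e.g. when the interval set is created) and stored with it.
index = offsets_and_counts(spikes, *trial)

# Later, on a worker node: load only the relevant slice of each channel.
trial_spikes = [load_slice(ch, off, n) for ch, (off, n) in zip(spikes, index)]
```

The index is an n_channels x 2 table of ints, so it is cheap to ship to every node even when the spike trains themselves are not.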
There might be a way of supporting minimal spike train loading on a cluster and it might fit better in pynapples-on-fire, but I thought I'd submit the issue here first because this would be easier to implement in the core and it might have some benefits for other processes.