the-siesta-group / edfio

Read and write EDF/EDF+ files.
Apache License 2.0
25 stars 5 forks source link

Implement lazy loading of raw data for local files #48

Closed marcoross closed 2 months ago

cbrnr commented 2 months ago

This looks really good! Not entirely related, but I was wondering if there is some public attribute that reflects the status of the data (i.e. loaded vs. not loaded). Sometimes, it would also be useful to know the total size of the Edf object in memory. I don't think this is currently possible, right?

Also, I assume loaded vs. not loaded applies to all signals that are part of an Edf objects? That is, either the entire object is loaded or not (as opposed to some selected signals). Correct?

Finally, and this is now least related to this PR (so let me know if you would like me to open a new issue), what is the public way to access the actual data? There's no such thing as Edf.data, and Edf.get_signal() requires a label. Sorry if I'm missing something obvious!

hofaflo commented 2 months ago

This looks really good! Not entirely related, but I was wondering if there is some public attribute that reflects the status of the data (i.e. loaded vs. not loaded).

Yes, this could be useful, I'm not sure how to best show it though, since (with this PR) lazy loading is done on a per-signal basis :thinking:

Sometimes, it would also be useful to know the total size of the Edf object in memory. I don't think this is currently possible, right?

Currently not, no... maybe this could be combined with your first suggestion, i.e. display the size of the loaded signals and the total size on disk? E.g. <Edf 20 signals 0 annotations 50/100 MB loaded>

Also, I assume loaded vs. not loaded applies to all signals that are part of an Edf objects? That is, either the entire object is loaded or not (as opposed to some selected signals). Correct?

With the implementation suggested in this PR, it could be anything in between as well.

Finally, and this is now least related to this PR (so let me know if you would like me to open a new issue), what is the public way to access the actual data? There's no such thing as Edf.data, and Edf.get_signal() requires a label. Sorry if I'm missing something obvious!

Right, currently there's nothing like that. While this could be nice for recordings with uniform sampling frequencies (just return a 2d-array), I'm not sure how to best treat those with differing ones (a list of arrays? fail? ...?). What use case are you thinking about for this? (-> a new issue would be nice here, yes)

cbrnr commented 2 months ago

I've created a new issue to discuss accessing the data array in #49.

Regarding the other points, I think it would be nice not to overcomplicate the API. So you are saying it is currently possible to have some signals loaded in memory and some not, because each signal is treated as a separate EdfSignal (which is basically a memory-mapped NumPy array)? I'm thinking that for users, it might be useful to know how much memory their entire Edf object currently uses.

On the other hand, when loading an EDF file, either no signals or all signals are loaded in memory, so there is no way to influence individual signals being loaded with read_edf(). I guess the only way to do load individual signals (from a lazy object) is to access a signal, but then the question is which methods trigger/require loading the data?

hofaflo commented 2 months ago

[...] each signal is treated as a separate EdfSignal (which is basically a memory-mapped NumPy array)?

Exactly, with this PR the array stays memory mapped until the data is accessed for the first time.

I'm thinking that for users, it might be useful to know how much memory their entire Edf object currently uses.

Definitely! Is the suggestion for the extended repr in my above comment more or less what you're thinking about here?

[...] which methods trigger/require loading the data?

With the currently suggested implementation, this would be

EDIT: slicing operations would also require loading the data (thanks @cbrnr!):

cbrnr commented 2 months ago

Definitely! Is the suggestion for the extended repr in my above comment more or less what you're thinking about here?

Yes, this would be useful! I'd also expose this in an attribute for convenience. Plus, each underlying EdfSignal should also show its memory consumption in its repr (and attribute).

With the currently suggested implementation, this would be ...

What about slicing between seconds or annotations? In any case, if the current memory consumption is available, users will always be able to find out which operation loads the data!

hofaflo commented 2 months ago

Yes, this would be useful! I'd also expose this in an attribute for convenience. Plus, each underlying EdfSignal should also show its memory consumption in its repr (and attribute).

:+1: Feel free to open a PR for that, ideally once this one is merged!

What about slicing between seconds or annotations?

Right, thanks for pointing this out! I'll edit the above list. For non-annotation signals this could even be done without loading the data (as long as the desired slice only contains complete datarecords), though that would complicate the implementation a bit.

cbrnr commented 2 months ago

Another question that I had was that if read_edf() loads the signal into memory, it actually loads the digital data, right? So I'm not sure how surprising it will be that the physical data consumes four times as much?

What I'm trying to say is that this is becoming a bit complicated already, given that the original intention of lazy loading was that people sometimes work with EDF headers only, so they don't need the data. Maybe a simpler option would be to add a separate function read_edf_header() instead?

hofaflo commented 2 months ago

@cbrnr, let's move the discussion about memory consumption information into a new issue!

cbrnr commented 2 months ago

Sure! I just thought that maybe the current implementation is not even necessary because it adds too much complexity...

hofaflo commented 2 months ago

For the original use case described in #47 you're definitely right :D However, it's a nice way to also speed up working with only a (small) subset of signals.