Closed marcoross closed 2 months ago
This looks really good! Not entirely related, but I was wondering if there is some public attribute that reflects the status of the data (i.e. loaded vs. not loaded).
Yes, this could be useful, I'm not sure how to best show it though, since (with this PR) lazy loading is done on a per-signal basis :thinking:
Sometimes, it would also be useful to know the total size of the `Edf` object in memory. I don't think this is currently possible, right?
Currently not, no... maybe this could be combined with your first suggestion, i.e., display the size of the loaded signals and the total size on disk? E.g., `<Edf 20 signals 0 annotations 50/100 MB loaded>`
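A minimal sketch of what such an extended repr could look like. Everything here is illustrative: `Edf` is stubbed out, and `_loaded_bytes` / `_disk_bytes` are hypothetical attribute names, not part of the actual API.

```python
# Hypothetical sketch of the proposed extended repr.
# The attribute names are invented for illustration only.
class Edf:
    def __init__(self, num_signals, num_annotations, loaded_bytes, disk_bytes):
        self._num_signals = num_signals
        self._num_annotations = num_annotations
        self._loaded_bytes = loaded_bytes  # bytes currently in memory
        self._disk_bytes = disk_bytes      # total signal data size on disk

    def __repr__(self):
        mb = 1024 * 1024
        return (
            f"<Edf {self._num_signals} signals "
            f"{self._num_annotations} annotations "
            f"{self._loaded_bytes // mb}/{self._disk_bytes // mb} MB loaded>"
        )


print(Edf(20, 0, 50 * 1024 * 1024, 100 * 1024 * 1024))
# <Edf 20 signals 0 annotations 50/100 MB loaded>
```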
Also, I assume loaded vs. not loaded applies to all signals that are part of an `Edf` object? That is, either the entire object is loaded or not (as opposed to some selected signals). Correct?
With the implementation suggested in this PR, it could be anything in between as well.
Finally, and this is now least related to this PR (so let me know if you would like me to open a new issue), what is the public way to access the actual data? There's no such thing as `Edf.data`, and `Edf.get_signal()` requires a label. Sorry if I'm missing something obvious!
Right, currently there's nothing like that. While this could be nice for recordings with uniform sampling frequencies (just return a 2d-array), I'm not sure how to best treat those with differing ones (a list of arrays? fail? ...?). What use case are you thinking about for this? (-> a new issue would be nice here, yes)
I've created a new issue to discuss accessing the data array in #49.
Regarding the other points, I think it would be nice not to overcomplicate the API. So you are saying it is currently possible to have some signals loaded in memory and some not, because each signal is treated as a separate `EdfSignal` (which is basically a memory-mapped NumPy array)? I'm thinking that for users, it might be useful to know how much memory their entire `Edf` object currently uses.

On the other hand, when loading an EDF file, either no signals or all signals are loaded in memory, so there is no way to influence individual signals being loaded with `read_edf()`. I guess the only way to load individual signals (from a lazy object) is to access a signal, but then the question is which methods trigger/require loading the data?
> [...] each signal is treated as a separate `EdfSignal` (which is basically a memory-mapped NumPy array)?
Exactly, with this PR the array stays memory-mapped until the data is accessed for the first time.
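The idea can be sketched roughly like this, assuming nothing about the actual implementation: the digital samples stay behind an `np.memmap` until first access, at which point they are materialized into RAM. `LazySignal`, `_raw`, and `loaded` are invented names for illustration, not edfio API.

```python
import tempfile

import numpy as np


class LazySignal:
    """Illustrative lazy-loading wrapper, NOT the actual edfio implementation."""

    def __init__(self, path, dtype="<i2", offset=0):
        # np.memmap only maps the file; no samples are read into RAM here.
        self._raw = np.memmap(path, dtype=dtype, mode="r", offset=offset)
        self._data = None  # filled on first access

    @property
    def loaded(self):
        return self._data is not None

    @property
    def data(self):
        if self._data is None:
            # First access: copy the memory-mapped samples into RAM.
            self._data = np.asarray(self._raw).copy()
        return self._data


# Demo: write ten int16 samples and confirm nothing is materialized up front.
samples = np.arange(10, dtype="<i2")
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as f:
    f.write(samples.tobytes())

sig = LazySignal(f.name)
assert not sig.loaded                         # still just a memory map
assert sig.data.tolist() == list(range(10))   # first access loads the data
assert sig.loaded
```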
> I'm thinking that for users, it might be useful to know how much memory their entire `Edf` object currently uses.
Definitely! Is the suggestion for the extended repr in my above comment more or less what you're thinking about here?
> [...] which methods trigger/require loading the data?
With the currently suggested implementation, this would be

- `EdfSignal.data`
- `EdfSignal.digital_data`
- `EdfSignal.update_data()`
- `Edf.update_data_record_duration()` (all signals)
- `Edf.write()` (all signals)
- `Edf.annotations` (all annotation signals)
- `Edf.starttime` (timekeeping annotation signal)

EDIT: slicing operations would also require loading the data (thanks @cbrnr!):

- `Edf.slice_between_annotations()` (all signals)
- `Edf.slice_between_seconds()` (all signals)

> Definitely! Is the suggestion for the extended repr in my above comment more or less what you're thinking about here?
Yes, this would be useful! I'd also expose this in an attribute for convenience. Plus, each underlying `EdfSignal` should also show its memory consumption in its repr (and attribute).
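A rough sketch of what such a per-signal attribute could look like; `memory_usage` and `_digital` are hypothetical names chosen for illustration, not the actual edfio API.

```python
import numpy as np


class EdfSignal:
    """Illustrative stub showing a possible memory-consumption attribute."""

    def __init__(self, label):
        self.label = label
        self._digital = None  # samples not loaded yet

    @property
    def memory_usage(self):
        """Bytes currently held in RAM for this signal (0 if not loaded)."""
        return 0 if self._digital is None else self._digital.nbytes

    def __repr__(self):
        return f"<EdfSignal {self.label!r} {self.memory_usage} B loaded>"
```

With this, `repr(sig)` reflects 0 bytes before loading and the array's `nbytes` afterwards, so users can always see what an operation cost them.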
> With the currently suggested implementation, this would be ...
What about slicing between seconds or annotations? In any case, if the current memory consumption is available, users will always be able to find out which operation loads the data!
> Yes, this would be useful! I'd also expose this in an attribute for convenience. Plus, each underlying `EdfSignal` should also show its memory consumption in its repr (and attribute).
:+1: Feel free to open a PR for that, ideally once this one is merged!
> What about slicing between seconds or annotations?
Right, thanks for pointing this out! I'll edit the above list. For non-annotation signals this could even be done without loading the data (as long as the desired slice only contains complete datarecords), though that would complicate the implementation a bit.
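The "complete data records" condition mentioned above boils down to simple arithmetic: a slice given in seconds maps to whole data records exactly when its start and stop are multiples of the record duration, in which case the memory map could be re-windowed without materializing any samples. A hedged sketch of that check (the function name is invented for illustration):

```python
def record_slice(start_s, stop_s, record_duration_s):
    """Map a slice in seconds to data-record indices, if it aligns exactly.

    Illustrative helper only; raises ValueError when the slice would cut
    through a data record and thus require loading/re-packing the samples.
    """
    start, start_rem = divmod(start_s, record_duration_s)
    stop, stop_rem = divmod(stop_s, record_duration_s)
    if start_rem or stop_rem:
        raise ValueError("slice does not align with complete data records")
    return int(start), int(stop)
```

For example, with 2-second records, slicing from 2 s to 10 s covers records 1 through 5 exactly, while 1 s to 3 s cuts a record in half and would need the data.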
Another question: if `read_edf()` loads the signal into memory, it actually loads the digital data, right? So I'm not sure how surprising it will be that the physical data consumes four times as much?

What I'm trying to say is that this is becoming a bit complicated already, given that the original intention of lazy loading was that people sometimes work with EDF headers only, so they don't need the data. Maybe a simpler option would be to add a separate function `read_edf_header()` instead?
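For the header-only use case, such a function could stay very small, since the fixed part of an EDF header is always the first 256 bytes of the file. A sketch under that assumption; `read_edf_header` and the returned dict layout are hypothetical here, not an existing edfio function:

```python
def read_edf_header(path):
    """Read only the 256-byte fixed EDF header; never touches signal data.

    Illustrative sketch based on the EDF specification's fixed header layout.
    """
    with open(path, "rb") as f:
        raw = f.read(256)
    return {
        "version": raw[0:8].decode("ascii").strip(),
        "patient": raw[8:88].decode("ascii").strip(),
        "recording": raw[88:168].decode("ascii").strip(),
        "startdate": raw[168:176].decode("ascii").strip(),
        "starttime": raw[176:184].decode("ascii").strip(),
        "num_data_records": int(raw[236:244]),
        "data_record_duration": float(raw[244:252]),
        "num_signals": int(raw[252:256]),
    }
```

Since this never opens a memory map at all, it sidesteps the loaded/not-loaded bookkeeping entirely for the header-only workflow.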
@cbrnr, let's move the discussion about memory consumption information into a new issue!
Sure! I just thought that maybe the current implementation is not even necessary because it adds too much complexity...
For the original use case described in #47 you're definitely right :D However, it's a nice way to also speed up working with only a (small) subset of signals.