Add standardized framework for wearable validation

raphaelvallat / yasa

YASA (Yet Another Spindle Algorithm): a Python package to analyze polysomnographic sleep recordings.

https://raphaelvallat.com/yasa/

BSD 3-Clause "New" or "Revised" License

417 stars 113 forks

Add standardized framework for wearable validation #78

Closed raphaelvallat closed 6 months ago

raphaelvallat commented 2 years ago

I think it could be useful to have a Python implementation of the analytical pipeline for testing sleep-tracking wearables, originally developed (in R) by @Luca-Menghini and @SRI-human-sleep: https://github.com/SRI-human-sleep/sleep-trackers-performance

Menghini, Cellini, Goldstone, Baker and de Zambotti, A standardized framework for testing the performance of sleep-tracking technology: step-by-step guidelines and open-source code, Sleep (2021), https://doi.org/10.1093/sleep/zsaa170

More broadly, this analytical pipeline could be used to compare the performance of any sleep staging algorithm against a ground-truth reference (with 2-, 4- or 5-stages). We should also support evaluating the performance against a ground-truth consensus scoring (i.e. 2 or more experts per record).
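As a rough illustration of the consensus idea, the sketch below derives a majority-vote hypnogram from several scorers with pandas. The column names, the toy stages and the tie-breaking rule (fall back to the first scorer) are assumptions made for this example; they are not taken from YASA or from the SRI pipeline.

```python
import pandas as pd

# Toy epoch-by-epoch scorings from three hypothetical experts (one column each).
scores = pd.DataFrame(
    {
        "scorer_1": ["W", "N1", "N2", "N2", "R"],
        "scorer_2": ["W", "N2", "N2", "N3", "R"],
        "scorer_3": ["W", "N1", "N2", "N1", "W"],
    }
)


def majority_vote(row: pd.Series) -> str:
    """Return the most frequent stage; on a tie, fall back to the first scorer."""
    counts = row.value_counts()
    winners = counts[counts == counts.max()].index
    # Assumed tie-break rule: trust the first column when no stage has a strict majority.
    return winners[0] if len(winners) == 1 else row.iloc[0]


# One consensus stage per epoch, usable as the ground-truth reference.
consensus = scores.apply(majority_vote, axis=1)
print(consensus.tolist())
```

Any algorithm or device could then be evaluated against `consensus` exactly as it would be against a single scorer.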

I would love some help on this if anyone would like to contribute.

remrama commented 2 years ago

Oh I love this. I'm messing with some actigraphy these days, as well as multirater agreement for something else, so I would be happy to try and work on this. I think this pipeline (especially the consensus scoring) will be very valuable since it's easy to predict that reviewers will ask for such stuff in a manuscript that uses YASA. I'm not sure about a timeframe I could promise, but you don't seem like you're in a big rush. I would probably work on this intermittently over a few months. Also I'll leave space for someone with more experience and/or time to jump in and take charge on this too.

What were you thinking @raphaelvallat -- just port over those R functions into a new module? At a glance, it seems the way they have it set up is that you run each function to generate each plot. Or would you want some kind of condensed rater_comparison_report function? Again, I've only glanced at their pipeline, but it seems like a lot of the heavy lifting would come from functions you've already made in YASA (e.g., sleep_statistics) and pingouin.

Do you have a sample dataset in mind? I suppose the main use you're considering is between the YASA staging and human raters, but since the main idea of the original paper is to compare with actigraphy which might have different epoch sizes, it seems like the pipeline should be built using that (more difficult) use-case to ensure proper handling. I'm sure there's an actigraphy vs PSG dataset out there somewhere, hopefully with more than 1 PSG scorer.
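On the epoch-size point, one simple way to put two hypnograms on a common grid is to repeat each long epoch until it matches the shorter epoch length. This is only a sketch: the function name `upsample_epochs`, the 60-s and 30-s durations and the toy labels are assumptions chosen for illustration.

```python
import numpy as np


def upsample_epochs(hypno, epoch_sec, target_sec):
    """Express a hypnogram scored in `epoch_sec` windows in shorter `target_sec` windows."""
    if epoch_sec % target_sec != 0:
        raise ValueError("epoch_sec must be a multiple of target_sec")
    # Each original epoch is repeated so that total duration is preserved.
    return np.repeat(np.asarray(hypno), epoch_sec // target_sec)


# Hypothetical actigraphy scored in 60-s epochs, aligned to 30-s PSG epochs.
acti_60s = ["W", "S", "S"]
acti_30s = upsample_epochs(acti_60s, epoch_sec=60, target_sec=30)
print(acti_30s.tolist())
```

Epoch lengths that are not integer multiples of each other would need a different strategy (for example, resampling on timestamps), which this sketch deliberately rejects.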

Luca-Menghini commented 2 years ago

Would love this as well! I'm not a Python user, but the @SRI-human-sleep team and I are completely available for any clarification on the R functions and any other aspect of the pipeline. So feel free to write us or set a call!

We might also provide/simulate some datasets if needed, but consider keeping the focus on both binary classifications (e.g., actigraphic scores of sleep/wake) and devices providing 3+ categories (e.g., commercial trackers providing sleep staging), which are becoming increasingly popular.

raphaelvallat commented 2 years ago

Thank you so much for your quick replies @remrama and @Luca-Menghini! And thanks for offering your help @remrama, there is no rush on this so your timeline sounds great — and I'd be happy to help along the way!

@Luca-Menghini if you have some examples of wearable datasets, that would be so helpful. Ideally, we should be flexible and support actigraphy (sleep/wake), wearable (3 or 4 classes) and polysomnography (5 classes).

@remrama I'm actually not sure what the function should look like. I guess my first choice would be a single Python class with various methods, to avoid redundancy in the functions. Something like:

class PerformanceComparison:
    def __init__(self, y_true, y_pred, stages=("WAKE", "NREM", "REM")): ...

    def discrepancy_analysis(self): ...

    def ebe_analysis(self): ...

    def plot_confusion_matrix(self): ...

    def plot_bland_altman(self): ...

    # etc.

where y_true is a pandas.Series or DataFrame with the epoch-by-epoch stages from one or more scorers and y_pred is the predicted sleep stages. The index must be a multi-index (participant, night, ..., epoch) where the last level is always the epoch number (0, 1, ..., n) for the given recording and the first level is always the participant_id.
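To make the proposed input format concrete, here is a minimal sketch of `y_true` and `y_pred` built with a (participant, night, epoch) multi-index, followed by a per-participant agreement and a pooled confusion matrix. The participant IDs, stage sequences and Series names are made up for illustration, and nothing here is an existing YASA API.

```python
import pandas as pd

# Multi-index: participant first, epoch number last, as described above.
index = pd.MultiIndex.from_product(
    [["sub-01", "sub-02"], [1], range(4)],
    names=["participant", "night", "epoch"],
)

# Toy reference and predicted stages (4 epochs per recording).
y_true = pd.Series(
    ["WAKE", "NREM", "NREM", "REM", "WAKE", "WAKE", "NREM", "REM"],
    index=index,
    name="scorer",
)
y_pred = pd.Series(
    ["WAKE", "NREM", "REM", "REM", "WAKE", "NREM", "NREM", "REM"],
    index=index,
    name="device",
)

# Proportion of epochs on which reference and prediction agree, per participant.
agreement = (y_true == y_pred).groupby(level="participant").mean()

# Confusion matrix pooled across all recordings (rows: reference, columns: predicted).
confusion = pd.crosstab(
    y_true.to_numpy(), y_pred.to_numpy(), rownames=["reference"], colnames=["predicted"]
)
print(agreement)
print(confusion)
```

Keeping the participant as the first index level makes per-participant summaries a one-line `groupby`, which is presumably what the class methods would rely on.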

Let me know what you think!

remrama commented 2 years ago

Yep that looks like a great setup @raphaelvallat . Even if you wanted to restructure it later (not expecting that), I think this is a straight-forward way of getting the general output built. I could just work on porting each of @Luca-Menghini 's R functions over into Python/YASA within a single class and then go from there.

remrama commented 2 years ago

Making some progress on this (finally). @Luca-Menghini do you have any relevant datasets you're able to share? It might help to validate this pipeline if we could use the same data you used in the Sleep paper, but of course that might not be possible. If you're able to share a dataset but it needs to stay private, I guess we could email instead, just lmk.

Another ideal-but-maybe-not-possible consideration: For the sake of a notebook tutorial, it'd be best if we could share at least a subset of whatever dataset we end up using.

Let me know what you think, thanks.

remrama commented 2 years ago

For the record -- we had an external meeting about this with the SRI team and made a plan moving forward. Current plan is for SRI to wrap up their own Python implementation of the Menghini paper pipeline and then we'll lead a port of the essentials over into YASA.

remrama commented 1 year ago

I've made some progress here. It's an evaluation module that has two main classes: EpochByEpochEvaluation and SleepStatsEvaluation (the latter being for summary measures like WASO, %N2, etc.). Each class runs a few measures for evaluation, includes some results dataframes, and then has some plotting methods to visualize results. Of course all the measures and plots are modeled after the SRI pipeline.
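For readers unfamiliar with the summary-measure side, the sketch below shows the kind of discrepancy computation such a class might wrap: a Bland-Altman bias and 95% limits of agreement for one sleep statistic. The total sleep time values are invented, and the sketch assumes a constant bias and homoscedastic differences, which a full pipeline would need to check first.

```python
import numpy as np

# Made-up total sleep time (minutes) per night: reference PSG vs. a hypothetical device.
tst_ref = np.array([420.0, 390.0, 455.0, 401.0, 372.0])
tst_dev = np.array([431.0, 402.0, 449.0, 415.0, 380.0])

# Device-minus-reference differences, one per night.
diff = tst_dev - tst_ref

# Bias is the mean difference; limits of agreement are bias +/- 1.96 sample SDs.
bias = diff.mean()
sd = diff.std(ddof=1)
loa_lower, loa_upper = bias - 1.96 * sd, bias + 1.96 * sd

print(f"bias = {bias:.1f} min, LOA = [{loa_lower:.1f}, {loa_upper:.1f}] min")
```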

Note that I don't have the tutorial focused on wearable devices or actigraphy per se, but everything generalizes very easily. I added a simple function that will convert PSG-based sleep stats to wearable-based sleep stats (i.e., groups N1+N2 into "Light" sleep and renames N3 to "Deep" sleep). Add that step and then it's all the same.
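A per-epoch version of that regrouping could look like the sketch below. The function name `to_wearable_stages` and the labels kept for wake and REM are assumptions for this example; only the N1+N2 to "Light" and N3 to "Deep" grouping comes from the description above.

```python
import pandas as pd

# Assumed mapping: N1 and N2 are merged into "Light", N3 is renamed "Deep".
PSG_TO_WEARABLE = {"W": "W", "N1": "Light", "N2": "Light", "N3": "Deep", "R": "R"}


def to_wearable_stages(hypno: pd.Series) -> pd.Series:
    """Collapse a 5-stage PSG hypnogram into 4 wearable-style stages."""
    unknown = set(hypno.unique()) - set(PSG_TO_WEARABLE)
    if unknown:
        raise ValueError(f"Unexpected stage labels: {sorted(unknown)}")
    return hypno.map(PSG_TO_WEARABLE)


psg = pd.Series(["W", "N1", "N2", "N3", "N2", "R"])
print(to_wearable_stages(psg).tolist())
```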

I think there are more features that could be added, but this might be enough output for a first merge with YASA. The code and documentation need to be cleaned up, but I'm wondering if you think the current structure and output are good for a future pull request. If so, I'll start cleaning it up before submitting the request. If not, maybe let me know what you think should be added before a first formal merge. This notebook on my fork gives a rundown of current features.

raphaelvallat commented 1 year ago

@remrama this is really great work! I just had a look at the notebook and it's a great direction. Loved the random hypnogram generation :D

Instead of the new function to convert PSG-based sleep stats to wearables, I would edit the sleepstats function to work natively with 2, 3, 4 or 5 classes. In this function, as well as in the two new classes that you proposed, there should be a parameter to indicate whether the data comes from a 2-, 3-, 4- or 5-stage scoring. This would then determine the behavior of all the underlying methods and outputs. For example, the tick labels would be set automatically in the plotting functions.

Such flexibility would, however, require a strict input format for the hypnogram. I would suggest the following accepted values:

2 classes: "S", "W"
3 classes: "W", "R", "NREM"
4 classes: "W", "LIGHT", "DEEP", "R"
5 classes: "W", "N1", "N2", "N3", "R"
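A strict input check along these lines might be sketched as follows. The names `ACCEPTED_STAGES` and `check_hypno`, the `n_stages` parameter and the ordering of labels within each tuple are hypothetical; only the accepted values themselves come from the list above.

```python
# Accepted labels per scoring scheme; the order within each tuple is an assumption
# that a plotting method could reuse for tick labels.
ACCEPTED_STAGES = {
    2: ("W", "S"),
    3: ("W", "NREM", "R"),
    4: ("W", "LIGHT", "DEEP", "R"),
    5: ("W", "N1", "N2", "N3", "R"),
}


def check_hypno(hypno, n_stages):
    """Validate hypnogram labels against the chosen scheme and return its labels."""
    if n_stages not in ACCEPTED_STAGES:
        raise ValueError(f"n_stages must be one of {sorted(ACCEPTED_STAGES)}")
    invalid = sorted(set(hypno) - set(ACCEPTED_STAGES[n_stages]))
    if invalid:
        raise ValueError(f"Invalid labels for {n_stages}-stage scoring: {invalid}")
    return ACCEPTED_STAGES[n_stages]


labels = check_hypno(["W", "N1", "N2", "N2", "R"], n_stages=5)
print(labels)
```

Returning the label tuple lets downstream code derive tick labels from the same single source that validates the input.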

remrama commented 1 year ago

That's a great idea. I'll switch to that, clean up the code (black formatting, etc.), and then add a few other small features I've been thinking about and reach back out.

> Loved the random hypnogram generation

Ya same I was happy with that :) It's beyond the scope of what I can do right now, but at some point I think a more advanced version of that -- like one that takes all the sleep_statistics dictionary output into account -- would be a cool addition to YASA. It seems like it would fit nicely as another utility in the hypno.py module.