paulbrodersen / somnotate

Automated polysomnography for experimental animal research
GNU General Public License v3.0

fix 'default_view_length' input parameter #5

Closed pantelisantonoudiou closed 3 years ago

paulbrodersen commented 3 years ago

Thank you very much!

paulbrodersen commented 3 years ago

Since you are here, can I ask how you found my repository?

Also, if you have any suggestions for improvements or other feedback, do let me know.

pantelisantonoudiou commented 3 years ago

Hello Paul,

Thank you for your email and for developing the toolbox. I did my PhD in Oxford with Martin Kahn and he introduced me to somnotate, as we are also interested in sleep scoring here in the Maguire lab. Martin told me that somnotate worked great for him. I am looking forward to testing the model on our data.

I just got the pipelines to work yesterday so I only have some feedback on getting the toolbox to work.

1) I think it will help the user if there is an example spreadsheet file and maybe the folder structure that it's based on. Martin gave me his spreadsheet file and that helped me a lot.

2) Also we used sirenia for sleep scoring, so I had to write my own script for conversion to the visbrain format. I think it might be easier to provide an example file with a description of the visbrain hypnogram. It could be even simpler if the file format were more generic, for example removing the first two header lines.

3) I had to move the pipeline scripts to the same level as somnotate to get them to work (out of the example pipeline folder), but it could be that I was doing something wrong.

If the model works with our data, I am planning to develop pipelines that may be a bit easier to use for people who don't code. I have created a branch for that purpose and was planning to email you once it is complete to ask where to place the code.

I also wanted to ask you about the model training since I have never trained HMMs. For example, deep neural nets need lots of data to be trained. Are bigger datasets better for HMM training or does the performance saturate fast? I was planning to use 24-48 hour files for training the model for one animal and then getting predictions for the rest of the recordings of that animal. Have you tried training a model across a group of animals and then testing it on another group?

Thanks, Pantelis


paulbrodersen commented 3 years ago

Hi Pantelis,

Thank you for the great feedback! I will try to address your points as best I can.

1) I think it will help the user if there is an example spreadsheet file and maybe the folder structure that it's based on. Martin gave me his spreadsheet file and that helped me a lot.

Completely agree. Once we have submitted the manuscript (it was a long journey for reasons unrelated to the science), the somnotate readme will link to an example data set (the data set that most of the paper will be based on) including a spreadsheet, etc. The folder structure shouldn't matter, as long as the paths in the spreadsheet all point to the right places.

2) Also we used sirenia for sleep scoring so I had to write my own script for conversion to visbrain format.

Yeah, the data I/O is still very basic and not very user friendly. However, I have tried to make it very easy to change the I/O functions. If you have a look at the end of the data_io.py file, I define a list of aliases.

# aliases
load_dataframe = pandas.read_csv
load_raw_signals = _load_edf_file
load_preprocessed_signals = np.load
export_preprocessed_signals = np.save
load_hypnogram = _load_visbrain_hypnogram
export_hypnogram = _export_visbrain_hypnogram

Everywhere else in the pipeline, only those alias names are then used, not the functions themselves. The idea is that you can then easily add a function to data_io.py, say

def load_sirenia_hypnogram(filepath, *args, **kwargs):
    ...
    return states, intervals

and then only change the alias to point to the new function instead.

load_hypnogram = load_sirenia_hypnogram
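
For illustration only (I haven't seen a Sirenia export myself, so the assumed CSV layout with "state" and "duration" columns is pure guesswork), such a loader could look roughly like this:

def load_sirenia_hypnogram(filepath, *args, **kwargs):
    # hypothetical format: a CSV with one row per epoch, a "state" column holding
    # the vigilance state label, and a "duration" column with the epoch length in seconds
    df = pandas.read_csv(filepath)  # pandas is already imported in data_io.py
    states = df['state'].tolist()
    # convert per-epoch durations to (start, stop) intervals in seconds
    stops = df['duration'].cumsum().to_numpy()
    starts = stops - df['duration'].to_numpy()
    intervals = list(zip(starts, stops))
    return states, intervals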

I tried to explain this idea in the Readme under "Customization" but maybe that section needs work and/or should not be at the very end of a very long Readme.

I think it might be easier to provide an example file with a description of the visbrain hypnogram.

There definitely will be example hypnograms in the linked data set. I just can't release that data until the publication is up.

It could be even simpler if the file format is more generic, for example removing the first two header lines.

I am not a fan of creating new file formats, for reasons best explained in this xkcd: https://xkcd.com/927/.

Overall, I agree that the current state of data I/O customization is not satisfactory. So far, I have been trying to find out what file formats are actually used in the wild. If the list of formats is not too long (and so far it isn't), and the formats are very rigidly defined and adhered to, then it makes sense to ship a few different input functions with somnotate. The idea is that the user then chooses the correct function using command line flags, the config file, or potentially a GUI some day.

However, not all hypnogram formats are sufficiently well defined. For example, Martin & Co. used SleepSign, which doesn't have a "canonical" output format; instead the output depends on a whole host of flags the user can set. So somnotate will probably never cater to everybody's favourite file format.

On that note, if you think the output from sirenia is predictable and stable, and you want to make another PR against data_io.py with a function that loads sirenia hypnogram files, then I would be more than happy to include it. Also, if you could send me one or two example hypnograms made by sirenia, that would be great.
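Just to make the "choose the input function via a command line flag" idea above concrete, here is a rough sketch of how that could look (none of this exists in somnotate today; load_sirenia_hypnogram is the hypothetical loader sketched further up, and _load_visbrain_hypnogram is the existing visbrain loader in data_io.py):

import argparse

hypnogram_loaders = {
    'visbrain': _load_visbrain_hypnogram,  # existing loader in data_io.py
    'sirenia': load_sirenia_hypnogram,     # hypothetical addition
}

parser = argparse.ArgumentParser()
parser.add_argument('--hypnogram-format', choices=hypnogram_loaders, default='visbrain')
args, _ = parser.parse_known_args()
load_hypnogram = hypnogram_loaders[args.hypnogram_format]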

3) I had to move the pipeline scripts to the same level as somnotate to get them to work (out of the example pipeline folder) but it could be that I was doing something wrong.

That should not be necessary. What operating system are you using, which Python version are you running, and which folder are you in when you call the pipeline scripts?

I also wanted to ask you about the model training since I have never trained HMMs. For example, deep neural nets need lots of data to be trained. Are bigger datasets better for HMM training or does the performance saturate fast?

HMMs need orders of magnitude less data than neural networks. That is one of the main reasons I am using an HMM instead of a recurrent neural network (which would have been more fun to code, to be honest).

I was planning to use 24-48 hour files for training the model for one animal and then getting predictions for the rest of the recordings of that animal. Have you tried training a model across a group of animals and then testing it on another group?

I have trialed this exact approach (train on a single 24 hour data set, test on another set of 24 hours of data from the same animal). What is simpler, and ends up working a little bit better than the paired approach, is to simply train on a bunch of different animals with varying data quality (at least 5, though you may see ever so slight improvements with up to 20 animals), and then use that model on everything else. The more variation there is in your training data sets, the better.

[Figure: performance versus total number of training data sets — https://user-images.githubusercontent.com/8046146/124009168-2f508680-d9d5-11eb-9b07-47ee717bb1eb.png]

[The line corresponds to the median accuracy according to a single manual annotation (the accuracy is better when multiple people annotate the same file). Error bars correspond to the 5th and 95th percentile.]

The reason is that the model includes a feature extraction step using linear discriminant analysis. If some feature is not a reliable indicator of any of the vigilance states, then the model learns not to trust it (very much). This makes the model surprisingly immune to a lot of variation in the input signals, in particular if you include some "abnormal" data sets in the training set. As long as the electrodes are roughly in the same place, this "train once on everything you have annotated so far, then forget" approach works extremely well. If you come across an instance where it doesn't, please do let me know. I would love to find data sets that break the model.
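If it helps to see the general shape of the approach, here is a minimal, illustrative sketch of supervised LDA feature extraction followed by HMM decoding, using scikit-learn and hmmlearn with random placeholder data; it is not somnotate's actual implementation, just the gist of the idea:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from hmmlearn.hmm import GaussianHMM

# placeholder data: one feature vector per epoch, plus manual state labels
# (e.g. 0 = awake, 1 = non-REM, 2 = REM) for the training recordings
features_train = np.random.randn(1000, 20)
labels_train = np.random.randint(0, 3, size=1000)

# supervised feature extraction: project onto the axes that best separate the states
lda = LinearDiscriminantAnalysis(n_components=2).fit(features_train, labels_train)
projected_train = lda.transform(features_train)

# fit an HMM on the projected features; note that hmmlearn fits this unsupervised,
# whereas somnotate derives its HMM parameters from the manual annotations
hmm = GaussianHMM(n_components=3, covariance_type='full', n_iter=50)
hmm.fit(projected_train)

# annotate a new recording: project, then Viterbi-decode the most likely state sequence
features_new = np.random.randn(500, 20)
predicted_states = hmm.predict(lda.transform(features_new))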

I hope this addresses everything in some form. If not, do let me know. I am off to a dinner now but will check in again tomorrow.

pantelisantonoudiou commented 3 years ago

Hi Paul,

Thank you very much for your comprehensive reply. I definitely agree with you regarding data formats and I understand the difficulty of generalizing the scripts. I had the same problem with other data, and it was impossible to find one good format. I am sure that when people are interested in the model, they can either write their own code or reach out to you for help getting their data into the right format. I think as long as you provide example files with instructions, that should be sufficient. It was for me, at least.

I am running Windows with Python 3.8. I cd'ed to the somnotate parent path from the command line and ran "python example_pipeline/01_preprocess_signals.py /path/to/spreadsheet_A.csv", which gave:

Traceback (most recent call last):
  File "example_pipeline\06_compare_state_annotations.py", line 14, in <module>
    from somnotate._utils import convert_state_intervals_to_state_vector, _get_intervals
ModuleNotFoundError: No module named 'somnotate'

I haven't seen enough sirenia hypnograms yet, but once I establish the pipelines I can send you example hypnograms and the script used to read them into the visbrain format. I will definitely update you on the model's detection performance, and I am happy to provide you with details and metrics on our data (we typically use LFP, EEG, and EMG).

It's great that the model generalizes across animals! Best of luck with the paper, and I will hopefully be in touch soon.

All the best, Pantelis


paulbrodersen commented 3 years ago

I am running Windows with Python 3.8. I cd'ed to the somnotate parent path from the command line and ran "python example_pipeline/01_preprocess_signals.py /path/to/spreadsheet_A.csv"

That is exactly what I do on my machine... I am on Python 3.7, but the last time the Python import behaviour changed was from Python 2 to Python 3, so I don't think that is it. I will check if Windows is doing something funky to the PYTHONPATH (I am on Linux). Speaking of PYTHONPATH, are you using anaconda (or virtual environments in general)?

I am in submission hell at the moment, and since you have found a workaround I won't drop everything else to fix this. But once things calm down, I will take a closer look. Might be a little while though.
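
In the meantime, one possible stopgap (untested on my end) would be to prepend the repository root to the module search path at the top of each pipeline script, so that the somnotate imports resolve regardless of the working directory:

import os
import sys
# the pipeline scripts live in example_pipeline/, one level below the repository root
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from somnotate._utils import convert_state_intervals_to_state_vector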

pantelisantonoudiou commented 3 years ago

Hi Paul,

I use either miniconda or pipenv depending on the project (in this case, miniconda). Don't worry about the path issue, I just wanted to bring it to your attention.

Best, Pantelis
