BIDSification of Winkler et al. dataset and other datasets for training/validation of ICA labeling models

adam2392 commented 2 years ago

The dataset referenced in https://github.com/agramfort/artifact-learn/issues/1#issuecomment-906141483 has good and bad components labeled after ICA.

To facilitate easy training/testing, it would be good to write a BIDSification script that converts the dataset into BIDS format using mne-bids.

cc: @jacobf18 @mscheltienne @anandsaini024

adam2392 commented 2 years ago

For that matter, we should BIDSify as many datasets as possible, organized semantically according to the criteria specified in https://github.com/jacobf18/iclabel-python/issues/5#issuecomment-1076431272 (i.e. different recording montages, hardware, etc.)

mscheltienne commented 2 years ago

I can have a look at the bidsification of this dataset, as I will have to do the ANT and EGI ones anyway. I have never used mne-bids, but it has been on my to-do list for a long time 😄

adam2392 commented 2 years ago

Some things to think about:

upper bound on the number of seconds included in an ICA decomposition (e.g. a minute)
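
One way to enforce such a cap is to compute the crop window before fitting. This is only a sketch: the helper name `crop_window` and the 60 s default are assumptions for illustration, not something agreed on in this thread. The resulting bounds could then be passed to `raw.crop(tmin=tmin, tmax=tmax)` in MNE-Python.

```python
def crop_window(duration, max_seconds=60.0, start=0.0):
    """Return (tmin, tmax) in seconds covering at most max_seconds of a recording.

    duration is the total recording length in seconds. The name and the
    60 s default are hypothetical placeholders.
    """
    if duration <= 0 or max_seconds <= 0:
        raise ValueError("duration and max_seconds must be positive")
    if not 0 <= start < duration:
        raise ValueError("start must lie inside the recording")
    # Never run past the end of the recording.
    tmax = min(start + max_seconds, duration)
    return start, tmax


# A 4-minute recording capped at one minute.
tmin, tmax = crop_window(240.0)
print(tmin, tmax)
```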

mscheltienne commented 2 years ago

For the raw EEG datasets that can be used for benchmarking, I have:

ANT Neuro: I included 19 recordings of 4 minutes for now, and I can add more later.
EGI: Still working on bidsification.

It would be good to centralize those datasets, any idea where? The ANT Neuro one is small (less than 1 GB) but the EGI one is above 30 GB. I will crop the files during bidsification.

Additional points to keep in mind for the processing of those datasets into labelled ICs:

BP filter applied before ICA. (1, 100) Hz? I think that's what is used to train ICLabel.
Number of components included in the ICA decomposition
ICA algorithm used
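
Since these three choices affect every labelled IC, it may help to store them next to each decomposition. Below is a minimal sketch using only the standard library. The class name `ICAConfig` and the default values are assumptions: (1, 100) Hz is the guess made above, 30 components matches the Winkler et al. files, and extended infomax is a placeholder rather than a decision made in this thread.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class ICAConfig:
    """Hypothetical record of the preprocessing used before labelling ICs."""

    l_freq: float = 1.0        # high-pass edge in Hz (assumed)
    h_freq: float = 100.0      # low-pass edge in Hz (assumed)
    n_components: int = 30     # number of ICs kept (assumed)
    method: str = "infomax"    # ICA algorithm (placeholder)
    fit_params: dict = field(default_factory=lambda: {"extended": True})

    def __post_init__(self):
        if not 0 < self.l_freq < self.h_freq:
            raise ValueError("expected 0 < l_freq < h_freq")
        if self.n_components < 1:
            raise ValueError("n_components must be at least 1")

    def to_json(self):
        """Serialize the configuration, e.g. for a JSON sidecar file."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


config = ICAConfig()
print(config.to_json())
```

Keeping the record immutable (`frozen=True`) makes it harder to change a parameter by accident after the decomposition has been fitted.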

For Winkler et al., I looked quickly and the ICA decomposition seems to use the algorithm from https://www.jmlr.org/papers/volume5/ziehe04a/ziehe04a.pdf

I'm not familiar with this algorithm, so I have to go over the paper to figure out what A_ffdiag and W_ffdiag represent. Each file has the following attributes:

A_ffdiag (n x 30), W_ffdiag (30 x n) with n around 120-ish
cnt: I think that's information about the raw data
filename: oddball is the only keyword I recognize in file names like 'oddball_fasor_B3_K2_VPii'
goodcomp: either (1, k) or (k, ) (with k up to 30) containing the indices of the good components. 0-indexed or 1-indexed?
mnt: dunno
nComps: dunno.. int from 30 to 810?

So we basically have 30 labelled ICs per file.
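
Both `goodcomp` shapes can be normalized into one boolean mask per file. The sketch below assumes NumPy is available; the helper name `good_components_mask` is hypothetical. The `one_based=True` default is an assumption, since the indexing convention is still an open question above, so the range check is there to catch a wrong guess.

```python
import numpy as np


def good_components_mask(goodcomp, n_components=30, one_based=True):
    """Return a boolean mask of length n_components, True for good ICs.

    Accepts indices shaped (1, k) or (k,). one_based=True is an ASSUMPTION
    about the files, not something confirmed in this thread.
    """
    idx = np.asarray(goodcomp, dtype=int).ravel()  # (1, k) and (k,) both become (k,)
    if one_based:
        idx = idx - 1
    if idx.size and (idx.min() < 0 or idx.max() >= n_components):
        raise ValueError("index out of range; check the indexing convention")
    mask = np.zeros(n_components, dtype=bool)
    mask[idx] = True
    return mask


row = good_components_mask(np.array([[1, 2, 5]]))   # shape (1, k)
flat = good_components_mask(np.array([1, 2, 5]))    # shape (k,)
print(np.array_equal(row, flat), np.flatnonzero(row))
```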

But anyway, for bidsification, I looked into the specification derivatives for electrophysiological data, and the only reference I find is https://docs.google.com/document/d/1PmcVs7vg7Th-cGC-UrX8rAhKUHIzOI-uIOh69_mvdlw/edit#heading=h.f548zgpgxhiu so it's still at an extension proposal stage? I'll follow that for now, but if you know of another specification for ICA, I'm looking for it.

adam2392 commented 2 years ago

It would be good to centralize those datasets, any idea where? The ANT Neuro one is small (less than 1 GB) but the EGI one is above 30 GB. I will crop the files during bidsification.

Perhaps for now, we can share via OneDrive or Dropbox? I still have access to OneDrive via my institution and can set that up if you guys don't have Dropbox Pro.

Open to other ideas too.

I think long-term we want to store it on openneuro.org if that's okay? We might actually be able to leverage openneuro.org right away if you're okay with it. We can store and create private BIDSified datasets that we can then pull from even. I think programmatic access would require the dataset to be public(?)

But anyway, for bidsification, I looked into the specification derivatives for electrophysiological data, and the only reference I find is https://docs.google.com/document/d/1PmcVs7vg7Th-cGC-UrX8rAhKUHIzOI-uIOh69_mvdlw/edit#heading=h.f548zgpgxhiu so it's still at an extension proposal stage? I'll follow that for now, but if you know of another specification for ICA, I'm looking for it.

Yeah there is no agreed-upon spec yet for ICA, but I think we should just follow the "format" that is suggested for derivatives and ICA. E.g. filenaming and directory structure at the very minimum. We can store files in the ICA format output by MNE-Python.

mscheltienne commented 2 years ago

Good idea for openneuro.org, I'll check if we can make those datasets public. Else OneDrive is a good option. Sounds reasonable for the ICA, let's go with FIFF then, I'll try to convert those .mat files.

adam2392 commented 2 years ago

Other datasets: https://github.com/agramfort/artifact-learn/tree/master/1-%20extract%20basic%20info%20from%20databases

looks like we can probably download online?

adam2392 commented 2 years ago

For more data and inspiration, we can look at this review paper: https://iopscience.iop.org/article/10.1088/1741-2560/12/3/031001/pdf

adam2392 commented 2 years ago

For references: mara -> BIDS: https://gist.github.com/mscheltienne/680f46336aec8a0408f30952c3d72e8d

ANT to BIDS: https://gist.github.com/mscheltienne/fe3dcc7dafef7539018a6a00ba73afed

adam2392 commented 2 years ago

For now, the scripts are centralized in one repo: https://github.com/adam2392/improve_icalabel

adam2392 commented 2 years ago

Some open questions that we can defer to later of course:

how do we annotate components without the raw data?
the issue in mne-python is that it requires an inst in order to apply the fitted ICA to get the estimated "sources". We need to probably hack an API for interfacing with the stored ICA time-series. Perhaps we just use them as RawArray?
how to run the benchmarks on these two settings?

@anandsaini024 do you have any existing code for building out benchmark models that you want to push up to the improve-icalabel repo?

Anything that you are able to work on while we sort out the GUI and pipeline for annotating the data?

mscheltienne commented 2 years ago

+1 for the IC time-series as RawArray, with an extension e.g. '*-sources-raw.fif' for the MARA dataset.

In this convert function:

https://github.com/adam2392/improve_icalabel/blob/96522dacd045a5caa50f7f4653d9fd988a29bfa1/mnestudy/ica_to_bids/mara.py#L35-L44

For each iteration, the missing steps are to save the ICA object (ica) with the extension -ica.fif, save the IC time-series (sources) as a RawArray, and save the good/bad components in a sidecar with the annotation function you added, all at the correct BIDS path.
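
For the sidecar step, the sketch below shows one possible layout as a tab-separated table. It is a stand-in, not the annotation function from the repo: the function name `write_component_labels` and the `component` / `status` columns are assumptions, and the indices are taken to be 0-based.

```python
import csv
import io


def write_component_labels(good, n_components, fileobj):
    """Write one row per IC with a good/bad status to a TSV file object.

    Hypothetical layout; `good` holds 0-based indices of the good ICs.
    """
    good = set(good)
    writer = csv.writer(fileobj, delimiter="\t", lineterminator="\n")
    writer.writerow(["component", "status"])
    for k in range(n_components):
        writer.writerow([k, "good" if k in good else "bad"])


# Demo with an in-memory buffer; a real script would open a file instead.
buffer = io.StringIO()
write_component_labels(good=[0, 2], n_components=4, fileobj=buffer)
print(buffer.getvalue())
```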

@anandsaini024 This is what I briefly described to you this evening. It would be great if you could finish this conversion function.

anandsaini024 commented 2 years ago

Alright, I will pick this up.

adam2392 commented 2 years ago

@anandsaini024 have you preprocessed the ANT dataset already? I am going to use one of the subjects as a test subject for the hs student to QA his annotations.

If you did, do you mind pushing up the script to mnestudy/ica_to_bids/ants.py or somewhere there?

mscheltienne commented 2 years ago

@adam2392 You should have received something from openneuro on your Gmail for dataset ds004178. It contains the ANT Neuro raw files and the preprocessed (pp) files with their ICA decomposition. It is however an automatic pipeline, so please have a look at the preprocessed data when you load it. Also, I did not check when I picked up those files, but maybe some of them are bad recordings with a lot of bridged electrodes (it does happen from time to time). If this is the case, then the recording has to be excluded and I can provide a different one instead.

Note: I deleted the old dataset with only raw data and replaced it with this one.. I did not figure out how to easily update the existing dataset 🤯

adam2392 commented 2 years ago

Can you share with me again?

Yeah, updating is a pain. Adding files is easy, deleting files is kind of a pain, and modifying files is a super pain.

Then my plan for the hs student is to:

determine bad electrodes in the ANT raw dataset
run ICA (with my help)
determine ICA components
maybe do some benchmarking using existing sklearn classifiers (with my help)

It seems we might not be able to get him to fully annotate the ICA components as desired, but hopefully we can get at least some of the raw annotated.

mscheltienne commented 2 years ago

So.. the dataset does not appear even on my account.. unless I explicitly enter the corresponding URL. Indeed, you were no longer listed in the "share" tab.. I've sent the share invite again. Let me know..

adam2392 commented 2 years ago

I unfortunately did not. Perhaps it just didn't finish uploading yet?

Openneuro even tho it's "nice" seems pretty buggy -__-

mscheltienne commented 2 years ago

Yep.. I had multiple issues with it recently. It did finish uploading for sure (I was very careful about that :p).. I'll upload it to dropbox or GoogleDrive this evening and I'll share it on your Gmail account.

adam2392 commented 2 years ago

Oh I see it now on openneuro :p

mscheltienne commented 2 years ago

Same, it finally popped up on my account.. So just to be clear: the proc-raw files are the raw data as it comes out of the amplifier, and the proc-pp and proc-ica files are what comes out of https://github.com/adam2392/improve_icalabel/tree/master/mnestudy/raw_to_bids

All 3 files are actually generated from a non-BIDS-compliant dataset with this script: https://github.com/adam2392/improve_icalabel/blob/master/mnestudy/raw_to_bids/ant.py

adam2392 commented 2 years ago

Agenda for tomorrow so I don't forget:

sharing of the preprocessed MARA dataset.
ICLabel existing dataset?
GUI
preprocessing workflow for ANTs

Outcome action items:

-> @anandsaini024 to share on Dropbox?
Let's not use the ICLabel dataset because it's not very well-coded. We agreed to just use it for the port of ICLabel.
GUI @mscheltienne can possibly get this done by the time Adam comes back July 10thish.
So we want Aaron (@ayoun25) to double-check the proc-pp dataset to determine whether the automated pipeline marked bad/good electrodes correctly. Note: if any of the datasets has more than about 10 bridged electrodes, notify @mscheltienne.
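
The bridged-electrode rule in the last item could be checked automatically once the bridged pairs are known. This is a sketch under assumptions: the function name `too_many_bridged` is hypothetical, the threshold of 10 is the rough figure from the note above, and the pairs are supplied by whatever detection step is used.

```python
def too_many_bridged(bridged_pairs, max_bridged=10):
    """Return (flag, electrodes) for a list of bridged electrode pairs.

    flag is True when more than max_bridged distinct electrodes take part
    in at least one bridge. The threshold is an approximate placeholder.
    """
    electrodes = {ch for pair in bridged_pairs for ch in pair}
    return len(electrodes) > max_bridged, sorted(electrodes)


# Hypothetical pairs: Fp2 is bridged with two neighbours.
pairs = [("Fp1", "Fp2"), ("Fp2", "AF7"), ("O1", "O2")]
flag, electrodes = too_many_bridged(pairs)
print(flag, electrodes)
```

Counting distinct electrodes rather than pairs avoids double counting a channel that is bridged with several neighbours.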

chmendoza commented 1 year ago

Hello,

Thank you for mne-iclabel! I would like to test a new IC feature extracted from the time series in training a multi-class IC classifier (ideally, more than 2 types of ICs).

Is the preprocessed MARA dataset publicly available?
Is there any other dataset you would recommend that has more than two IC classes and has expert-annotated labels?
If there is no such dataset, which standard EEG datasets would you recommend to run ICLabel on it and use that as noisy labels to train my classifier?

Thank you!! 😃

jacobf18 commented 1 year ago

@chmendoza As far as I know, there is no large publicly available dataset for IC classification. We were working on processing a dataset (referenced above) to test the IC classification. The feature/label dataset for ICLabel is available, but not the original ICs. That dataset is available here: https://github.com/lucapton/ICLabel-Dataset

adam2392 commented 1 year ago

Thank you for mne-iclabel! I would like to test a new IC feature extracted from the time series in training a multi-class IC classifier (ideally, more than 2 types of ICs).

@chmendoza Feel free to make a separate GH issue/PR, if/when your model is ready for review. We would love to include this into MNE-ICALabel to propagate it to the MNE community.

mne-tools / mne-icalabel

BIDSification of Winkler et al. dataset and other datasets for training/validation of ICA labeling models #8