Open adam2392 opened 2 years ago
For that matter, bidsification of as many datasets as possible, but organized semantically according to the criterion that was specified in https://github.com/jacobf18/iclabel-python/issues/5#issuecomment-1076431272 (i.e. different recording montages, hardware, etc.)
I can have a look at the bidsification of this dataset, as I also will have to do the ANT and EGI ones as well anyway. I never used mne-bids yet, but that was on my to-do list for a long time 😄
Some things to think about:
For the raw EEG datasets that can be used for benchmarking, I have:
It would be good to centralize those datasets, any idea where? The ANT Neuro one is small (less than 1 Gb) but the EGI one is above 30 Gb. I will crop the files during bidsification.
Additional points to keep in mind for the processing of those datasets into labelled ICs:
- (1, 100) Hz? I think that's what is used to train ICLabel.
For Winkler et al., I looked quickly and the ICA decomposition seems to use: https://www.jmlr.org/papers/volume5/ziehe04a/ziehe04a.pdf
I'm not familiar with this algo, so I have to go over the paper to figure out what `A_ffdiag` and `W_ffdiag` represent.
Each file has the following attributes:
- `A_ffdiag (n x 30)`, `W_ffdiag (30 x 3)` with `n` around 120-ish
- `cnt`: I think that's information about the raw data
- `filename`: `oddball` is the only keyword I recognize in file names like 'oddball_fasor_B3_K2_VPii'
- `goodcomp`: either `(1, k)` or `(k, )` (with `k` up to 30) containing the indices of the good components. 0-index or 1-index?
- `mnt`: dunno
- `nComps`: dunno.. int from 30 to 810?
We basically have 30 IC / file labeled.
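The open question about `goodcomp` (shape `(1, k)` or `(k, )`, 0- or 1-based) can be handled defensively at load time. Below is a minimal sketch, assuming the indices are 1-based because the files come from MATLAB; that is an assumption and is not confirmed anywhere in this thread. The helper names `normalize_goodcomp` and `component_labels`, and the default of 30 components, are hypothetical. NumPy arrays would need `.tolist()` before being passed in.

```python
def normalize_goodcomp(goodcomp, n_components=30, one_based=True):
    """Flatten a (1, k) or (k,) container into sorted, unique 0-based indices."""

    def _flatten(item):
        # Recurse through nested lists/tuples; leaves are cast to int
        # (MATLAB exports often store indices as floats such as 3.0).
        if isinstance(item, (list, tuple)):
            for sub in item:
                yield from _flatten(sub)
        else:
            yield int(item)

    offset = 1 if one_based else 0  # ASSUMPTION: 1-based unless told otherwise
    indices = sorted({idx - offset for idx in _flatten(goodcomp)})
    if indices and not (indices[0] >= 0 and indices[-1] < n_components):
        raise ValueError(
            f"indices {indices} fall outside 0..{n_components - 1}; "
            "check whether the file is 0- or 1-based"
        )
    return indices


def component_labels(goodcomp, n_components=30, one_based=True):
    """Return one 'good'/'bad' label per component, in component order."""
    good = set(normalize_goodcomp(goodcomp, n_components, one_based))
    return ["good" if idx in good else "bad" for idx in range(n_components)]
```

A `0` in a file that is supposed to be 1-based raises a `ValueError`, which is one cheap way to find out which convention a given file actually uses.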
But anyway, for bidsification, I looked into the specification derivatives for electrophysiological data, and the only reference I find is https://docs.google.com/document/d/1PmcVs7vg7Th-cGC-UrX8rAhKUHIzOI-uIOh69_mvdlw/edit#heading=h.f548zgpgxhiu so it's still at an extension proposal stage? I'll follow that for now, but if you know of another specification for ICA, I'm looking for it.
> It would be good to centralize those datasets, any idea where? The ANT Neuro one is small (less than 1 Gb) but the EGI one is above 30 Gb. I will crop the files during bidsification.
Perhaps for now, we can share via OneDrive or Dropbox? I still have access to OneDrive via my institution and can set that up if you guys don't have Dropbox Pro.
Open to other ideas too.
I think long-term we want to store it on openneuro.org if that's okay? We might actually be able to leverage openneuro.org right away if you're okay with it. We can store and create private BIDSified datasets that we can then pull from even. I think programmatic access would require the dataset to be public(?)
> But anyway, for bidsification, I looked into the specification derivatives for electrophysiological data, and the only reference I find is https://docs.google.com/document/d/1PmcVs7vg7Th-cGC-UrX8rAhKUHIzOI-uIOh69_mvdlw/edit#heading=h.f548zgpgxhiu so it's still at an extension proposal stage? I'll follow that for now, but if you know of another specification for ICA, I'm looking for it.
Yeah there is no agreed-upon spec yet for ICA, but I think we should just follow the "format" that is suggested for derivatives and ICA. E.g. filenaming and directory structure at the very minimum. We can store files in the ICA format output by MNE-Python.
Good idea for `openneuro.org`, I'll check if we can make those datasets public. Else OneDrive is a good option.
Sounds reasonable for the ICA, let's go with FIFF then, I'll try to convert those .mat files.
Other datasets: https://github.com/agramfort/artifact-learn/tree/master/1-%20extract%20basic%20info%20from%20databases looks like we can probably download online?
For more data and inspiration, we can look at this review paper: https://iopscience.iop.org/article/10.1088/1741-2560/12/3/031001/pdf
For references: mara -> BIDS: https://gist.github.com/mscheltienne/680f46336aec8a0408f30952c3d72e8d
ANT to BIDS: https://gist.github.com/mscheltienne/fe3dcc7dafef7539018a6a00ba73afed
See https://github.com/adam2392/improve_icalabel: for now, the scripts are centralized into one repo.
Some open questions that we can defer to later of course:
- `inst` in order to apply the fitted ICA to get the estimated "sources". We need to probably hack an API for interfacing with the stored ICA time-series. Perhaps we just use them as `RawArray`?
@anandsaini024 do you have any existing code for building out benchmark models that you want to push up to the `improve-icalabel` repo?
Anything that you are able to work on while we sort out the GUI and pipeline for annotating the data?
+1 for the IC time-series as RawArray, with an extension e.g. `'*-sources-raw.fif'` for the MARA dataset.
In this convert function:
https://github.com/adam2392/improve_icalabel/blob/96522dacd045a5caa50f7f4653d9fd988a29bfa1/mnestudy/ica_to_bids/mara.py#L35-L44
For each iteration, the missing steps are to save the ICA `ica` with an extension `-ica.fif`, save the IC time-series `sources` as a RawArray, save the good/bad components in a sidecar with the annotation function you added (all that at the correct BIDS Path).
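The bookkeeping for those three outputs can be sketched as follows. This covers only the file naming and the good/bad sidecar, not the MNE calls themselves; the directory layout, the `_components.tsv` sidecar name, its column names, and both helper names are assumptions made for illustration and are not taken from the repo or from any agreed-upon BIDS specification.

```python
import csv
from pathlib import Path


def derivative_paths(root, subject, task):
    """Return hypothetical output paths for one recording's ICA derivatives."""
    stem = f"sub-{subject}_task-{task}"
    # ASSUMPTION: derivatives/ica/sub-XX/eeg/ is an illustrative layout only.
    folder = Path(root) / "derivatives" / "ica" / f"sub-{subject}" / "eeg"
    return {
        "ica": folder / f"{stem}-ica.fif",              # fitted ICA object
        "sources": folder / f"{stem}-sources-raw.fif",  # IC time-series
        "sidecar": folder / f"{stem}_components.tsv",   # good/bad labels
    }


def write_component_sidecar(path, good_components, n_components):
    """Write one row per component with a 'good'/'bad' status (0-based index)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    good = set(good_components)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["component", "status"])
        for idx in range(n_components):
            writer.writerow([idx, "good" if idx in good else "bad"])
    return path
```

Inside the conversion loop, the `ica` and `sources` paths would then be passed to whatever saves the ICA object and the source time-series, and `write_component_sidecar` could be swapped for the annotation function already in the repo.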
@anandsaini024 This is what I briefly described to you this evening. It would be great if you could finish this conversion function.
Alright, I will pick this up.
@anandsaini024 have you preprocessed the ANTS dataset already? I am going to use one of the subjects as a test subject for the hs student to QA his annotations.
If you did, do you mind pushing up the script to `mnestudy/ica_to_bids/ants.py` or somewhere there?
@adam2392 You should have received something from openneuro on your Gmail for dataset `ds004178`. It contains the ANT Neuro raw files and the preprocessed (`pp`) files with their ICA decomposition. It is however an automatic pipeline, so please have a look at the preprocessed data when you load it. Also, I did not check when I picked up those files, but maybe some of them are bad recordings with a lot of bridged electrodes (it does happen from time to time). If this is the case, then the recording has to be excluded and I can provide a different one instead.
Note: I deleted the old dataset with only raw data and replaced it with this one.. I did not figure out how to easily update the existing dataset 🤯
Can you share with me again?
Yeah, updating is a pain. Adding files is easy, deleting files is kind of a pain, and modifying files is a super pain.
Then my plan for the hs student is to:
It seems we might not be able to get him to fully annotate the ICA components as desired, but hopefully we can get at least some of the raw annotated.
So.. the dataset does not appear even on my account.. except if I explicitly enter the corresponding URL. Indeed, in the "share" tab you were not appearing anymore.. I've sent again the share invite. Let me know..
I unfortunately did not. Perhaps it just didn't finish uploading yet?
Openneuro even tho it's "nice" seems pretty buggy -__-
Yep.. I had multiple issues with it recently. It did finish uploading for sure (I was very careful about that :p).. I'll upload it to dropbox or GoogleDrive this evening and I'll share it on your Gmail account.
Oh I see it now on openneuro :p
Same it finally popped up on my account..
So just to be clear, the `proc-raw` files are raw data as it comes out of the amplifier, and the `proc-pp` and `proc-ica` files are what comes out of https://github.com/adam2392/improve_icalabel/tree/master/mnestudy/raw_to_bids
All 3 files are actually generated from a non-BIDS-compliant dataset with this script: https://github.com/adam2392/improve_icalabel/blob/master/mnestudy/raw_to_bids/ant.py
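To tell the three variants apart programmatically, the `proc-` entity can be read from each file name. A minimal sketch, assuming BIDS-style `key-value` entities joined by underscores; the helper names and any file names passed in are hypothetical and not taken from the actual dataset listing.

```python
from pathlib import Path


def parse_entities(filename):
    """Split a BIDS-style file name into a dict of its key-value entities."""
    name = Path(filename).name
    stem = name.split(".", 1)[0]  # drop the extension(s)
    entities = {}
    for part in stem.split("_"):
        key, sep, value = part.partition("-")
        if sep:  # a trailing suffix such as "eeg" has no "-" and is skipped
            entities[key] = value
    return entities


def group_by_proc(filenames):
    """Group file names by their proc- entity ('unknown' if it is absent)."""
    groups = {}
    for fname in filenames:
        proc = parse_entities(fname).get("proc", "unknown")
        groups.setdefault(proc, []).append(fname)
    return groups
```

Grouping this way makes it easy to hand only the `pp` files to someone reviewing the automated preprocessing while keeping the `raw` files untouched.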
Agenda for tomorrow so I don't forget:
Outcome action items:
- `proc-pp` dataset for determining if the output for bad/good electrodes was done correctly by the automated pipeline. Note: if any of the datasets have more than like... 10 bridged electrodes, notify @mscheltienne.
Hello,
Thank you for mne-iclabel! I would like to test a new IC feature extracted from the time series in training a multi-class IC classifier (ideally, more than 2 types of ICs).
Thank you!! :smiley:
@chmendoza As far as I know, there is no large publicly available dataset for IC classification. We were working on processing a dataset (referenced above) to test the IC classification. The feature/label dataset for ICLabel is available, but not the original ICs. That dataset is available here: https://github.com/lucapton/ICLabel-Dataset.
> Thank you for mne-iclabel! I would like to test a new IC feature extracted from the time series in training a multi-class IC classifier (ideally, more than 2 types of ICs).
@chmendoza Feel free to make a separate GH issue/PR, if/when your model is ready for review. We would love to include this into MNE-ICALabel to propagate it to the MNE community.
The dataset referenced in https://github.com/agramfort/artifact-learn/issues/1#issuecomment-906141483 has good and bad components labeled after ICA.
To facilitate easy training/testing, it would be good to write a BIDSification script, using mne-bids, that converts the dataset into BIDS format at some point.
cc: @jacobf18 @mscheltienne @anandsaini024