microsoft / PhysioPro

A deep learning framework for physiological data processing and understanding.

Regarding the SEED features used in the NIPS23 paper #16

Closed sylyoung closed 6 months ago

sylyoung commented 7 months ago

The results reported in the paper [Learning Topology-Agnostic EEG Representations with Geometry-Aware Modeling] are exceedingly high. Did you use the moving-averaged features provided by the SEED dataset? If so, the subject-dependent experiment becomes very unreasonable, because such methods generally assume the full trial belongs to the same class, which is essentially the data leakage problem described in "The perils and pitfalls of block design for EEG classification experiments".

victorywys commented 7 months ago

Many thanks for your interest in our paper. Sorry for the late reply; it took us some time to check the material you provided.

Choice of dataset and fair comparison with baselines.

We chose SEED as the dataset for our experiments because 1) SEED has a very different channel configuration from the large pretraining dataset (TUEG) while being reasonably large as a downstream task; 2) SEED is widely used in previous studies, which makes comparison convenient; and 3) SEED officially provides DE features, which reduces the burden of temporal modeling and lets us focus on transferring across channel configurations.

We indeed use the officially provided DE features in our experiments. You can find the preprocessing code here, which is merely a transformation of the data format. This setting is shared across all the baselines in our experiments to ensure a fair comparison.
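For concreteness, here is a minimal sketch (not the repository's actual preprocessing script) of what such a format transformation could look like, assuming the standard SEED ExtractedFeatures layout in which each session `.mat` file stores the official features under per-trial keys `de_LDS1` ... `de_LDS15` with shape (channels, windows, bands):

```python
# Hypothetical sketch, not the repo's actual preprocessing code.
import numpy as np
import scipy.io as sio

def load_session_de_lds(mat_path, trial_labels):
    """Load the 15 per-trial 'de_LDS*' arrays from one SEED session .mat file.

    trial_labels: the 15 emotion labels shipped with the dataset (e.g. from label.mat).
    Returns features (n_windows, n_channels, n_bands), window labels, and trial ids.
    """
    mat = sio.loadmat(mat_path)
    feats, labels, trial_ids = [], [], []
    for trial_idx, label in enumerate(trial_labels, start=1):
        de_lds = mat[f"de_LDS{trial_idx}"]       # assumed shape: (channels, windows, bands)
        windows = de_lds.transpose(1, 0, 2)      # -> (windows, channels, bands)
        feats.append(windows)
        labels.extend([label] * windows.shape[0])
        trial_ids.extend([trial_idx] * windows.shape[0])
    return np.concatenate(feats), np.array(labels), np.array(trial_ids)
```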

Concerning the SEED "Moving Average Problem"

Upon careful examination of the study "The Perils and Pitfalls of Block Design for EEG Classification Experiments" (hereinafter the "Question Paper" or QP) that you provided, we find that its primary concern relates to the block design methodology used in the referenced study, "Deep Learning Human Mind for Automated Visual Classification" (hereinafter the "Origin Paper" or OP).

In the Origin Paper, the authors implemented an experimental protocol wherein 40 object classes from the ImageNet database, with 50 images per class, were presented as visual stimuli to six participants during EEG recordings. A block design was used, whereby each participant was exposed to 40 blocks of stimuli, each block comprising 50 images from a single object class, ensuring each image was presented exactly once. Consequently, all participants viewed the same set of 2,000 images in total. (Adapted from the Question Paper)

The Question Paper subsequently proposed a series of experiments to challenge the Origin Paper's findings, suggesting that the reported high accuracy might be attributable to the model "learning arbitrary long-term static mental states during a block rather than dynamic brain activity." This assertion was based on the observation that trials sharing the same label were drawn from the same blocks. Hence, applying a moving average makes it even easier to learn long-term static mental states, which is why it is unreasonable in the OP's setting.

However, in the SEED experiment, each subject completes 15 trials. One trial contains a movie clip labeled with one of three classes [neutral, positive, negative]. The order of trials is nearly random. The training/test split is 9/6 trials. A training sample and a validation sample with the same label come from different trials, hence we are not learning features of static mental states within one trial. Although the moving average is applied within one trial, it tends to eliminate noise rather than extract features of static mental states. And, to the best of our understanding, no information from the test set is exposed during training.
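A minimal sketch of the trial-level split described above (the first 9 trials for training, the remaining 6 for testing), assuming the `features`, `labels`, `trial_ids` arrays from the earlier sketch; the actual split code in the pipeline may differ in details:

```python
# Illustrative only: a trial-level 9/6 split so that windows from the same trial
# never appear in both the training and the test set.
import numpy as np

def split_by_trial(features, labels, trial_ids, n_train_trials=9):
    train_mask = np.isin(trial_ids, np.arange(1, n_train_trials + 1))
    test_mask = ~train_mask
    return (features[train_mask], labels[train_mask],
            features[test_mask], labels[test_mask])
```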

We hope we have correctly understood and properly addressed your concerns. We welcome constructive feedback and are open to adjustments based on further insights and discussion, so we will leave this issue open for a while in case you would like to provide more details about the concern or have further questions.

sylyoung commented 7 months ago

Thank you for your reply, and I greatly appreciate the time you took to look into it!

I think your argument on the "Pitfall" problem is correct. As long as your experimental protocol ensures that "a training sample and a validation sample with the same label come from different trials," there is no "Pitfall" problem here.

However, as I looked into the provided preprocessing code, I realized that you are using the "de_LDS" features, i.e., the differential entropy features smoothed with a Linear Dynamical System from the paper "Off-line and on-line vigilance estimation based on linear dynamical system and manifold learning" [EMBC 2010]. I have not fully understood how the LDS performs the normalization/smoothing, but I strongly suspect there are issues with this normalization technique. I also tried extracting DE features from the raw signals of the SEED dataset myself instead of using the provided normalized features, and the performance dropped to very low scores.

If you check the many other papers on EEG emotion classification for datasets without such normalization techniques, e.g., DEAP and MAHNOB, the binary accuracies are around 60%, e.g., "TSception: Capturing Temporal Dynamics and Spatial Asymmetry From EEG for Emotion Recognition" [TAC 2023], very far from the over 95% reported on SEED. For competitions/contests that require online inference, the aforementioned normalization techniques generally cannot be applied, and the performance also drops sharply from offline to online testing, e.g., "Comparison of cross-subject EEG emotion recognition algorithms in the BCI Controlled Robot Contest in World Robot Contest 2021".
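For reference, a rough sketch of what extracting DE features directly from raw signals can look like: band-pass each channel into the usual frequency bands and compute differential entropy per short window under a Gaussian assumption, DE = 0.5 * ln(2 * pi * e * sigma^2). The band edges, sampling rate, and window length below are illustrative assumptions, and no LDS smoothing is applied here, which is exactly the step whose effect is under discussion:

```python
# Rough sketch only; band edges, sampling rate, and window length are assumptions.
import numpy as np
from scipy.signal import butter, filtfilt

BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 14),
         "beta": (14, 31), "gamma": (31, 50)}

def de_features(raw, fs=200, win_sec=1.0):
    """raw: (n_channels, n_samples). Returns DE of shape (n_windows, n_channels, n_bands)."""
    win = int(fs * win_sec)
    n_win = raw.shape[1] // win
    out = np.empty((n_win, raw.shape[0], len(BANDS)))
    for b, (lo, hi) in enumerate(BANDS.values()):
        bb, aa = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        filtered = filtfilt(bb, aa, raw, axis=1)
        for w in range(n_win):
            seg = filtered[:, w * win:(w + 1) * win]
            # Differential entropy of a Gaussian: 0.5 * ln(2 * pi * e * variance)
            out[w, :, b] = 0.5 * np.log(2 * np.pi * np.e * np.var(seg, axis=1))
    return out
```

The provided "de_LDS" features additionally smooth these window-wise values over time within each trial, which is the normalization step I am questioning.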

In the end, I am not doubting your results or your proposed method in the paper, but I strongly suggest that you fully look into the data preprocessing steps, instead of using the provided features, and re-examine whether 99% accuracy is possible, considering the difficulty of the EEG emotion classification task. Intriguingly, one group of papers claims over 95% accuracy while the other reports results barely above random guessing. One side must be doing something wrong.

victorywys commented 7 months ago

Thank you very much for raising this issue and making us aware of it. It is possible that one side is doing something wrong, but it is also possible that LDS is a crucial yet legitimate part of EEG preprocessing, as long as it does not violate any machine learning principles. Note that some studies (e.g. https://ieeexplore.ieee.org/document/9204431/) also claimed over 90% accuracy on binary classification tasks such as DEAP and DREAMER.

We are sorry that we cannot draw a conclusion before we gain a deeper understanding. However, on the one hand, we will continue investigating this problem; on the other hand, we are planning to test MMM on more datasets besides SEED, as well as to enable it on raw EEG signals. We will keep this repo updated.

If you have further questions, please let us know. I'll leave this issue open for a few more days.