norahollenstein / zuco-benchmark

ZuCo Reading Task Classification Benchmark using EEG and Eye-Tracking Data

How to match preprocessed EEG data with words? #2

Open andreykramer opened 1 year ago

andreykramer commented 1 year ago

Hi, I want to work with data at high temporal resolution. If I understand correctly, each sensor's values are averaged over every word in the "Matlab files" folder of https://osf.io/2urht/, so I'm trying to read the data from the Preprocessed folder instead.

According to https://osf.io/2urht/wiki/Data%20format/

Preprocessed

In the preprocessed folder is a folder for each subject. In these folders you can find:
- The preprocessed EEG data with Automagic (XX_EEG.mat). Please see the description of the preprocessing in the ZuCo paper for the details.
- The wordbounds (wordbounds_XX.mat), which are the coordinates of the word bounds for each presented word.
- The eye-tracking data (XX_ET.mat).

However, it doesn't seem intuitive to me that there is only a (1, 50)-shaped array of word bounds for splitting each subject's signals, given that each subject's recording has a different length in the time dimension and that there are more than 50 words. So what would be the correct way to match EEG data to words in these files? Thank you in advance.

import h5py
from scipy.io import loadmat

if __name__ == "__main__":
    # Word bound coordinates for the normal reading (NR1) task
    word_bounds = loadmat("../data/wordbounds_NR1.mat")
    print(word_bounds.keys())
    print(f"{word_bounds['wordbounds'].shape=}")
    print(f"{word_bounds['textbounds'].shape=}")
    # Continuous preprocessed EEG for two subjects (MATLAB v7.3 files, hence h5py)
    preprocessed_data_YAC_NR1 = h5py.File("../data/gip_YAC_NR1_EEG.mat", 'r')
    preprocessed_data_YAG_NR1 = h5py.File("../data/gip_YAG_NR1_EEG.mat", 'r')
    print(f"{preprocessed_data_YAC_NR1['EEG']['data'].shape=}")
    print(f"{preprocessed_data_YAG_NR1['EEG']['data'].shape=}")
$ python3 read_zuco.py
dict_keys(['__header__', '__version__', '__globals__', 'wordbounds', 'textbounds'])
word_bounds['wordbounds'].shape=(1, 50)
word_bounds['textbounds'].shape=(1, 50)
preprocessed_data_YAC_NR1['EEG']['data'].shape=(225411, 105)
preprocessed_data_YAG_NR1['EEG']['data'].shape=(347416, 105)
samuki commented 1 year ago

Hi, thanks a lot for reaching out!

Yes, both word_bounds['wordbounds'] and word_bounds['textbounds'] have a shape of (1, 50), one entry per sentence. However, each entry of word_bounds['wordbounds'] is itself an array with one row per word in that sentence. You can iterate over the entries of word_bounds['wordbounds'] to inspect their shapes, e.g.:

from scipy.io import loadmat

word_bounds = loadmat("wordbounds_NR1.mat")
print("Word bounds")
print(word_bounds['wordbounds'].shape)
print([word_bounds['wordbounds'][0][i].shape for i in range(word_bounds['wordbounds'].shape[1])])
print("Text bounds")
print(word_bounds['textbounds'].shape)
print([word_bounds['textbounds'][0][i].shape for i in range(word_bounds['textbounds'].shape[1])])

Output:

Word bounds
(1, 50)
[(25, 4), (17, 4), (15, 4), (27, 4), (23, 4), (37, 4), (25, 4), ...]
Text bounds
(1, 50)
[(1, 4), (1, 4), (1, 4), (1, 4), (1, 4), (1, 4), (1, 4), ...]
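
The four columns of each wordbounds row are the screen coordinates of one word's bounding box. As a minimal sketch of how to index a single word's box (the column order used in the comments is an assumption, not documented in this thread):

from scipy.io import loadmat

word_bounds = loadmat("wordbounds_NR1.mat")

# word_bounds['wordbounds'][0][s] is an (n_words, 4) array for sentence s;
# row w holds the four screen coordinates of word w's bounding box
# (assumed order: [x_min, y_min, x_max, y_max]).
first_sentence = word_bounds['wordbounds'][0][0]
print(first_sentence.shape)  # (25, 4) for a 25-word sentence
print(first_sentence[0])     # bounding box of the sentence's first word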

This repository only covers the ZuCo Benchmark paper, for which we used sentence-level features exclusively.

To see how you can match EEG data to words, I can refer you to the reading-task-classification repository for the paper Reading Task Classification Using EEG and Eye-Tracking Data, where we conducted experiments with both sentence-level and word-level features. The code for extracting and matching the word-level features is here.

andreykramer commented 1 year ago

Hi! Thanks for the quick and detailed answer.

I was confusing the word bounds with something I could use to slice the data from the Preprocessed folder into words, but now I realize they represent the coordinates where the words were located on the screen while the participants were reading them.

From what I understand from reading the paper and code you shared, both the word-level and sentence-level features you are loading there contain the already averaged values of the 105 sensors.

I wanted to train a neural network on data at the original sampling rate, be it raw data at 500 Hz or data preprocessed with the steps described in EEG Preprocessing. It shouldn't contain values averaged across the word or the sentence, but the whole sequence captured by the EEG.

From what I see, the only place to find that is in the files located under the Preprocessed and Raw Data folders, but the files from Preprocessed, for example, are not split into sentences or words: preprocessed_data_YAC_NR1['EEG']['data'].shape=(225411, 105). It's still unclear to me how I can take those 225411 samples, which I guess correspond to the reading of all the sentences in NR1, and split them into separate sentences that I can feed into a model that uses textual labels. What would be the easiest way? I'm sorry if that's already done in the code you shared, but from what I can tell it isn't.

samuki commented 1 year ago

Hi, I forwarded your question since I only worked with the extracted features for the ZuCo Benchmark paper. I'll get back to you once there is an answer.

TimeLordRaps commented 2 months ago

Any updates on this?

From reading this issue, it seems like there is no simple way to group EEG data by its co-occurrence with word labels, only with their coordinates, which could be used to determine where the subject was looking on the task materials.

If this is the case, would you just need to match the time sequences up with the word locations?

From looking at the task-materials, it isn't immediately obvious how to link the word locations with the words themselves.

That being said, I'm not entirely sure what the first two columns of each of the task-material CSVs represent. Can you shed some light on these, and on how one might go about linking the text with the readings?

Specifically, it was mentioned in the README.md of the dataset itself on OSF that "Sentences of the NR and TSR condition are merged and shuffled within each subject." How can this shuffling be determined?

samuki commented 1 month ago

Hi, sorry for the extended wait!

Regarding the matching process, I will forward the following explanation: the ZuCo 1 OSF repository contains a script (https://osf.io/gan9v) that helps to understand the matching process. First, the eye-tracking data (and the events of the eye-tracker) are merged into the EEG data. Then all "fixation" events of the eye-tracker are extracted. For each fixation, the script checks whether it lies within the wordbounds of a word. If this is the case, the EEG is cut out from the beginning to the end of the fixation and assigned to that word.
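
To make that loop concrete, here is a minimal Python sketch of the per-fixation check. This is not the actual OSF script; the input structures below are hypothetical stand-ins, and the [x_min, y_min, x_max, y_max] column order for the word bounds is an assumption:

def match_fixations_to_words(fixations, wordbounds, eeg_data):
    """Assign EEG segments to words based on fixation positions.

    Hypothetical inputs (the real layout is defined by the OSF script):
      fixations: iterable of (x, y, start_sample, end_sample), giving the
        fixation position in screen pixels and its extent in EEG samples.
      wordbounds: (n_words, 4) array for one sentence, each row assumed
        to be [x_min, y_min, x_max, y_max] in screen pixels.
      eeg_data: (n_samples, n_channels) array of continuous EEG.

    Returns a dict mapping word index -> list of EEG segments (one
    segment per fixation that landed on that word).
    """
    segments = {}
    for x, y, start, end in fixations:
        for word_idx, (x_min, y_min, x_max, y_max) in enumerate(wordbounds):
            # A fixation is assigned to the word whose bounds contain it.
            if x_min <= x <= x_max and y_min <= y <= y_max:
                segments.setdefault(word_idx, []).append(eeg_data[start:end])
                break
    return segments

Fixations that fall outside every word's bounds (e.g. on the whitespace between words) are simply skipped, since the script only assigns EEG to fixations that lie within a word's bounds.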

Regarding the shuffling: only the sentences for the test subjects are merged and shuffled for the benchmark.