norahollenstein / zuco-benchmark

ZuCo Reading Task Classification Benchmark using EEG and Eye-Tracking Data

Data formatting #4

Closed hamza13-12 closed 1 month ago

hamza13-12 commented 4 months ago

Hi. I want to download the dataset and organize it into a dataframe in which one column holds the EEG signals and the other holds the corresponding sentence. That way, I can format it as an EEG-to-text dataset and use seq2seq models to generate text from an EEG signal. Please provide the basic steps for setting up your data this way.

samuki commented 4 months ago

Hi, the first step would be to download the whole dataset, using get_data.sh. You could then take a look at the (sentence-level) feature extraction function, which will be called if you run benchmark_baseline.py: https://github.com/norahollenstein/zuco-benchmark/blob/5ad276d2d075a30e8a47488cff082df435870ce3/src/extract_features.py#L95

From that function, you can save the sentence (line 117) and the EEG features (line 180).

Note that these are sentence-level features. If you want to extract word-level features, you could follow the steps for word-level classification outlined in the following repository and save the data from the corresponding feature extraction functions: https://github.com/norahollenstein/reading-task-classification?tab=readme-ov-file#classification-with-word-level-features

hamza13-12 commented 4 months ago

Hi, after downloading the entire dataset, I got the following error on running the benchmark script:

  File "C:\Users\Hamza\Desktop\ZuCo dataset\zuco-benchmark\src\extract_features.py", line 117, in extract_sentence_features
    sent = dlh.load_matlab_string(f[obj_reference_content]) #save the sentence line 117
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Hamza\Desktop\ZuCo dataset\zuco-benchmark\src\data_loading_helpers.py", line 80, in load_matlab_string
    extracted_string = u''.join(chr(c) for c in matlab_extracted_object)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Hamza\Desktop\ZuCo dataset\zuco-benchmark\src\data_loading_helpers.py", line 80, in <genexpr>
    extracted_string = u''.join(chr(c) for c in matlab_extracted_object)
                                ^^^^^^
TypeError: only integer scalar arrays can be converted to a scalar index

I resolved this by changing load_matlab_string as follows:

def load_matlab_string(matlab_extracted_object):
    """
    Converts a string loaded from h5py into a python string.
    Handles both the scenarios where the matlab_extracted_object is a bytes
    object directly (h5py 3.x) or an array of unicode characters (h5py 2.x).
    :param matlab_extracted_object: (h5py) matlab string object or bytes
    :return: extracted_string (str) translated string
    """
    if isinstance(matlab_extracted_object, bytes):
        # If the object is a bytes object, decode it directly to a string
        extracted_string = matlab_extracted_object.decode('utf-8')
    else:
        # Otherwise, handle it as an array of integers representing characters
        extracted_string = ''.join(chr(c[0]) for c in matlab_extracted_object)

    return extracted_string

hamza13-12 commented 4 months ago

Also, I am still struggling to get the EEG data and the corresponding sentences into a dataframe. Could you tell me precisely which modifications to make to your existing code to produce one? It would really streamline the whole process for me.

samuki commented 4 months ago

> Hi, after downloading the entire dataset, I got the following error on running the benchmark script:
>
>   File "C:\Users\Hamza\Desktop\ZuCo dataset\zuco-benchmark\src\extract_features.py", line 117, in extract_sentence_features
>     sent = dlh.load_matlab_string(f[obj_reference_content]) #save the sentence line 117
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "C:\Users\Hamza\Desktop\ZuCo dataset\zuco-benchmark\src\data_loading_helpers.py", line 80, in load_matlab_string
>     extracted_string = u''.join(chr(c) for c in matlab_extracted_object)
>                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "C:\Users\Hamza\Desktop\ZuCo dataset\zuco-benchmark\src\data_loading_helpers.py", line 80, in <genexpr>
>     extracted_string = u''.join(chr(c) for c in matlab_extracted_object)
>                                 ^^^^^^
> TypeError: only integer scalar arrays can be converted to a scalar index
>
> I resolved this by changing load_matlab_string as follows:
>
> def load_matlab_string(matlab_extracted_object):
>     """
>     Converts a string loaded from h5py into a python string
>     :param matlab_extracted_object: (h5py) matlab string object
>     :return: extracted_string (str) translated string
>     """
>     extracted_string = ''.join(chr(int(c)) for c in matlab_extracted_object)
>     return extracted_string

Hi, this is probably an issue with your h5py version. This error should not occur with version 2.9.0, but it's good that you found a fix.
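
To see which h5py version is active in the current environment, a small check along these lines may help. It is a minimal sketch that uses only the standard library; the helper names `installed_version` and `major_version` are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def major_version(version_string):
    """Return the leading integer of a dotted version string, e.g. 2 for '2.9.0'."""
    return int(version_string.split(".")[0])


h5py_version = installed_version("h5py")
if h5py_version is None:
    print("h5py is not installed in this environment")
elif major_version(h5py_version) >= 3:
    print(f"h5py {h5py_version} found; the benchmark was reported to work with 2.9.0")
else:
    print(f"h5py {h5py_version} found")
```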

samuki commented 4 months ago

> Also, I am still struggling to set EEG data and corresponding sentences up into a dataframe. Could you provide me with precise modifications to make to your existing code to achieve a dataframe? It really would streamline the whole process for me

Sure, here is an example:
For each subject, you could create a dataframe with the sentences and your target EEG features (e.g. t_mean). Initialize the data with df_data = {"sent":[], "t_mean":[]} at the beginning of the function (after line 96), then append to it inside the loop, e.g. at line 211:

df_data["sent"].append(sent)
df_data["t_mean"].append(t_mean)

and then, at the end, save the subject features:

df = pd.DataFrame(df_data)
df.to_csv(f"{subject}_t_means.csv")
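
One caveat with saving to CSV, if t_mean is a vector per sentence rather than a single number: to_csv writes each array as its string representation, which is awkward to parse back. One alternative, sketched here under the assumption that every t_mean has the same length, spreads the values over numeric columns. The function name `build_subject_frame` and the `t_mean_<i>` column names are hypothetical, and the two short vectors are made-up placeholder values.

```python
import pandas as pd


def build_subject_frame(sentences, t_means):
    """Build one row per sentence: the text plus one numeric column per feature value."""
    sentences = list(sentences)
    rows = [[float(value) for value in vector] for vector in t_means]
    if len(sentences) != len(rows):
        raise ValueError("need exactly one feature vector per sentence")
    frame = pd.DataFrame(rows).add_prefix("t_mean_")  # columns t_mean_0, t_mean_1, ...
    frame.insert(0, "sent", sentences)
    return frame


# Placeholder data; real values would come from df_data["sent"] and df_data["t_mean"].
example = build_subject_frame(
    ["First sentence.", "Second one."],
    [[0.1, 0.2], [0.3, 0.4]],
)
```

A frame built this way can be written with example.to_csv(path, index=False) and read back with pd.read_csv without any string parsing of the feature values.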

hamza13-12 commented 4 months ago

Thank you! This is much appreciated