sccn / clean_rawdata

Cleaning Raw EEG data

Discrepancies between EEG.etc.clean_sample_mask, EEG.event boundaries, and clean file length (EEG.pnts) after using clean_rawdata and ASR #42

Closed empetrucci closed 1 year ago

empetrucci commented 2 years ago

Using the GUI in EEGLAB ("Reject data using Clean Rawdata and ASR"), I ran your plugin on my study, opting to "remove bad data periods (instead of correcting them)." I noticed the plugin prints the %data and number of seconds kept to the command window after rejection, but I am specifically interested in the %data kept between specific events in each file, so after all datasets were processed I wrote a script to summarize this. However, after running this script, several of my files had >100% clean data, so I dug deeper and realized I had been making assumptions that were not true of the data structures. I want to make sure 1) that I understand the data structures accurately and that these assumptions are supposed to be true, and 2) that my solution to this discrepancy makes sense and does not harm the integrity of my dataset.

1) Are my assumptions supposed to be true? Am I misunderstanding anything?

Assumption 1: The length of EEG.etc.clean_sample_mask = the length of the original, raw file (pre-rejection).

Assumption 2: The sum of EEG.etc.clean_sample_mask = the length of the new, clean file (post-rejection).

Assumption 3: I can divide the sum of EEG.etc.clean_sample_mask between 2 indices by the difference in those indices to find the %clean data in that period (e.g. sum(EEG.etc.clean_sample_mask(start_index:end_index)) / (end_index - start_index) * 100); a sketch of this computation follows the list.

Assumption 4: I can also find the length of the new, clean file by subtracting the sum of all boundary durations in EEG.event from the length of the original file (e.g. length(EEG.etc.clean_sample_mask)).
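A minimal MATLAB sketch of the Assumption 3 computation, assuming start_index and end_index are hypothetical placeholders for sample indices in the original, pre-rejection timeline; the +1 reflects MATLAB's inclusive start:end indexing and therefore differs by one sample from the formula quoted above:

    % Sketch: percent clean data between two original-timeline indices (Assumption 3).
    % start_index and end_index are placeholders for the two events of interest.
    mask      = EEG.etc.clean_sample_mask;
    n_samples = end_index - start_index + 1;                         % samples in the inclusive range
    pct_clean = sum(mask(start_index:end_index)) / n_samples * 100;  % percent of samples kept
    fprintf('%.1f%% of this period survived rejection\n', pct_clean);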

Given Assumptions 2 and 4, sum(EEG.etc.clean_sample_mask) = EEG.pnts = length(EEG.etc.clean_sample_mask) - total boundary duration (I used a for loop to sum the boundary durations from EEG.event). I wrote a script to tell me if Assumptions 2 and 4 were true for each dataset in my study by comparing each computation to EEG.pnts. For some datasets both were true, for some one assumption was true and one was false, and for some both were false.
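A minimal sketch of that consistency check for a single dataset, assuming boundary events carry the type 'boundary' and store the number of removed samples in their duration field:

    % Sketch: compare Assumptions 2 and 4 against EEG.pnts for one dataset.
    mask      = EEG.etc.clean_sample_mask;
    bnd       = EEG.event(strcmp({EEG.event.type}, 'boundary'));   % boundary events only
    bnd_total = sum([bnd.duration]);                                % total removed samples per EEG.event

    assump2_ok = sum(mask) == EEG.pnts;                             % mask sum vs. clean length
    assump4_ok = (length(mask) - bnd_total) == EEG.pnts;            % original length minus boundaries vs. clean length
    fprintf('Assumption 2 holds: %d, Assumption 4 holds: %d\n', assump2_ok, assump4_ok);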

2) My solution When Assumptions 2 and 4 are true, do not change anything. There is no error.

When Assumption 2 is true and 4 is false, there is an error in EEG.event regarding boundary latency and/or duration, so I should use the clean_sample_mask to overwrite the boundary latencies and durations (i.e. a boundary should start wherever a 0 follows a 1 in the mask and last until the next 1).
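A sketch of how the rejected stretches could be recovered from the mask, in original-timeline coordinates; converting these runs into EEGLAB boundary-event latencies in the cleaned timeline would additionally need the half-sample boundary convention, which is not shown here:

    % Sketch: find runs of 0s in clean_sample_mask (original timeline).
    mask   = double(EEG.etc.clean_sample_mask(:)');   % force row vector of 0s and 1s
    d      = diff([1 mask 1]);                        % pad with 1s so edge runs are detected
    starts = find(d == -1);                           % first rejected sample of each run
    stops  = find(d ==  1) - 1;                       % last rejected sample of each run
    durs   = stops - starts + 1;                      % run lengths = candidate boundary durations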

When Assumption 4 is true and 2 is false, there is an error in EEG.etc.clean_sample_mask, therefore I should use EEG.event boundary information to overwrite the clean_sample_mask.
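And a sketch of the reverse direction, rebuilding the mask from the boundary events; the placement of each gap assumes the usual convention that a boundary latency sits at the half-sample position of the cut in the cleaned timeline and that its duration is an integer number of removed original samples, so treat this as an approximation to verify rather than a definitive rule:

    % Sketch: reconstruct clean_sample_mask from boundary events in the cleaned dataset.
    % Assumes boundary events are sorted by latency and have integer durations.
    bnd      = EEG.event(strcmp({EEG.event.type}, 'boundary'));
    orig_len = EEG.pnts + sum([bnd.duration]);        % reconstructed original length
    mask     = true(1, orig_len);
    offset   = 0;                                     % removed samples re-inserted so far
    for k = 1:numel(bnd)
        first = ceil(bnd(k).latency) + offset;        % start of the gap in the original timeline
        mask(first : first + bnd(k).duration - 1) = false;
        offset = offset + bnd(k).duration;
    end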

When Assumptions 2 and 4 are both false... I'm not sure what to do about this yet. I think I'll have to look at these on a case-by-case basis: for some datasets Assumption 2 is false but sum(EEG.etc.clean_sample_mask) is only 1 point greater than EEG.pnts, which seems like a rounding difference and makes Assumption 2 essentially true, while in other cases the values differ by much more and I have not yet determined the source of the discrepancies or what is supposed to be true.

I can upload code to demonstrate this, but I wanted to first make sure my understanding and logic is sound.

arnodelorme commented 2 years ago

Thank you for your message. It is quite long.

Assumption 1: The length of EEG.etc.clean_sample_mask = the length of the original, raw file (pre-rejection)

Yes

Assumption 2: The sum of EEG.etc.clean_sample_mask = the length of the new, clean file (post-rejection)

Yes

Assumption 3: I can divide the sum of EEG.etc.clean_sample_mask between 2 indices by the difference in those indices to find the %clean data in that period (e.g. sum(EEG.etc.clean_sample_mask(start_index:end_index)) / (end_index - start_index) * 100)

I think so. Would need to check.

Assumption 4: I can also find the length of the new, clean file by subtracting the sum of all boundary durations in EEG.event from the length of the original file (e.g. length(EEG.etc.clean_sample_mask)).

Correct.

Yes, please upload code and more importantly an example that shows the problem. Thank you so much.

empetrucci commented 2 years ago

Sorry the explanation was so long! Just wanted to make sure you had enough information.

Thank you for verifying my understanding of the data structures.

Here is a dropbox link for a few sample datasets, a couple data structures from the full study dataset, and the code I used to make one of the structures. Let me know if you have any questions or issues interacting with these.

Thank you so much!

empetrucci commented 2 years ago

Were you able to replicate my issue? If so, any idea how to resolve this? I am tempted to move forward, but I am not confident in the data's integrity because of these discrepancies in the data rejection ledgers.

arnodelorme commented 2 years ago

We have not. @dungscout96 would you mind having a look?

arnodelorme commented 2 years ago

We are looking at it.

empetrucci commented 2 years ago

Thank you so much. Eager to hear if you could replicate and find the source of the problem.


empetrucci commented 1 year ago

@dungscout96 Were you able to replicate the problem and/or determine the source of the discrepancy?

dungscout96 commented 1 year ago

I tested file sub-10_ses-3_task-gratitude_eeg.set and verified that the total duration of the boundary events (assuming that unit is # of samples) is not equal to the number of 0s in EEG.etc.clean_sample_mask. @arnodelorme: can you verify how the duration of boundary events was calculated?
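For reference, a minimal sketch of that comparison, again assuming boundary durations are stored in samples:

    % Sketch: compare rejected samples in the mask with total boundary duration.
    bnd       = EEG.event(strcmp({EEG.event.type}, 'boundary'));
    n_removed = sum(~EEG.etc.clean_sample_mask);      % rejected samples per the mask
    bnd_total = sum([bnd.duration]);                  % rejected samples per boundary events
    fprintf('mask: %d removed, boundaries: %d removed\n', n_removed, bnd_total);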

arnodelorme commented 1 year ago

This has been fixed in the current GitHub version. Can you check? Closing now, but feel free to reopen.