hassonlab / 247-encoding

Contains python scripts for performing encoding on 247 data.

Duplicate words with same adjusted_onset #40

Closed VeritasJoker closed 1 year ago

VeritasJoker commented 2 years ago

I was sending a CSV to Bobbi to run encoding in MATLAB, but we found duplicate rows in the datum that share the same 'word' and 'adjusted_onset' values. Here is an example from the gpt2-xl datum for 676 (aligned with gpt2-xl only):

[Screenshot, 2022-09-08 5:54:42 PM]

This is the datum at this point in the code: https://github.com/hassonlab/247-encoding/blob/a1fa82850e3c773f82b721d43f773d0d45f68663/code/tfsenc_read_datum.py#L356

By this point all the processing and filtering inside encoding has been done, meaning filtering on the in_gpt2 column (and gpt2-xl_token_is_root, since the two are the same).

I'm not exactly sure what is going on here. If a word is repeated twice, the two onsets should be different. If it isn't, then the sentence itself might contain duplicated words? Will investigate more.
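For anyone who wants to reproduce the check, here is a minimal sketch of how the duplicates could be listed with pandas. It assumes the datum is a DataFrame with 'word' and 'adjusted_onset' columns, as described above. The helper name `find_duplicate_words` and the toy rows are made up for illustration and are not part of the repo.

```python
import pandas as pd


def find_duplicate_words(datum: pd.DataFrame) -> pd.DataFrame:
    """Return every row that shares both 'word' and 'adjusted_onset' with another row.

    Hypothetical helper, not part of 247-encoding.
    """
    # keep=False marks all members of a duplicate group, not just the later ones
    mask = datum.duplicated(subset=["word", "adjusted_onset"], keep=False)
    return datum[mask].sort_values(["adjusted_onset", "word"])


# Toy datum (invented values): "the" appears twice at onset 100.0 and once at 400.0
toy = pd.DataFrame(
    {
        "word": ["the", "the", "cat", "the"],
        "adjusted_onset": [100.0, 100.0, 250.0, 400.0],
    }
)

dups = find_duplicate_words(toy)
print(dups)
```

Only the two rows at onset 100.0 are flagged here. The later "the" at 400.0 is a legitimate repeat, because its onset differs.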

VeritasJoker commented 2 years ago
[Screenshot, 2022-09-08 10:28:51 PM]

Ok, so the issue is with the original data itself. We will have to go fix that in the original data and rerun pickling, embedding, encoding, and decoding : )

VeritasJoker commented 1 year ago

This should be fixed inside pickling, so this issue can be closed.