Check if token_idx column is generated correctly

hassonlab / 247-pickling

Contains code to create pickles from raw/processed data


Check if token_idx column is generated correctly #141

Closed VeritasJoker closed 1 year ago

VeritasJoker commented 1 year ago

[Two screenshots attached: "Screen Shot 2023-01-11 at 2 47 10 PM" and "Screen Shot 2023-01-11 at 2 48 28 PM"]

hvgazula commented 1 year ago

Can you please elaborate?

VeritasJoker commented 1 year ago

Oh, since I'm testing different ways to choose which tokens to use in encoding, I need token_idx to be correct, meaning it should give each token's index within its specific word. The screenshot shows the datum for 625 as loaded in encoding; token_idx doesn't seem to be correct there, since some values go as high as 73. I'll check the pickling code to see where it is generated (probably tokenize and explode?).
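To make the expectation concrete, here is a minimal sketch of a sanity check, not code from the repo. It assumes the datum is a pandas DataFrame with one row per token, and that adjusted_onset plus word identify a single word occurrence. The helper name find_bad_token_idx and the toy rows are hypothetical.

```python
import pandas as pd


def find_bad_token_idx(df, keys=("adjusted_onset", "word")):
    """Return rows from word groups whose token_idx is not exactly 0..n-1.

    Assumption: `keys` uniquely identifies one word occurrence.
    """
    grouped = df.groupby(list(keys))["token_idx"]
    n_tokens = grouped.transform("size")
    ok = (
        (grouped.transform("min") == 0)
        & (grouped.transform("max") == n_tokens - 1)
        & (grouped.transform("nunique") == n_tokens)
    )
    return df[~ok]


# Toy datum (made-up values): token_idx here is a running counter
# that never restarts at a new word, so it grows past the word length.
datum = pd.DataFrame(
    {
        "adjusted_onset": [10.0, 10.0, 25.0, 40.0, 40.0, 40.0],
        "word": ["running", "running", "the", "unhappily", "unhappily", "unhappily"],
        "token": ["run", "ning", "the", "un", "happi", "ly"],
        "token_idx": [0, 1, 2, 3, 4, 5],
    }
)

bad = find_bad_token_idx(datum)
print(bad[["word", "token", "token_idx"]])
```

In this toy case only the first word passes. Every later word is flagged because its token_idx does not restart at 0.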

hvgazula commented 1 year ago

Can you try an older commit and check this?

VeritasJoker commented 1 year ago

Oh, I'm trying your newest commit in dev, which has a new line for the token_idx generation.

VeritasJoker commented 1 year ago

OK, update: I'm revising this line to df["token_idx"] = (df.groupby(["adjusted_onset","word"]).cumcount()).astype(int), since df.index gets changed during explode.
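A small sketch of the revised line on toy data, for illustration only. It assumes a word-level frame with a list-valued token column (hypothetical; the repo's real column names may differ) and is not the actual pickling code. In pandas, explode repeats the original row label for every element of the list, so the index no longer counts tokens one by one. groupby(...).cumcount() numbers rows within each group instead and does not depend on the index.

```python
import pandas as pd

# Hypothetical word-level frame: one row per word, tokens stored as a list.
words = pd.DataFrame(
    {
        "adjusted_onset": [10.0, 25.0, 40.0],
        "word": ["running", "the", "unhappily"],
        "token": [["run", "ning"], ["the"], ["un", "happi", "ly"]],
    }
)

# One row per token; each row keeps the label of the word it came from.
exploded = words.explode("token")
print(exploded.index.tolist())  # repeated labels, one per token

# Use a clean 0..n-1 index, then number tokens within each word occurrence.
df = exploded.reset_index(drop=True)
df["token_idx"] = (
    df.groupby(["adjusted_onset", "word"]).cumcount()
).astype(int)

print(df)
```

One caveat: grouping on adjusted_onset and word only works if that pair is unique per word occurrence. If two occurrences of the same word share an onset, they merge into one group and the count runs across both.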