Closed VeritasJoker closed 1 year ago
Can you please elaborate?
Oh, since I'm testing different ways to choose tokens to use in encoding, I need token_idx to be correct (i.e., representing the index of each token within a specific word). The screenshot shows the datum for 625 loaded in encoding; token_idx doesn't seem to be correct, as there are values going as high as 73. So I will check the pickling code to see where this is generated (probably tokenize and explode?).
Can you try an older commit and check this?
Oh, I'm trying your newest commit on dev, which has a new line for the token_idx generation.
Ok, update: I'm revising this line to df["token_idx"] = (df.groupby(["adjusted_onset","word"]).cumcount()).astype(int), since df.index gets changed during explode.
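A minimal sketch of why the groupby-based fix is needed (the DataFrame here is hypothetical, standing in for the real pickling output): explode repeats the pre-explode index for every sub-word token, so df.index cannot serve as a per-word token position, while cumcount over (adjusted_onset, word) restarts at 0 for each word.

```python
import pandas as pd

# Hypothetical stand-in for the tokenized frame: one row per word,
# with the word split into a list of sub-word tokens.
df = pd.DataFrame({
    "adjusted_onset": [10.0, 12.5],
    "word": ["tokenization", "works"],
    "token": [["token", "ization"], ["works"]],
})

# explode gives one row per token, but REPEATS the original index (0, 0, 1),
# so index-based counting would not restart per word.
df = df.explode("token")

# cumcount within each (adjusted_onset, word) group restarts at 0 per word.
df["token_idx"] = (df.groupby(["adjusted_onset", "word"]).cumcount()).astype(int)

print(df["token_idx"].tolist())  # -> [0, 1, 0]
```

Grouping on adjusted_onset together with word also keeps repeated occurrences of the same word at different onsets from being counted as one group.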