717 has malformed datums

hassonlab / 247-pickling

Contains code to create pickles from raw/processed data

Open zkokaja opened 1 year ago

zkokaja commented 1 year ago

The 717 datums contain words surrounded by asterisks and dashes, which may cause problems for LLMs. Should we handle these words differently, or fix the problem upstream?
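If we end up handling this downstream, a rough sketch of one option is below. It assumes the markers are runs of `*` or `-` wrapping a whole word (for example `*word*` or `--word--`); the actual pattern in the 717 datums may differ, and `strip_markers` is a hypothetical helper, not something that exists in the repo.

```python
import re

# Assumption: a malformed word is wrapped on BOTH sides by one or more
# asterisks or dashes. Hyphenated words such as "well-known" are left alone
# because they do not start and end with a marker.
_WRAPPED = re.compile(r"^[*\-]+(.+?)[*\-]+$")


def strip_markers(word: str) -> str:
    """Return the word without its wrapping asterisks/dashes (hypothetical helper)."""
    match = _WRAPPED.match(word)
    return match.group(1) if match else word


def is_wrapped(word: str) -> bool:
    """Flag wrapped words so they can be inspected or excluded instead of stripped."""
    return _WRAPPED.match(word) is not None
```

Flagging with `is_wrapped` rather than stripping may be the safer choice until we know what the markers mean upstream.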

hvgazula commented 1 year ago

zkokaja commented 1 year ago

df.explode('word') repeats the original row's onset and offset for every exploded word. We should set the onsets and offsets to NaN on those duplicated rows so we don't run them in encoding.
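A minimal sketch of what that could look like, using a made-up three-row datum. The column names `onset` and `offset` and the choice to keep the timing on the first exploded word are assumptions for illustration, not the repo's actual schema or policy.

```python
import numpy as np
import pandas as pd

# Hypothetical datum: the second row's 'word' cell holds two tokens.
df = pd.DataFrame(
    {
        "word": [["hello"], ["ice", "cream"], ["ok"]],
        "onset": [0.0, 1.0, 2.5],
        "offset": [0.5, 2.0, 3.0],
    }
)

exploded = df.explode("word")

# explode() repeats the source row's index label for every element of its
# list, so a duplicated index label marks the second and later copies.
# Assumption: the first copy keeps the original timing.
is_repeat = exploded.index.duplicated(keep="first")

# Reset the index first so the boolean mask is applied to unique labels.
exploded = exploded.reset_index(drop=True)
exploded.loc[is_repeat, ["onset", "offset"]] = np.nan
```

Encoding can then skip rows where `onset` is NaN. Passing `keep=False` to `duplicated` would instead blank the timing on every word that came from a multi-word row, if none of the copies should be run.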