hassonlab / 247-pickling

Contains code to create pickles from raw/processed data

717 has malformed datums #142

Open zkokaja opened 1 year ago

zkokaja commented 1 year ago

The datum for 717 contains words surrounded by asterisks and dashes, which may cause problems for LLMs. Should we handle these words differently, or fix the problem upstream?
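
For discussion, here is a minimal sketch of one way such words could be handled downstream, assuming the markers only wrap the word and should simply be dropped. The helper name `strip_wrappers` and the example tokens are hypothetical and are not taken from the 717 datum.

```python
import re

# Hypothetical pattern: runs of asterisks or dashes at the start or end of a token.
_WRAPPERS = re.compile(r"^[*\-]+|[*\-]+$")


def strip_wrappers(word: str) -> str:
    """Remove leading and trailing asterisks/dashes, keeping interior ones."""
    return _WRAPPERS.sub("", word)


# Made-up tokens for illustration only.
words = ["*laughs*", "--um--", "well-known", "hello"]
cleaned = [strip_wrappers(w) for w in words]
print(cleaned)
```

Interior dashes (as in "well-known") are left alone, so only the wrapping markers are affected. Whether stripping is the right policy, as opposed to dropping these tokens or fixing them upstream, is still the open question.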

hvgazula commented 1 year ago

[image]

zkokaja commented 1 year ago

After df.explode('word'), we should set the onsets and offsets to NaN for the duplicated values so that we don't run them in encoding.
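
A minimal sketch of that idea, assuming the datum is a pandas DataFrame with list-valued `word` entries and columns named `onset` and `offset` (the column names and toy values here are assumptions, not the actual datum schema):

```python
import numpy as np
import pandas as pd

# Toy datum: the second row holds two tokens that share one onset/offset.
df = pd.DataFrame(
    {
        "word": [["hello"], ["do", "n't"], ["world"]],
        "onset": [0.0, 1.0, 2.0],
        "offset": [0.5, 1.5, 2.5],
    }
)

exploded = df.explode("word")

# explode() repeats the original index label for every token that came from
# the same row, so repeated labels mark the duplicated onset/offset values.
dup = exploded.index.duplicated(keep="first")

# Reset to a unique index before assigning, then blank out the duplicates.
exploded = exploded.reset_index(drop=True)
exploded.loc[dup, ["onset", "offset"]] = np.nan
print(exploded)
```

Here the first token of each original row keeps its timing and later tokens get NaN. Encoding could then skip rows with a NaN onset, for example via `exploded.dropna(subset=["onset"])`.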