Open zkokaja opened 1 year ago
Some words are surrounded with asterisks and dashes, which may cause problems for LLMs. Should we handle them differently, or fix the problem upstream?
`df.explode('word')` should set the onsets and offsets of duplicated values to NaN so we don't run them in encoding.
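A minimal sketch of what this could look like, not the project's actual code. It assumes a DataFrame with hypothetical columns `word` (a list of tokens per row), `onset`, and `offset`, and it assumes the first token of each original row should keep its timing while the later duplicates get NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical example data: each row holds a list of tokens plus one onset/offset.
df = pd.DataFrame({
    "word": [["hello"], ["do", "n't"], ["world"]],
    "onset": [0.0, 0.5, 1.0],
    "offset": [0.4, 0.9, 1.5],
})

# explode() repeats the original index (and onset/offset) for every token in a row.
exploded = df.explode("word")

# Mark every row after the first one that came from the same original row.
is_repeat = exploded.index.duplicated(keep="first")

# Give the exploded frame a unique index, then blank out the duplicated timings.
exploded = exploded.reset_index(drop=True)
exploded.loc[is_repeat, ["onset", "offset"]] = np.nan

print(exploded)
```

Downstream encoding code could then drop or skip rows where `onset` is NaN, for example with `exploded.dropna(subset=["onset"])`.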