It seems that the openwebtext corpus contains some empty strings ("") and some texts containing "\x00\x00". I find that dropping these examples gives a better loss.
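def remove_empty_strings(example):
    # Keep an example only if its text is non-empty and has no "\x00\x00" run.
    if example['text'] == '' or '\x00\x00' in example['text']:
        return False
    return True

# `dataset` is the already-loaded openwebtext dataset.
processed_dataset = dataset.filter(remove_empty_strings)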
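To check how many examples such a filter would drop before running it on the full corpus, here is a minimal sketch that applies the same condition to a plain Python list. The `is_clean` name and the sample records are made up for illustration and are not taken from openwebtext.

```python
def is_clean(example):
    """Return True for examples to keep: non-empty text without a double NUL."""
    text = example["text"]
    return text != "" and "\x00\x00" not in text


# Hypothetical records standing in for openwebtext rows (illustration only).
samples = [
    {"text": "A normal document."},
    {"text": ""},
    {"text": "broken\x00\x00bytes"},
    {"text": "Another normal document."},
]

kept = [ex for ex in samples if is_clean(ex)]
dropped = len(samples) - len(kept)
print(f"kept {len(kept)} of {len(samples)} examples, dropped {dropped}")
```

A predicate written this way can be passed to `dataset.filter` in the same way as the function above, since `filter` keeps the examples for which the function returns True.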