stas00 opened this issue 2 years ago
Hi and thank you for the report.
Since goclassy (the pipeline used to generate OSCAR 2019) and its successor ungoliant download and generate OSCAR from Common Crawl dumps, the downloading and slicing of the 4B backslashes most likely happened there.
I would like some clarification on the word "record", since it can mean many things in this context.
The issue may well present itself again in the latest OSCAR 21.09, since the filtering is more or less the same.
We will look into what can be done to improve detection of such low-quality content.
While working on https://github.com/bigscience-workshop/bigscience we found 3835 records consisting entirely of backslashes in OSCAR-en.
My suspicion is that OSCAR downloaded a single webpage that consisted of, say, 4B backslashes. It then happily sliced it into 0.5M-character records (0.5M appears to be its maximum document length) and thus introduced thousands of records containing nothing but backslashes.
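The slicing hypothesis can be sketched as follows. The 524288-character chunk size is an assumption (2**19, deduced from the observed 524287-character records), and the page length here is a small stand-in for the real multi-billion-character page:

```python
MAX_LEN = 524288  # assumed max record length (2**19); the observed records are 524287 chars
page = "\\" * 2_000_000  # small stand-in for the giant all-backslash webpage

# Naive fixed-width slicing turns one bad page into many bad records.
records = [page[i:i + MAX_LEN] for i in range(0, len(page), MAX_LEN)]
print(len(records))                            # -> 4
print(all(set(r) == {"\\"} for r in records))  # -> True
```

With a real ~4B-character page, the same slicing would produce thousands of such records, consistent with the 3835 found.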
Checking that the original indeed contains these records:
Download the dataset (after `pip install datasets`) and check the original records:
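A minimal sketch of the check. The detection helper is testable on its own; the commented-out part assumes the `oscar` dataset hosted on the Hugging Face Hub (`unshuffled_deduplicated_en` config) and uses streaming to avoid the multi-hundred-GB download:

```python
def is_backslash_record(text: str) -> bool:
    """True if the document consists of nothing but backslashes."""
    return bool(text) and set(text) == {"\\"}

# To scan the real corpus (assumed dataset/config names):
#
#   from datasets import load_dataset
#   ds = load_dataset("oscar", "unshuffled_deduplicated_en",
#                     split="train", streaming=True)
#   bad = [d["text"] for d in ds if is_backslash_record(d["text"])]

# Illustration on stand-in records:
records = ["\\" * 524287, "a normal sentence.", "\\" * 10]
print(sum(is_backslash_record(r) for r in records))  # -> 2
```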
Validate:
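Validation can be as simple as confirming every flagged record matches an all-backslash pattern; `suspects` here is a hypothetical stand-in for the records flagged in the previous step:

```python
import re

BACKSLASHES = re.compile(r"\\+")  # one or more backslashes, nothing else (via fullmatch)

# Stand-in for the flagged records; the report found 3835 of these.
suspects = ["\\" * 524287, "\\" * 1000]
assert all(BACKSLASHES.fullmatch(s) for s in suspects)
print(f"validated {len(suspects)} all-backslash records")
```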
Look at the lengths:
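A length histogram makes the slicing pattern visible; again `suspects` is a stand-in for the flagged records:

```python
from collections import Counter

# Stand-in: most flagged records share one length, as observed in the report.
suspects = ["\\" * 524287] * 3 + ["\\" * 1000]
lengths = Counter(len(s) for s in suspects)
print(lengths.most_common(2))  # -> [(524287, 3), (1000, 1)]
print(max(lengths))            # -> 524287
```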
The largest length is 524287, which is also the most common record length.