oscar-project / corpus

corpus issues.
Apache License 2.0

3835 records full of backslashes #4

Open stas00 opened 2 years ago

stas00 commented 2 years ago

At https://github.com/bigscience-workshop/bigscience we found 3835 records full of backslashes in OSCAR-en.

My suspicion is that OSCAR downloaded a single webpage that consisted of, say, 4B backslashes. It then happily sliced it into 0.5M-character records (which I deduce is its maximum document length) and thus introduced thousands of records containing nothing but backslashes.

Checking that the original indeed contains these records:

$ perl -lne 'm|(\\{10000,})| && print length $1' data-with-many-slashes.txt | wc -l
4245

Look at the lengths:

$ perl -lne 'm|(\\{10000,})| && print length $1' data-with-many-slashes.txt | sort -V

The largest length is 524287 (that is, 2^19 - 1), which is also the most common record length.
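To check the "most common" claim without eyeballing the sorted output, the run lengths can be tallied. The following is a minimal Python sketch that mirrors the perl one-liner above: it uses the same pattern (a run of at least 10000 backslashes) and records the first such run per line. The function names `backslash_run_lengths` and `summarize` are made up for this illustration.

```python
import re
from collections import Counter

# Same pattern as the perl one-liner: a run of at least 10000 backslashes.
LONG_RUN = re.compile(r"\\{10000,}")


def backslash_run_lengths(lines):
    """Return the length of the first long backslash run on each matching line."""
    lengths = []
    for line in lines:
        match = LONG_RUN.search(line)
        if match:
            lengths.append(len(match.group(0)))
    return lengths


def summarize(lengths):
    """Return (number of matches, largest length, most common length)."""
    if not lengths:
        return (0, 0, None)
    most_common_length, _ = Counter(lengths).most_common(1)[0]
    return (len(lengths), max(lengths), most_common_length)
```

Usage would be something like `summarize(backslash_run_lengths(open("data-with-many-slashes.txt", errors="replace")))`, which reads the file line by line like `perl -n` does.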

Uinelj commented 2 years ago

Hi and thank you for the report.

Since goclassy (the pipeline used to generate OSCAR 2019) and its successor ungoliant download and generate OSCAR from CommonCrawl dumps, it seems that the whole downloading and slicing of the 4B backslashes happened there.

Could you clarify what you mean by "record"? The word can mean many things in this context.

The issue may well reappear in the latest OSCAR 21.09, since the filtering is more or less the same.

We will look into what can be done to improve detection of such low-quality content.
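As a rough illustration of what such detection could look like (this is not what goclassy or ungoliant do, just a hypothetical heuristic), one option is to flag long records in which a single character accounts for almost all of the text. The function names and the default thresholds `min_length=1000` and `min_ratio=0.9` are assumptions chosen for the example, not tuned values.

```python
from collections import Counter


def dominant_char_ratio(text):
    """Fraction of the text taken up by its single most frequent character."""
    if not text:
        return 0.0
    _, count = Counter(text).most_common(1)[0]
    return count / len(text)


def looks_degenerate(text, min_length=1000, min_ratio=0.9):
    """Heuristic: a long record dominated by one character is likely junk.

    Both thresholds are illustrative assumptions and would need tuning
    against real data before use in a pipeline.
    """
    return len(text) >= min_length and dominant_char_ratio(text) >= min_ratio
```

A record of 524287 backslashes would be flagged by this check, while ordinary prose of the same length would not, since no single character comes close to 90% of normal text.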