3835 records full of backslashes

At https://github.com/bigscience-workshop/bigscience we found 3835 records full of backslashes in OSCAR-en.

My suspicion is that OSCAR downloaded a single webpage that consisted of, say, 4B backslashes. It then happily sliced that page into 0.5M-long records (which I deduce is its max doc length), and thus introduced thousands of records containing nothing but backslashes.
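
As a rough sanity check of this hypothesis, the arithmetic can be sketched as follows. The cap of 524287 characters is an assumption taken from the largest run length reported further down in this issue (it equals 2**19 - 1), the 4B page size is only the guess made above, and `records_from_page` is a hypothetical helper, not anything from the OSCAR pipeline.

```python
# Assumption: the per-record cap equals the largest run length seen in this issue.
MAX_RECORD_LEN = 524_287  # == 2**19 - 1


def records_from_page(total_chars: int, max_len: int = MAX_RECORD_LEN) -> int:
    """Number of records produced if a page is cut into pieces of at most max_len chars."""
    # Integer ceiling division, so very large counts are not rounded through floats.
    return -(-total_chars // max_len)


# Hypothetical page size from the guess above ("say 4B" backslashes).
print(records_from_page(4_000_000_000))

# Conversely: how many backslashes would fill 3835 full-length records.
print(3835 * MAX_RECORD_LEN)
```

Under these assumptions, a page of roughly 2 billion backslashes would already be enough to explain 3835 full-length records, so the 4B figure is best read as an order-of-magnitude guess.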

Checking that the original indeed contains these records:

Download the dataset (after running pip install datasets):

python -c "from datasets import load_dataset; load_dataset('oscar', 'unshuffled_deduplicated_en', split='train', keep_in_memory=False, cache_dir='cache')"

Check the original records:

cd cache/downloads
find . -type f -size +50k | xargs -n1  gunzip -c | fgrep -a '\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\' | tee data-with-many-slashes.txt
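
For anyone who prefers not to paste the very long fixed string into a shell, the same scan can be sketched in Python. This is only an approximation of the pipeline above: `lines_with_backslash_run` is a hypothetical helper, `min_run=200` is an arbitrary stand-in for the length of the literal passed to fgrep, and `min_size` roughly mirrors `-size +50k`.

```python
import gzip
import zlib
from pathlib import Path


def lines_with_backslash_run(root, min_run=200, min_size=50 * 1024):
    """Yield raw lines (bytes) from gzip files under root that contain
    a run of at least min_run consecutive backslashes."""
    needle = b"\\" * min_run  # min_run literal backslash bytes
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and small files (metadata, lock files, ...).
        if not path.is_file() or path.stat().st_size <= min_size:
            continue
        try:
            with gzip.open(path, "rb") as fh:
                for line in fh:
                    if needle in line:
                        yield line
        except (OSError, EOFError, zlib.error):
            # Not a gzip file, or a truncated/corrupt one: skip it.
            continue
```

Run from the directory that contains cache/, something like `open("data-with-many-slashes.txt", "wb").writelines(lines_with_backslash_run("cache/downloads"))` should produce a file comparable to the one written by tee.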

Validate:

$ perl -lne 'm|(\\{10000,})| && print length $1' data-with-many-slashes.txt | wc -l
4245

Look at the lengths:

perl -lne 'm|(\\{10000,})| && print length $1' data-with-many-slashes.txt | sort -V

The largest length is 524287, which is also the most common record length.
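
To back up the "most common" claim with actual counts rather than by eyeballing the sorted output, the lengths can be tallied. This is a minimal sketch that mirrors the perl one-liner (first run of at least 10000 backslashes per line); `first_run_lengths` and `tally` are hypothetical helper names.

```python
import re
from collections import Counter

# Same threshold as the perl one-liner; bytes pattern because the file may hold binary data.
RUN = re.compile(rb"\\{10000,}")


def first_run_lengths(lines):
    """Yield the length of the first run of >= 10000 backslashes in each matching line."""
    for line in lines:
        match = RUN.search(line)
        if match:
            yield len(match.group())


def tally(path="data-with-many-slashes.txt"):
    """Count how often each run length occurs in the extracted file."""
    with open(path, "rb") as fh:
        return Counter(first_run_lengths(fh))
```

With the extracted file in place, `tally().most_common(5)` lists the most frequent lengths and `max(tally())` gives the largest one.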
