[BUG] Encoding errors in OSCAR 21.09

stefan-it commented 3 years ago

Hi guys,

after downloading and extracting the Turkish part of the OSCAR 21.09 release, I've found some sentences with encoding errors:

I did a grep -c "�" tr_part_* over the complete corpus, here are some stats:

Filename	Number of affected lines
tr_part_1.txt	1579
tr_part_2.txt	1575
tr_part_3.txt	1560
tr_part_4.txt	1603
tr_part_5.txt	1527
tr_part_6.txt	1674
tr_part_7.txt	1869
tr_part_8.txt	1628
tr_part_9.txt	1618
tr_part_10.txt	1656
tr_part_11.txt	1559
tr_part_12.txt	1739
tr_part_13.txt	1895
tr_part_14.txt	1598
tr_part_15.txt	1504
tr_part_16.txt	1549
tr_part_17.txt	1469
tr_part_18.txt	1424
tr_part_19.txt	1348
tr_part_20.txt	1200
tr_part_21.txt	1719
tr_part_22.txt	1364
tr_part_23.txt	1404
tr_part_24.txt	1565
tr_part_25.txt	1482
tr_part_26.txt	1689
tr_part_27.txt	1487
tr_part_28.txt	1539
tr_part_29.txt	1624
tr_part_30.txt	1444
tr_part_31.txt	1412
tr_part_32.txt	1530
tr_part_33.txt	1310
tr_part_34.txt	163
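
For anyone who wants to reproduce these counts without grep, here is a minimal Python sketch. It assumes the parts are UTF-8 text files matching tr_part_*.txt in the current directory. The function name count_affected_lines is only illustrative and is not part of any OSCAR tooling.

```python
import glob

REPLACEMENT = "\ufffd"  # U+FFFD, the Unicode replacement character


def count_affected_lines(path):
    """Count lines containing U+FFFD, similar to `grep -c`."""
    count = 0
    # newline="\n" splits on "\n" only, as grep does.
    # errors="replace" keeps the scan going on invalid UTF-8; such lines
    # then contain U+FFFD as well and are counted as affected.
    with open(path, encoding="utf-8", errors="replace", newline="\n") as handle:
        for line in handle:
            if REPLACEMENT in line:
                count += 1
    return count


if __name__ == "__main__":
    # Assumed file layout: tr_part_1.txt, tr_part_2.txt, ...
    for path in sorted(glob.glob("tr_part_*.txt")):
        print(path, count_affected_lines(path), sep="\t")
```

Note that sorted() orders the file names lexicographically, so tr_part_10.txt is listed before tr_part_2.txt.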

From tr_part_1.txt I took one example from line 369:

Sitemize �yelik ve i�eri�in indirilmesi tamamen �cretsizdir. Sitemizde payla��lan t�m dok�manlar (Tezler, makaleler, ders notlar�, s�nav soru cevaplar, projeler) payla��mc�lar�n bireysel �al��malar� olup telif haklar� kendilerine aittir ya da a��k bir �ekilde kamusal alana yerle�tirilmi� dok�manlar�n birer kopyalar�d�r. Ki�ilerin bireysel �al��malar�n� sitemizde y�klemesinde, sitemizde payla��ma te�vik eden puanlama sisteminin de etkisi b�y�kt�r. Bunlara ra�men hala size ait olan ve burada bulunmas�na izin vermedi�iniz dok�manlar varsa ileti�im b�l�m�nden y�neticilere bildirmeniz durumunda derhal silineceklerdir.

I extracted the corresponding metadata line (hopefully the right one) from tr_meta_part_1.jsonl:

{"headers":{"warc-type":"conversion","warc-record-id":"<urn:uuid:7426b39c-a6c9-4f21-b496-39e447af11fa>","content-type":"text/plain","warc-identified-content-language":"tur,eng","warc-date":"2021-03-09T03:48:37Z","warc-target-uri":"http://www.elektrotekno.com/forum-67.html","warc-refers-to":"<urn:uuid:e3e4a0d4-cff5-4c74-b6e4-788bb49cd27a>","warc-block-digest":"sha1:RMMGZX4322A5YTPZBEYMHADF6TDTYLVI","content-length":"3068"},"offset":368,"nb_sentences":1}
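
To double-check that a metadata record really points at a given text line, something like the sketch below could be used. It rests on two assumptions that the example above suggests but does not prove: that "offset" is a 0-based line index into the text file, and that "nb_sentences" is the number of lines the record spans. The function name record_line_range is hypothetical, and the example record is shortened to the fields that matter here.

```python
import json


def record_line_range(meta_line):
    """Return (first_line, last_line), 1-based and inclusive, for one record."""
    meta = json.loads(meta_line)
    # Assumption: "offset" is a 0-based line index into the text file.
    first = meta["offset"] + 1
    # Assumption: "nb_sentences" is the number of lines in the record.
    last = meta["offset"] + meta["nb_sentences"]
    return first, last


# Shortened copy of the record quoted above (most headers omitted).
example = (
    '{"headers": {"warc-target-uri": '
    '"http://www.elektrotekno.com/forum-67.html"}, '
    '"offset": 368, "nb_sentences": 1}'
)
print(record_line_range(example))  # (369, 369)
```

Under these assumptions, offset 368 maps to line 369 of tr_part_1.txt, which matches the line the example was taken from.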

As you can see on the actual page (the warc-target-uri above), the encoding is broken by default.

The page declares its content type in the HTML as:

<meta http-equiv="Content-Type" content="text/html; charset=windows-1254" />

However, if I manually switch Chrome to "Turkish (Windows-1254)", the page renders correctly.
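
The same effect can be shown in a few lines of Python. This is only an illustration of the suspected mechanism (Windows-1254 bytes being decoded as UTF-8), not a statement about where in the pipeline it happens. The Turkish sentence is a reconstruction of the first sentence of the example above and should be treated as an assumption.

```python
# Assumed original text (reconstructed from the garbled example).
original = "Sitemize üyelik ve içeriğin indirilmesi tamamen ücretsizdir."

# Bytes as a server declaring charset=windows-1254 would send them.
raw = original.encode("cp1254")

# Decoding those bytes as UTF-8 turns each non-ASCII byte here into U+FFFD.
garbled = raw.decode("utf-8", errors="replace")

# Decoding with the declared charset recovers the text.
recovered = raw.decode("cp1254")

print(garbled)
print(recovered == original)
```

Once the bytes have been replaced by U+FFFD, the original letters cannot be recovered from the text alone, because different bytes (for example those for ş and ı) all map to the same replacement character.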

Uinelj commented 3 years ago

Thank you for the report, I'm looking into it!

Uinelj commented 3 years ago

TL;DR: It seems that the replacement characters (�) are already present in the source we use to build OSCAR.

Details:

OSCAR is built using CommonCrawl data, which is available in three formats, including WARC and WET.

WARC contains the complete HTML source for each document.
WET contains textual content only.

OSCAR is built using WET files.

While trying to pinpoint the code location where the bug could have happened, I got no Unicode conversion error from Ungoliant, while simultaneously getting replacement characters (�) in extracted data.

I also found replacement characters (�) in the source files, hinting at a conversion problem on the CommonCrawl tooling side.
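
A check of this kind can be sketched in Python as follows. It assumes the source file is gzip-compressed and UTF-8 encoded, and it searches the raw bytes so that no decoding step of its own can introduce replacement characters. The function name count_replacement_lines_gz is hypothetical.

```python
import gzip

# UTF-8 encoding of U+FFFD: b"\xef\xbf\xbd"
REPLACEMENT_UTF8 = "\ufffd".encode("utf-8")


def count_replacement_lines_gz(path):
    """Count lines of a gzip-compressed file that contain U+FFFD."""
    count = 0
    # Binary mode: the bytes are inspected as stored, without decoding.
    with gzip.open(path, "rb") as handle:
        for line in handle:
            if REPLACEMENT_UTF8 in line:
                count += 1
    return count
```

A non-zero result for an untouched source file would mean the replacement characters were there before any OSCAR processing ran.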

Looking into their source code, I found that the WET writer converts text to UTF-8 (see WETExtractorOutput.java:L152) using String.getBytes(Charset charset), which replaces invalid characters with the replacement character (�).

It also seems that the previous OSCAR version suffered from the same problem: cat tr_part_1.txt | grep "�" returns 29 663 matches (for a 1.8G file).

I'll continue to look for a solution that is compatible with our constraints, and thank you again for reporting the issue.

dseddah commented 3 years ago

Hi, for what it's worth, the previous version of OSCAR (the French instance) also contained conversion mismatches: a mix of UTF-8, Latin-1, and this Java crap that was mentioned just above.
