oscar-project / corpus


[BUG] Encoding errors in OSCAR 21.09 #2

Open stefan-it opened 2 years ago

stefan-it commented 2 years ago

Hi guys,

after downloading and extracting the Turkish part of the OSCAR 21.09 release, I've found some sentences with encoding errors:

image

I did a grep -c "�" tr_part_* over the complete corpus, here are some stats:

Filename Number of affected lines
tr_part_1.txt 1579
tr_part_2.txt 1575
tr_part_3.txt 1560
tr_part_4.txt 1603
tr_part_5.txt 1527
tr_part_6.txt 1674
tr_part_7.txt 1869
tr_part_8.txt 1628
tr_part_9.txt 1618
tr_part_10.txt 1656
tr_part_11.txt 1559
tr_part_12.txt 1739
tr_part_13.txt 1895
tr_part_14.txt 1598
tr_part_15.txt 1504
tr_part_16.txt 1549
tr_part_17.txt 1469
tr_part_18.txt 1424
tr_part_19.txt 1348
tr_part_20.txt 1200
tr_part_21.txt 1719
tr_part_22.txt 1364
tr_part_23.txt 1404
tr_part_24.txt 1565
tr_part_25.txt 1482
tr_part_26.txt 1689
tr_part_27.txt 1487
tr_part_28.txt 1539
tr_part_29.txt 1624
tr_part_30.txt 1444
tr_part_31.txt 1412
tr_part_32.txt 1530
tr_part_33.txt 1310
tr_part_34.txt 163
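For anyone who wants to reproduce these counts without grep, here is a minimal Python sketch. It assumes the extracted files sit in one directory and are named `tr_part_*.txt`; the function names `count_affected_lines` and `count_corpus` are made up for this illustration.

```python
from pathlib import Path

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER, rendered as "�"


def count_affected_lines(path):
    """Count lines containing at least one U+FFFD, like `grep -c`."""
    # surrogateescape keeps invalid bytes from being turned into U+FFFD
    # by Python itself, so only characters already in the file are counted.
    with open(path, encoding="utf-8", errors="surrogateescape", newline="\n") as handle:
        return sum(1 for line in handle if REPLACEMENT in line)


def count_corpus(directory, pattern="tr_part_*.txt"):
    """Return {filename: affected line count} for every matching file."""
    return {
        part.name: count_affected_lines(part)
        for part in sorted(Path(directory).glob(pattern))
    }
```

Calling `count_corpus(".")` in the extraction directory should give one entry per part file.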

From tr_part_1.txt I took one example from line 369:

Sitemize �yelik ve i�eri�in indirilmesi tamamen �cretsizdir. Sitemizde payla��lan t�m dok�manlar (Tezler, makaleler, ders notlar�, s�nav soru cevaplar, projeler) payla��mc�lar�n bireysel �al��malar� olup telif haklar� kendilerine aittir ya da a��k bir �ekilde kamusal alana yerle�tirilmi� dok�manlar�n birer kopyalar�d�r. Ki�ilerin bireysel �al��malar�n� sitemizde y�klemesinde, sitemizde payla��ma te�vik eden puanlama sisteminin de etkisi b�y�kt�r. Bunlara ra�men hala size ait olan ve burada bulunmas�na izin vermedi�iniz dok�manlar varsa ileti�im b�l�m�nden y�neticilere bildirmeniz durumunda derhal silineceklerdir.

I extracted the corresponding metadata line (hopefully the right one) from tr_meta_part_1.jsonl:

{"headers":{"warc-type":"conversion","warc-record-id":"<urn:uuid:7426b39c-a6c9-4f21-b496-39e447af11fa>","content-type":"text/plain","warc-identified-content-language":"tur,eng","warc-date":"2021-03-09T03:48:37Z","warc-target-uri":"http://www.elektrotekno.com/forum-67.html","warc-refers-to":"<urn:uuid:e3e4a0d4-cff5-4c74-b6e4-788bb49cd27a>","warc-block-digest":"sha1:RMMGZX4322A5YTPZBEYMHADF6TDTYLVI","content-length":"3068"},"offset":368,"nb_sentences":1}
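A rough sketch of how that lookup could be automated. It rests on an assumption that is only suggested by the example (text line 369, "offset":368, "nb_sentences":1): that `offset` is the 0-based index of the record's first line in the text file and `nb_sentences` is the number of lines it covers. `find_metadata` is a hypothetical helper, not part of any OSCAR tooling.

```python
import json


def find_metadata(meta_path, line_number):
    """Return the metadata record covering a 1-based text line number.

    Assumption: "offset" is the 0-based index of the record's first line
    and "nb_sentences" is how many consecutive lines the record spans.
    """
    index = line_number - 1
    with open(meta_path, encoding="utf-8") as handle:
        for raw in handle:
            if not raw.strip():
                continue  # skip blank lines
            record = json.loads(raw)
            start = record["offset"]
            if start <= index < start + record["nb_sentences"]:
                return record
    return None  # no record covers that line
```

For the example above this would be `find_metadata("tr_meta_part_1.jsonl", 369)`.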

As you can see on the actual page, the encoding is broken by default:

image

The HTML content type declaration is:

<meta http-equiv="Content-Type" content="text/html; charset=windows-1254" />

However, if I manually switch my Chrome to use "Turkish (Windows-1254)", the page renders correctly:

image
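The symptom is consistent with Windows-1254 bytes being read as UTF-8 with lossy replacement. A small Python sketch of that effect follows; the sample sentence is my reconstruction of the first sentence of the example line (an assumption), and `decode_both_ways` is a hypothetical helper.

```python
def decode_both_ways(text, page_charset="cp1254"):
    """Encode text in the page's charset, then decode it two ways.

    Returns (misread, recovered): the bytes read as UTF-8 with lossy
    replacement, and the same bytes read with the declared charset.
    """
    raw = text.encode(page_charset)
    misread = raw.decode("utf-8", errors="replace")
    recovered = raw.decode(page_charset)
    return misread, recovered


# Assumed original wording of the first sentence of the example line.
sample = "Sitemize üyelik ve içeriğin indirilmesi tamamen ücretsizdir."
misread, recovered = decode_both_ways(sample)
print(misread)
print(recovered)
```

Each non-ASCII Turkish letter is a single byte in Windows-1254 that is not valid UTF-8 on its own, so it is swapped for U+FFFD and the original letter cannot be recovered from the text afterwards.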

Uinelj commented 2 years ago

Thank you for the report, I'm looking into it!

Uinelj commented 2 years ago

TL;DR: It seems that the replacement characters (�) are already present in the source we use to build OSCAR.

Details:

OSCAR is built using CommonCrawl data, which is available in three formats: WARC, WAT, and WET.

OSCAR is built using WET files.

While trying to pinpoint the code location where the bug could have happened, I got no Unicode conversion error from Ungoliant, while simultaneously getting replacement characters (�) in extracted data.

I also found replacement characters (�) in source files, hinting at a conversion problem on CommonCrawl tools side.
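This would explain the absence of conversion errors: once a replacement character has been written upstream, it is ordinary, valid UTF-8, so even a strict decoder accepts it. The Python sketch below illustrates the idea only; it is not Ungoliant's actual code, and `strict_decode_ok` is a hypothetical helper.

```python
def strict_decode_ok(raw):
    """Return True if raw is valid UTF-8 under strict decoding."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Text in which an upstream tool already substituted U+FFFD.
already_replaced = "i\ufffderi\ufffdin".encode("utf-8")
# The same word as raw Windows-1254 bytes, never converted.
never_converted = "içeriğin".encode("cp1254")

print(strict_decode_ok(already_replaced))
print(strict_decode_ok(never_converted))
```

Only the unconverted bytes can trigger an error, so a pipeline that reads text which was already converted has nothing left to detect.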

Looking into their source code, I found that the WET Writer converts text into UTF-8 (see WETExtractorOutput.java:L152) using String.getBytes(Charset charset), which replaces invalid characters with the replacement character (�).

It also seems that the previous OSCAR version suffered from the same problem, with cat tr_part_1.txt | grep "�" returning 29 663 matches (for a 1.8G file).

I'll continue to look for a solution that is compatible with our constraints, and thank you again for reporting the issue.

dseddah commented 2 years ago

Hi, for what it's worth, the previous version of OSCAR (the French instance) also contained conversion mismatches: a mix of UTF-8, Latin-1, and the Java crap mentioned just above.