Open stefan-it opened 3 years ago
Thank you for the report, I'm looking into it!
TL;DR: It seems that the replacement characters (�) are already present in the source we use to build OSCAR.
Details:
OSCAR is built from CommonCrawl data, which is available in three formats: WARC, WAT, and WET.
OSCAR uses the WET files.
While trying to pinpoint the code location where the bug could have happened, I got no Unicode conversion error from Ungoliant, while simultaneously getting replacement characters (�) in extracted data.
I also found replacement characters (�) in the source files, hinting at a conversion problem on the CommonCrawl tooling side.
Looking into their source code, I found that the WET writer converts text into UTF-8 (see WETExtractorOutput.java:L152) using String.getBytes(Charset charset), which replaces invalid characters with the replacement character (�).
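To illustrate how a replacement character can end up in otherwise valid UTF-8 output, here is a minimal Python sketch. It is an analogy only, not the CommonCrawl code: the sample string is made up, and the sketch does not settle whether the substitution happens at the `getBytes` call itself or at an earlier decoding step.

```python
# Made-up Turkish sample, encoded the way a Windows-1254 page would serve it.
original = "Türkçe öğrenci"
raw = original.encode("cp1254")

# Decoding those bytes as UTF-8 with substitution: every invalid byte
# sequence is swapped for U+FFFD (the replacement character).
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the charset the bytes were actually written in recovers the text.
right = raw.decode("cp1254")

# Once U+FFFD is in the string, writing it out as UTF-8 keeps it:
# the output is valid UTF-8, so no conversion error is raised downstream.
written = wrong.encode("utf-8")

print(repr(wrong))
print(right == original)
print(b"\xef\xbf\xbd" in written)
```

This would match the symptom described above: no Unicode error in the consumer, yet � in the extracted text, because the original bytes are already gone by the time the file is read.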
It also seems that the previous OSCAR version suffered from the same problem: cat tr_part_1.txt | grep "�"
returns 29,663 matches (for a 1.8 GB file).
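For anyone who wants to reproduce the count without grep, here is a rough Python equivalent. The function names and the default `tr_part_*` pattern are illustrative assumptions; the sketch works on raw bytes so that it counts only the U+FFFD sequences actually stored in the file.

```python
from pathlib import Path

# UTF-8 encoding of U+FFFD, the replacement character.
REPLACEMENT_BYTES = "\ufffd".encode("utf-8")  # b"\xef\xbf\xbd"


def count_replacement_lines(path):
    """Count lines that contain at least one U+FFFD, like `grep -c`."""
    count = 0
    # Binary mode: no decoding step that could introduce or hide characters.
    with open(path, "rb") as handle:
        for line in handle:
            if REPLACEMENT_BYTES in line:
                count += 1
    return count


def count_per_file(directory, pattern="tr_part_*"):
    """Map each matching file name to its number of affected lines."""
    return {
        p.name: count_replacement_lines(p)
        for p in sorted(Path(directory).glob(pattern))
    }
```

Calling `count_per_file(".")` in the directory holding the extracted corpus should give one number per part file, counted per line rather than per character.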
I'll continue to look for a solution that is compatible with our constraints, and thank you again for reporting the issue.
Hi, for what it's worth, the previous version of OSCAR (the French instance) also contained conversion mismatches: a mix of UTF-8, Latin-1, and the Java conversion issue mentioned just above.
Hi guys,
after downloading and extracting the Turkish part of the OSCAR 21.09 release, I've found some sentences with encoding errors:
I ran
grep -c "�" tr_part_*
over the complete corpus; here are some stats:
From tr_part_1.txt I took one example from line 369:
I extracted the corresponding metadata line (hopefully the right one) from tr_meta_part_1.jsonl:
As you can see on the actual page, the encoding is broken by default:
HTML content type header is:
However, if I manually switch my Chrome to use "Turkish (Windows-1254)" it's working:
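The manual Chrome switch can be mimicked in code when the raw page bytes are still available (they are not in the released corpus, where the � substitution has already happened). A minimal sketch, assuming a hypothetical helper `decode_with_fallback` and a candidate list chosen for Turkish pages:

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1254")):
    """Try each candidate charset strictly; return (text, charset_used).

    Hypothetical helper for illustration, not part of OSCAR or CommonCrawl.
    """
    for name in candidates:
        try:
            # Strict decoding raises instead of silently inserting U+FFFD.
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # Last resort: lossy decoding, clearly labelled as such.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


# Made-up sample bytes, as a Windows-1254 page would serve them.
page_bytes = "İstanbul'da güneşli".encode("cp1254")
text, used = decode_with_fallback(page_bytes)
print(used, text)
```

UTF-8 is tried first because it is strict enough to reject most legacy-encoded text, while Windows-1254 accepts nearly any byte sequence and is therefore only a guess when used as a fallback.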