hlp-ai / mt-data

MT Data
Apache License 2.0
1 star, 2 forks

For each web page in CommonCrawl WET files, count the length of text in each language #2

Open hlp-ai opened 1 year ago

hlp-ai commented 1 year ago

WET files contain the extracted text of web pages, and the latest ones also include language info for that text. However, earlier WET files may have no language info, so we have to identify the languages of the text in each web page ourselves.
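As a starting point, here is a minimal sketch of the counting step, not a final implementation. It assumes an uncompressed WET byte stream in the usual WARC layout (a `WARC/1.0` line, headers, a blank line, then `Content-Length` bytes of text) and that language is identified line by line. The function names `iter_wet_records`, `toy_detect`, and `count_language_lengths` are hypothetical, and `toy_detect` is only a placeholder that guesses from Unicode ranges; a real language identifier would have to be plugged in through the `detect` parameter.

```python
from collections import Counter


def iter_wet_records(stream):
    """Yield (headers, text) for each record in an uncompressed WET byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if not line or not line.strip():
                break  # a blank line ends the header block
            key, _, value = line.decode("utf-8", "replace").partition(":")
            headers[key.strip()] = value.strip()
        length = int(headers.get("Content-Length", 0))
        payload = stream.read(length)
        yield headers, payload.decode("utf-8", "replace")


def toy_detect(line):
    """Placeholder detector (assumption): NOT real language identification.

    Returns "zh" if the line has a CJK ideograph, "en" if it has an ASCII
    letter, and "und" (undetermined) otherwise.
    """
    if any("\u4e00" <= ch <= "\u9fff" for ch in line):
        return "zh"
    if any(ch.isascii() and ch.isalpha() for ch in line):
        return "en"
    return "und"


def count_language_lengths(stream, detect=toy_detect):
    """Return {page URL: Counter({language: character count})}."""
    result = {}
    for headers, text in iter_wet_records(stream):
        if headers.get("WARC-Type") != "conversion":
            continue  # only "conversion" records carry page text
        counts = Counter()
        for line in text.splitlines():
            line = line.strip()
            if line:
                counts[detect(line)] += len(line)
        result[headers.get("WARC-Target-URI", "")] = counts
    return result
```

For a downloaded `.wet.gz` file, the stream could probably be obtained with `gzip.open(path, "rb")` from the standard library. Lengths here are counted in characters after stripping each line; counting bytes or tokens instead would be a small change inside the loop.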