google / corpuscrawler

Crawler for linguistic corpora

Add Wikipedia crawler? (300+ languages) #78

Open hugolpz opened 3 years ago

hugolpz commented 3 years ago

A quick search shows that CorpusCrawler does not crawl or use Wikipedia. I don't know Python, but it seems feasible, either from scratch on the Wikipedia API (1) or using existing server-side tools (2).

Assess interest

  1. Assess how many Wikipedia languages are not in UNILEX. See https://github.com/unicode-org/unilex/issues/14.
  2. Assess the quality of Wikipedia raw-text data in minority languages.
  3. Compare the gain to other available public corpora such as Tatoeba (358 languages).

Crawling via API

Load the available list of articles for each Wikipedia, then scrape the pages. If a wiki is too large, the crawl could be limited to a maximum of n articles.

Given an ISO code such as Ndonga's ng:

Wikipedia API provides text

Various formats available:
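As a minimal sketch (not CorpusCrawler code) of the plain-text option, assuming the TextExtracts `prop=extracts` module (enabled on Wikipedias) and an illustrative article title:

```python
# Minimal sketch (not CorpusCrawler code): fetch the plain text of one article
# through the MediaWiki Action API of a given Wikipedia. Assumes the TextExtracts
# extension ("prop=extracts"); the article title below is only an example.
import requests

def fetch_plain_text(iso, title):
    """Return the plain-text extract of `title` from `iso`.wikipedia.org."""
    url = "https://%s.wikipedia.org/w/api.php" % iso
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,       # plain text instead of HTML
        "titles": title,
        "format": "json",
        "formatversion": 2,
    }
    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    pages = resp.json()["query"]["pages"]
    return pages[0].get("extract", "")

print(fetch_plain_text("ng", "Namibia")[:200])   # "Namibia" is a hypothetical title
```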

List of Wikipedias (~300)

List of articles per Wikipedia

For convenience, I use the tiny Ndonga (ng) Wikipedia (8 articles), which is easier to explore by hand.
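A sketch of that "list of articles" step (again not CorpusCrawler's own API), using `list=allpages` on the Action API and capping the result at a maximum number of articles as suggested above; the limit value is arbitrary:

```python
# Sketch: list main-namespace article titles of one Wikipedia via "list=allpages",
# following API continuation and stopping after `max_articles` (arbitrary cap).
import requests

def list_articles(iso, max_articles=500):
    url = "https://%s.wikipedia.org/w/api.php" % iso
    params = {
        "action": "query",
        "list": "allpages",
        "apnamespace": 0,       # articles only, no Talk:/User:/... pages
        "aplimit": "max",
        "format": "json",
    }
    titles = []
    while len(titles) < max_articles:
        resp = requests.get(url, params=params, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        titles.extend(page["title"] for page in data["query"]["allpages"])
        if "continue" not in data:
            break
        params.update(data["continue"])   # standard Action API continuation
    return titles[:max_articles]

print(list_articles("ng"))   # the handful of articles on the Ndonga Wikipedia
```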

For a larger demo, you could also inspect similar URLs with the ISO code of:

| Language | Native | ISO | Articles |
| --- | --- | --- | --- |
| Ndonga | Oshiwambo | ng | 8 |
| Inuktitut | ᐃᓄᒃᑎᑐᑦ / inuktitut | iu | 514 |
| Samoan | Gagana Samoa | sm | 985 |
| Igbo | Igbo | ig | 2,085 |
| Central Bikol | Bikol Central | bcl | 10,824 |

Namespaces

On all wikis. See also here

Dumps' & paths

Using Wikipedia extractors ?

Hybrid approach

cc: @brawer

hugolpz commented 3 years ago

Discussion engaged with the Wikimedia Foundation's Dump-Generation project. See: phabricator.wikimedia.org > T276723

hugolpz commented 3 years ago

Python processing:

Wikicompiler is a fully extensible Python library that compiles and evaluates text from Wikipedia dumps. You can extract text, do text analysis, or even evaluate the AST (Abstract Syntax Tree) yourself. Topics: python, compiler, mediawiki, wikipedia, wikitext, wikipedia-dump, wikitext-parser.

One presentation is in Italian but has some interesting nuggets: here. The gist:

(Two screenshots of the presentation slides, 2021-03-27.)
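I don't have a Wikicompiler snippet at hand, so as a rough illustration of the same dump-to-plain-text idea (not Wikicompiler's own API), here is a sketch that streams a pages-articles dump and strips the wikitext with mwparserfromhell; the dump filename is a placeholder:

```python
# Illustration of the dump-to-plain-text pipeline (not Wikicompiler's API):
# stream a MediaWiki XML dump and strip wikitext markup with mwparserfromhell.
import bz2
import xml.etree.ElementTree as ET

import mwparserfromhell  # pip install mwparserfromhell

def iter_plain_texts(dump_path):
    """Yield (title, plain_text) for each page in a pages-articles dump."""
    with bz2.open(dump_path, "rb") as f:
        title = None
        for _, elem in ET.iterparse(f):
            tag = elem.tag.rsplit("}", 1)[-1]   # drop the XML namespace prefix
            if tag == "title":
                title = elem.text
            elif tag == "text" and elem.text:
                yield title, mwparserfromhell.parse(elem.text).strip_code()
            elif tag == "page":
                elem.clear()                    # keep memory flat on large dumps

# Placeholder filename following the dumps.wikimedia.org naming pattern.
for title, text in iter_plain_texts("ngwiki-latest-pages-articles.xml.bz2"):
    print(title, len(text))
```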

GTOqaz commented 1 year ago

@hugolpz That's impressive.

hugolpz commented 1 year ago

@GTOqaz there is some upcoming Google crawling covering 2,000 languages; I hope they will make some data available, especially frequency lists.

hugolpz commented 6 months ago

There are ready-to-download, open-licence corpora available:

| Project | Type | Languages (2024) | Portal (all) | Language specific | Download link | Comments |
| --- | --- | --- | --- | --- | --- | --- |
| OpenSubtitles 2016/2018 | Subtitles; parallel sentences; monolingual sentences | 75 | Portal | br&en | bre (mono) | **Source:** P. Lison and J. Tiedemann (2016), *"OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles"*, http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf. **Licence:** unclear; "The corpora is made freely available to the research community on the OPUS website" (Lison and Tiedemann, 2016). |
| Wortschatz by Leipzig | Sentences; monolingual | 290+ | - | bre | bre 100k sentences (2021) | List of sentence corpora: API reference > https://api.wortschatz-leipzig.de/ws/corpora (see the sketch below the table). |
| CC-100 | Sentences; monolingual | 115 | Portal | n.a. | br (mono) | « No claims of intellectual property are made on the work of preparation of the corpus. » |
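For the Wortschatz row, a small sketch of calling the corpora endpoint cited in the table; the shape of the JSON response is an assumption and should be checked against the API reference:

```python
# Sketch: list the corpora advertised by the Wortschatz Leipzig REST API.
# The endpoint is the one cited in the table; the response is assumed to be
# a JSON array of corpus descriptors (check the API reference for the fields).
import requests

resp = requests.get("https://api.wortschatz-leipzig.de/ws/corpora", timeout=30)
resp.raise_for_status()
corpora = resp.json()
print(len(corpora), "corpora advertised")
print(corpora[:3])   # inspect a few entries to see the available fields
```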