Just for reference, since I'm hand-comparing the language lists of Lingualibre and UNILEX.
I observed that the following languages are not in UNILEX, possibly for various reasons. I'm aware this issue relates to corpuscrawler, but since I'm comparing its product (the lists), with other list projects such as the Subtlex movement in mind, it makes more sense to report it here. Sorted by order of appearance in LinguaLibre:List_of_languages.
| ISO639-2 | ISO639-3 | Name | Sources to crawl | Notes |
|---|---|---|---|---|
| ca | cat | Catalan (Barcelona) | subtitles, ca.wikipedia.org | |
| en | eng | English | subtitles | Other solid sources exist. |
| eo | epo | Esperanto | eo.wikipedia.org | |
| | yue | Cantonese | zh-yue.wikipedia.org | Use `cmn`. Word segmentation required. |
| | lnc | Languedocien | | Use `cat`? |
| | gsc | Gascon | | Use `oci`. |
| af | afr | Afrikaans | af.wikipedia.org (100,000+) | |
| | ary | Moroccan Arabic | ary.wikipedia.org (3000+) | |
| | arz | Egyptian Arabic | arz.wikipedia.org (100,000+) | The number of articles is large relative to its small viewership; it may contain a lot of bot-generated content from `ar`. |
| vi | vie | Vietnamese | vi.wikipedia.org | |
| ko | kor | Korean | ko.wikipedia.org | Word segmentation required. |
| | shy | Shawiya language | (no Wikipedia) | |
| sv | swe | Swedish | sv.wikipedia.org | |
| | atj | Atikamekw | atj.wikipedia.org | |
| | sat | Santali | sat.wikipedia.org | |
| | bcl | Central Bikol | bcl.wikipedia.org | |
| | arq | Algerian Arabic | | Use `ara`. |
| th | tha | Thai | th.wikipedia.org | |
| | gaa | Ga | | |
| | bbj | Ghomala' language | | |
Note
Half of Lingualibre's languages have been reviewed.
Open Subtitles 2018 languages: af,ar,bg,bn,br,bs,ca,cs,da,de,el,en,eo,es,et,eu,fa,fi,fr,gl,he,hi,hr,hu,hy,id,is,it,ja,ka,kk,ko,lt,lv,mk,ml,ms,nl,no,pl,pt,pt_br,ro,ru,si,sk,sl,sq,sr,sv,ta,te,th,tl,tr,uk,ur,vi,ze_en,ze_zh,zh_cn,zh_tw
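As a quick cross-check, the two-letter codes from the table above can be tested against the Open Subtitles 2018 list. This is only a minimal Python sketch: the variable names are illustrative, and rows without a two-letter code are skipped because the Open Subtitles list uses two-letter codes.

```python
# Open Subtitles 2018 codes, copied from the note above.
OPEN_SUBTITLES_2018 = set(
    "af,ar,bg,bn,br,bs,ca,cs,da,de,el,en,eo,es,et,eu,fa,fi,fr,gl,he,hi,hr,hu,"
    "hy,id,is,it,ja,ka,kk,ko,lt,lv,mk,ml,ms,nl,no,pl,pt,pt_br,ro,ru,si,sk,sl,"
    "sq,sr,sv,ta,te,th,tl,tr,uk,ur,vi,ze_en,ze_zh,zh_cn,zh_tw".split(",")
)

# Two-letter codes of the table rows that have one (other rows are skipped).
TABLE_TWO_LETTER_CODES = ["ca", "en", "eo", "af", "vi", "ko", "sv", "th"]

# Split the table codes by whether Open Subtitles 2018 lists them.
with_subtitles = [c for c in TABLE_TWO_LETTER_CODES if c in OPEN_SUBTITLES_2018]
without_subtitles = [c for c in TABLE_TWO_LETTER_CODES if c not in OPEN_SUBTITLES_2018]

print("with subtitles:", with_subtitles)
print("without subtitles:", without_subtitles)
```

Run as-is, this reports all eight two-letter codes as present in the Open Subtitles list, so subtitles may be a candidate source for more rows than `ca` and `en`.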
Note for later
[ ] Convert the lists into proper data for rapid comparison in jsfiddle or a similar tool.
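One possible shape for that comparison, as a rough Python sketch rather than a finished tool: each list is turned into a set of codes and the difference is taken. The function names are hypothetical, and the two sample strings are illustrative placeholders, not the real Lingualibre or UNILEX lists.

```python
def parse_codes(raw: str) -> set[str]:
    """Split a comma-separated string of language codes into a normalized set."""
    return {code.strip().lower() for code in raw.split(",") if code.strip()}


def missing_from(reference: set[str], candidate: set[str]) -> list[str]:
    """Return the codes present in `reference` but absent from `candidate`, sorted."""
    return sorted(reference - candidate)


# Placeholder samples (assumption): replace with the full exported lists.
lingualibre_sample = parse_codes("cat, eng, epo, yue, afr, vie, fra")
unilex_sample = parse_codes("fra, deu, spa")

# Codes in the first sample that the second sample lacks.
print(missing_from(lingualibre_sample, unilex_sample))
```

Both lists need to use the same code scheme (for example ISO 639-3 throughout) before the set difference is meaningful.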
See also
- iso639-3 and wikipedia prefix (smallest languages together on incubator wiki)
- wikipedia prefix
- IETF BCP47
- iso639-3 and IETF BCP47