Closed by slavpetrov 8 years ago
I support this suggestion if feasible.
We thought about word embeddings as well -- that is why the raw corpora are part of the available resources (and any other texts are disallowed).
The problem with precomputed word embeddings is that it is not obvious which ones should be used. Skip-gram with negative sampling embeddings have been used for quite some time and seem to work well, but recently structured skip-gram embeddings are being used as well (wang2vec can compute them, for example), some people use GloVe, and others might want multi-lingual embeddings (i.e., projecting words of different languages into the same space). A similar issue arises with the dimension -- we are talking about tuning on ~40 corpora, so providing word embeddings with too large a dimension would require a lot of computation power. We therefore decided to give the participants only the raw texts and to leave the word embedding computation to them -- any advantage gained from word embeddings then comes from the method, not from the amount of data.
I added an explicit note to the proposal that word embeddings can be computed only on the given raw texts.
However, if you think that we should also provide precomputed word embeddings, we can easily do so (we can compute them without any problems). If so, which method and dimension do you think we should use? What about something like skip-gram with negative sampling, dimension 100 [the dimension is the same as in the original Stack-LSTM paper]?
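For concreteness, the baseline I have in mind would be something along these lines (a minimal sketch using gensim; the corpus file name and the remaining hyperparameters are placeholders, not decisions):

```python
# Minimal sketch of the suggested baseline: skip-gram with negative sampling,
# dimension 100, trained on one raw-text file with one tokenized sentence per line.
# "raw_corpus.txt" and the unstated hyperparameters are placeholders.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("raw_corpus.txt")
model = Word2Vec(
    sentences,
    vector_size=100,   # embedding dimension ("size" in older gensim versions)
    sg=1,              # 1 = skip-gram (0 would be CBOW)
    negative=5,        # negative sampling
    window=5,
    min_count=5,
    workers=4,
)
model.wv.save_word2vec_format("embeddings.vec")
```

Any of the alternatives mentioned above (wang2vec, GloVe, ...) would plug into the same slot, which is exactly why the choice is not obvious.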
How big are the raw corpora that are going to be made available? If they are sufficiently large to train good word embeddings on them, then everybody can use their preferred method and we don't need to supply pretrained embeddings...
The corpora we currently have are not that big (I should say quite small from the perspective of word embeddings, see https://ufal.mff.cuni.cz/~majlis/w2c/download.html for sizes).
However, there was a project to gather much larger corpora (~B tokens) for UD languages.
@fginter, are you still working on the "gather corpora for all UD languages" project? Do you think that the corpora could be used in the proposed UD end-to-end parsing CoNLL-17 shared task? The hard deadline for the data to be available would be March 2017. (BTW, we are only proposing the shared task at this moment.)
Hi
Yes, this is in the works. Our current pipeline uses CommonCrawl data and the CLD2 language recognition library. CLD2 was chosen for its speed. Currently we have a complete run of the latest CommonCrawl with CLD2 done, so we can gather stats of which UD languages are covered and to what extent. @mjluot runs the pipeline and can gather the stats.
It is clear that some of the more exotic UD languages will not be covered by CLD2, and many of these won't have that much data on the Web to begin with anyway.
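For reference, the per-document language decision is conceptually just a call to CLD2 (a rough sketch using the pycld2 bindings, not the actual pipeline code; the example sentence is arbitrary):

```python
# Rough sketch of CLD2-based language identification (not the real pipeline code).
# pycld2.detect() returns (is_reliable, bytes_found, details); details is a tuple
# of (language_name, language_code, percent, score) guesses, best first.
import pycld2 as cld2

def detect_language(text):
    try:
        is_reliable, _, details = cld2.detect(text)
    except cld2.error:          # e.g. invalid UTF-8 in the crawled document
        return None
    if not is_reliable:
        return None
    name, code, percent, score = details[0]
    return code

print(detect_language("Tämä on suomenkielinen lause."))   # expected: 'fi'
```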
On a more general note: there are people who will have corpora for some of the rarer languages. Maybe we could issue a "call for resources" prior to the shared task, where people could contribute the corpora they would like to use, making them officially available to others as well.
-Filip
Our (small) W2C corpus is also web-based and contains all UD 1.2 languages, except for Gothic, Old Church Slavonic, and Ancient Greek, so the language coverage on web is reasonably good.
As for the "call for resources" -- that is definitely a good idea. The proposal should probably contain only the data which we can deliver ourselves, but if it gets accepted, we could ask people to chip in.
So for the time being, I am interested in whether we can gather reasonably large (~gigaword) corpora for most (>=50%) of the languages -- if so, we can formulate the proposal so that we provide the raw data and word embeddings may be computed only from those data. Then, hopefully, other people could help us by providing data for the languages we cannot get from CommonCrawl.
@mjluot will let you know shortly for how many UD languages we may be able to get ~gigaword corpora.
All this sounds very good, but I would still like to make a plea for providing "baseline" word embeddings as well as the raw text data. One of the hallmarks of the CoNLL shared tasks has always been to make it easy for a wide range of researchers to participate. I fear that having to compute the embeddings for all these languages from scratch (even given the data) will be too high a threshold for some of the less advanced groups (and their results will suffer accordingly). The more advanced groups will probably want to use their own favorite embeddings to get an extra edge, so providing both the text data and the embeddings, although in some sense superfluous, will facilitate broad participation. Of course, all this has to be weighed against the extra burden put on the organisers.
:+1: Running word2vec (or any of its sufficiently fast alternatives) with some reasonable default parameters on this data will not be hard.
Yes, we will also provide baseline word embeddings.
From my point of view, we are trying to decide between:
Hopefully we will be able to go with the first option. I have updated the proposal to reflect this.
I'm working on some preliminary stats on common crawl data but hit a small delay. Will get them soon.
Sounds good. BTW, we recently released CLD3, an updated version of CLD2: https://github.com/google/cld3
We should decide
Additional questions:
Just a note on multi-lingual embeddings: the main question is whether we will provide some parallel data (at least a small amount as a seed dictionary, each language paired with English; it does not have to overlap with the treebank data). Without such data it is very difficult for the participants to compute multi-lingual embeddings themselves.
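To make that concrete: even a small seed dictionary is enough to learn a linear projection of each language into the English embedding space (a least-squares sketch in the spirit of the Mikolov et al. translation-matrix approach; all names are illustrative):

```python
# Sketch of the simplest cross-lingual mapping: learn a matrix W that projects
# source-language vectors onto the vectors of their English translations,
# using only a seed dictionary. emb_src / emb_en are {word: vector} dicts and
# seed_pairs is a list of (source_word, english_word) pairs -- illustrative names.
import numpy as np

def learn_projection(emb_src, emb_en, seed_pairs):
    pairs = [(s, t) for s, t in seed_pairs if s in emb_src and t in emb_en]
    X = np.array([emb_src[s] for s, t in pairs])
    Y = np.array([emb_en[t] for s, t in pairs])
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)   # minimizes ||X W - Y||
    return W

# Afterwards any source-language vector v can be compared with English vectors
# via v @ W, so the two vocabularies live in one space.
```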
We don't provide parallel data. We don't have it for all languages, it is not always easy to obtain, and we don't want to exclude a language just because it does not have parallel data.
We should not exclude a language because it does not have parallel data. Some languages are better resourced than others, and that's fine. But I don't agree that we should provide no parallel data, dictionaries, or anything else that would let participants link the languages lexically. Forcing purely delexicalized methods is not a realistic setting, and I think it could be very exciting to see some actual lexicalized transfer happening in the shared task. Anyone is still free and welcome to default to a single-language mode in their submissions, but I think we should support those who truly want to aim for lexicalized transfer.
So, I ran a language recognizer (CLD2, because of its nice Python bindings) over 6 of these CommonCrawl crawl-sets; there are altogether 20 of them. I also ran a very simple, naive deduplication based on payload hash and URLs. The final product, with better deduplication, would have fewer tokens, maybe even half as many. The first, 2012 crawl, which should be the largest, is not included in these runs. The surprisingly low token count for Chinese is explained by whitespace tokenization.
What can be said about this? I suppose it's fairly safe to say that CommonCrawl would provide billion-token-range data for the largest number of languages.
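For the record, the naive deduplication is conceptually no more than this (a simplified sketch; the URL and payload-hash checks that give the two separate columns below are folded into one function here, and the field names are invented):

```python
# Simplified sketch of the naive deduplication: a document is kept only if neither
# its URL nor the hash of its payload has been seen before. In the real stats the
# two criteria are applied separately (URL_dedup vs Doc_dedup columns).
import hashlib

seen_urls, seen_hashes = set(), set()

def keep_document(url, payload_bytes):
    digest = hashlib.sha1(payload_bytes).hexdigest()
    if url in seen_urls or digest in seen_hashes:
        return False
    seen_urls.add(url)
    seen_hashes.add(digest)
    return True
```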
Stats over 6 crawls: all_docs 6,009,146,472; uniq_docs 2,298,220,862; uniq_urls 1,771,487,131
Language | URL_dedup_tokens | Doc_dedup_tokens | all_tokens |
---|---|---|---|
ENGLISH | 1,023,587,042,618 | 1,677,367,510,883 | 3,526,675,699,180 |
SPANISH | 16,189,502,598 | 34,289,896,881 | 63,946,174,729 |
FRENCH | 11,087,090,556 | 21,428,125,601 | 40,592,352,282 |
GERMAN | 8,172,422,225 | 16,695,836,957 | 31,054,854,843 |
PORTUGUESE | 6,593,084,831 | 15,215,699,728 | 26,825,085,303 |
INDONESIAN | 4,430,657,740 | 9,160,371,849 | 16,990,145,585 |
ITALIAN | 3,593,687,086 | 7,358,890,918 | 14,158,495,943 |
VIETNAMESE | 2,927,943,557 | 7,185,918,017 | 11,852,237,633 |
POLISH | 2,822,902,691 | 6,665,657,235 | 11,713,111,394 |
TURKISH | 2,512,646,647 | 5,869,505,132 | 10,066,737,245 |
MALAY | 1,463,837,283 | 3,899,122,319 | 6,590,447,461 |
DUTCH | 1,778,600,632 | 3,637,081,424 | 6,562,328,786 |
PERSIAN | 1,547,816,326 | 3,476,375,556 | 6,525,711,329 |
RUSSIAN | 1,490,584,813 | 3,417,309,389 | 5,879,070,412 |
ARABIC | 1,415,924,103 | 3,107,635,883 | 5,458,046,426 |
SWEDISH | 1,295,428,057 | 2,857,131,419 | 5,024,027,916 |
ROMANIAN | 993,157,281 | 2,468,215,333 | 4,221,697,965 |
DANISH | 920,113,934 | 1,855,634,725 | 3,327,735,988 |
GREEK | 832,142,246 | 1,812,193,771 | 3,042,777,999 |
Japanese | 674,098,452 | 1,418,113,948 | 2,497,732,129 |
HUNGARIAN | 635,528,977 | 1,265,579,426 | 2,353,087,104 |
NORWEGIAN | 655,426,515 | 1,235,519,646 | 2,304,746,446 |
CZECH | 543,504,852 | 1,211,681,792 | 2,171,958,288 |
THAI | 569,657,722 | 1,214,394,268 | 2,128,109,874 |
Korean | 593,515,036 | 1,089,819,776 | 1,857,861,812 |
FINNISH | 402,311,440 | 892,884,513 | 1,666,577,230 |
CROATIAN | 316,928,672 | 728,650,346 | 1,361,129,515 |
ChineseT | 363,401,223 | 790,161,390 | 1,321,295,490 |
HEBREW | 300,918,111 | 748,869,522 | 1,260,833,512 |
SERBIAN | 285,582,740 | 679,099,543 | 1,218,452,896 |
Chinese | 317,795,368 | 706,205,585 | 1,205,332,261 |
CATALAN | 274,061,869 | 468,498,516 | 964,881,957 |
SLOVAK | 221,595,288 | 522,150,294 | 910,951,683 |
BULGARIAN | 179,716,234 | 403,235,289 | 735,670,548 |
LATIN | 236,923,977 | 228,414,965 | 701,559,746 |
SLOVENIAN | 192,844,991 | 318,985,990 | 658,547,084 |
LITHUANIAN | 153,997,785 | 325,959,450 | 586,431,636 |
UKRAINIAN | 147,190,310 | 306,086,561 | 576,959,192 |
ALBANIAN | 126,806,325 | 292,637,296 | 530,843,270 |
LATVIAN | 110,076,820 | 284,247,190 | 492,956,121 |
ESTONIAN | 107,684,809 | 257,102,186 | 441,159,792 |
GALICIAN | 122,783,063 | 222,267,183 | 413,174,643 |
HINDI | 100,144,675 | 237,366,680 | 410,405,979 |
BOSNIAN | 82,725,643 | 170,657,852 | 336,884,587 |
TAMIL | 63,337,522 | 149,605,986 | 261,001,034 |
GEORGIAN | 67,323,770 | 151,769,031 | 258,893,994 |
ICELANDIC | 58,779,458 | 121,762,169 | 223,594,013 |
BASQUE | 70,805,720 | 113,421,567 | 217,033,427 |
TAGALOG | 56,632,156 | 98,790,374 | 212,354,568 |
ARMENIAN | 35,145,729 | 86,541,608 | 147,074,685 |
URDU | 27,292,318 | 81,393,839 | 145,885,525 |
NORWEGIAN_N | 37,081,693 | 72,097,929 | 138,440,038 |
BENGALI | 34,246,855 | 61,951,573 | 131,278,090 |
AZERBAIJANI | 33,696,525 | 58,673,140 | 131,042,668 |
AFRIKAANS | 30,242,337 | 54,159,904 | 116,173,569 |
MONGOLIAN | 25,616,930 | 62,619,843 | 107,520,953 |
SWAHILI | 24,327,593 | 52,391,639 | 91,247,328 |
WELSH | 23,932,322 | 40,186,666 | 87,543,649 |
SOMALI | 20,633,818 | 43,929,624 | 81,849,218 |
BURMESE | 18,911,637 | 46,694,348 | 80,055,830 |
ESPERANTO | 24,162,475 | 42,917,975 | 78,914,761 |
MACEDONIAN | 18,768,365 | 36,165,866 | 63,768,547 |
IRISH | 16,827,065 | 28,201,390 | 60,390,813 |
FRISIAN | 19,449,482 | 29,811,283 | 59,983,194 |
MARATHI | 16,650,742 | 34,063,315 | 59,246,376 |
SINHALESE | 12,747,858 | 32,637,833 | 56,462,372 |
SANSKRIT | 11,624,233 | 13,411,901 | 46,844,573 |
WARAY_PHILIPPINES | 13,194,221 | 25,139,977 | 45,747,693 |
TELUGU | 14,225,578 | 25,020,984 | 44,710,886 |
KAZAKH | 12,590,688 | 25,479,548 | 44,640,443 |
GUJARATI | 10,973,045 | 27,066,607 | 43,886,069 |
JAVANESE | 11,460,132 | 20,862,987 | 43,235,936 |
MALAYALAM | 9,224,931 | 20,462,045 | 38,999,862 |
MALTESE | 11,017,805 | 19,728,654 | 38,366,818 |
BELARUSIAN | 11,352,295 | 21,358,337 | 37,962,691 |
NEPALI | 8,835,408 | 20,167,955 | 34,837,644 |
OCCITAN | 10,256,307 | 16,948,299 | 32,040,961 |
KANNADA | 6,008,742 | 12,068,448 | 28,600,499 |
UZBEK | 7,761,951 | 15,411,522 | 28,323,167 |
MALAGASY | 7,729,772 | 13,618,366 | 24,807,825 |
KURDISH | 7,022,299 | 13,569,508 | 24,707,962 |
KINYARWANDA | 5,499,790 | 12,123,757 | 21,965,484 |
SUNDANESE | 5,796,456 | 11,211,181 | 21,323,741 |
X_PIG_LATIN | 4,499,844 | 12,948,606 | 21,140,731 |
BRETON | 6,438,596 | 11,776,062 | 21,094,628 |
LUXEMBOURGISH | 5,381,089 | 10,861,358 | 20,103,225 |
TATAR | 4,879,638 | 11,371,707 | 18,595,881 |
CEBUANO | 4,843,152 | 7,635,779 | 17,571,279 |
CORSICAN | 4,608,694 | 8,128,011 | 15,785,012 |
KHMER | 4,369,065 | 8,725,173 | 15,776,727 |
VOLAPUK | 5,012,606 | 8,166,113 | 15,733,739 |
SCOTS | 4,117,582 | 7,048,981 | 14,076,360 |
PASHTO | 3,532,258 | 6,462,471 | 13,701,722 |
PUNJABI | 2,951,249 | 7,032,168 | 13,419,918 |
SCOTS_GAELIC | 3,533,904 | 5,563,327 | 12,664,666 |
ZULU | 2,605,589 | 7,183,059 | 11,326,880 |
MAORI | 2,840,226 | 2,797,603 | 11,278,285 |
HAITIAN_CREOLE | 2,890,479 | 4,377,633 | 10,465,738 |
AMHARIC | 2,207,741 | 5,575,966 | 10,195,315 |
GUARANI | 2,933,467 | 5,241,671 | 9,698,104 |
FAROESE | 2,295,231 | 4,398,177 | 9,097,236 |
YIDDISH | 2,501,206 | 4,658,451 | 9,058,367 |
INTERLINGUA | 2,173,527 | 4,503,699 | 8,952,794 |
INTERLINGUE | 2,083,903 | 4,160,768 | 8,917,449 |
MANX | 2,508,853 | 3,908,367 | 8,370,474 |
HMONG | 1,765,436 | 4,295,345 | 7,860,904 |
YORUBA | 2,175,707 | 3,980,305 | 7,785,787 |
QUECHUA | 2,169,211 | 3,642,825 | 7,125,954 |
XHOSA | 1,421,423 | 1,821,204 | 6,002,756 |
X_Inherited | 1,347,331 | 2,342,758 | 5,470,161 |
TIGRINYA | 1,358,925 | 2,810,672 | 5,338,706 |
WOLOF | 1,482,639 | 2,338,403 | 4,918,829 |
NYANJA | 1,841,031 | 2,527,580 | 4,870,942 |
OROMO | 995,928 | 2,367,024 | 4,717,752 |
TAJIK | 1,436,856 | 2,626,026 | 4,640,416 |
LINGALA | 1,847,238 | 1,427,856 | 4,497,394 |
LAOTHIAN | 1,125,083 | 1,875,476 | 4,341,474 |
SYRIAC | 1,035,611 | 1,720,316 | 4,189,837 |
RHAETO_ROMANCE | 1,021,279 | 2,006,881 | 4,184,265 |
TURKMEN | 1,052,116 | 2,092,793 | 4,072,380 |
DHIVEHI | 993,251 | 1,982,284 | 4,056,358 |
SAMOAN | 1,102,124 | 2,197,801 | 4,050,670 |
KYRGYZ | 1,072,226 | 2,266,320 | 3,863,021 |
HAWAIIAN | 1,086,648 | 1,572,114 | 3,765,937 |
BISLAMA | 863,895 | 1,265,793 | 3,493,457 |
SHONA | 743,142 | 1,446,468 | 3,301,851 |
KHASI | 839,609 | 1,589,877 | 3,045,983 |
HAUSA | 665,354 | 1,308,644 | 2,863,878 |
UIGHUR | 637,088 | 1,495,763 | 2,704,179 |
ORIYA | 703,281 | 1,174,859 | 2,614,934 |
AFAR | 717,168 | 1,393,736 | 2,595,820 |
RUNDI | 489,116 | 1,116,006 | 2,302,872 |
BASHKIR | 675,280 | 1,342,257 | 2,272,566 |
FIJIAN | 479,524 | 864,224 | 2,148,683 |
ASSAMESE | 608,311 | 1,154,094 | 2,124,276 |
TIBETAN | 460,126 | 1,133,416 | 1,931,755 |
TONGA | 657,653 | 982,036 | 1,865,090 |
MAURITIAN_CREOLE | 600,557 | 912,447 | 1,751,372 |
X_KLINGON | 470,153 | 794,508 | 1,644,442 |
SISWANT | 384,381 | 859,781 | 1,637,067 |
DZONGKHA | 308,242 | 703,081 | 1,555,403 |
BIHARI | 444,950 | 757,440 | 1,518,590 |
SESELWA | 364,897 | 629,434 | 1,502,753 |
GREENLANDIC | 366,761 | 688,240 | 1,488,818 |
TSWANA | 467,311 | 668,936 | 1,382,196 |
X_Coptic | 258,132 | 165,055 | 1,309,616 |
X_Nko | 208,081 | 476,882 | 1,238,082 |
TSONGA | 299,096 | 714,360 | 1,227,968 |
SESOTHO | 347,874 | 513,630 | 1,219,544 |
GANDA | 254,317 | 453,709 | 1,115,844 |
SINDHI | 257,203 | 419,481 | 1,012,752 |
AYMARA | 271,265 | 551,513 | 967,027 |
INUKTITUT | 217,161 | 356,068 | 771,216 |
IGBO | 240,115 | 306,835 | 767,040 |
AKAN | 191,696 | 364,848 | 740,564 |
SANGO | 128,484 | 340,026 | 707,073 |
NAURU | 160,588 | 311,511 | 600,695 |
CHEROKEE | 165,970 | 343,437 | 575,695 |
PEDI | 124,824 | 271,650 | 518,495 |
ZHUANG | 156,777 | 145,582 | 391,411 |
INUPIAK | 87,988 | 156,547 | 326,705 |
X_Samaritan | 104,462 | 169,288 | 324,546 |
VENDA | 69,312 | 91,228 | 276,483 |
ABKHAZIAN | 60,591 | 106,647 | 228,396 |
X_Gothic | 43,287 | 74,845 | 151,736 |
X_Tifinagh | 12,172 | 48,538 | 79,492 |
X_Yi | 9,496 | 25,274 | 37,509 |
KASHMIRI | 6,306 | 6,767 | 26,029 |
X_Vai | 6,171 | 3,480 | 25,191 |
X_Syloti_Nagri | 3,836 | 9,637 | 18,339 |
X_Shavian | 5,042 | 5,042 | 12,404 |
X_Bopomofo | 1,147 | 4,054 | 8,600 |
X_Deseret | 520 | 818 | 7,438 |
X_Javanese | 3,589 | 3,630 | 6,331 |
X_Buginese | 1,589 | 2,880 | 5,455 |
NDEBELE | 1,509 | 1,500 | 5,445 |
LIMBU | 160 | 2,526 | 5,098 |
X_Egyptian_Hieroglyphs | 1,230 | 1,230 | 4,878 |
X_Old_Turkic | 1,849 | 1,820 | 1,820 |
X_Tai_Tham | 461 | 598 | 1,752 |
X_Glagolitic | 246 | 127 | 1,093 |
X_Rejang | 179 | 140 | 1,017 |
X_Saurashtra | 134 | 176 | 996 |
X_Meetei_Mayek | 180 | 90 | 720 |
@mjluot Thank you very much for all the numbers.
For UD 1.3 treebanks with 10k+ test set, the results of url_dedup_tokens are:
Language | URL_dedup_tokens |
---|---|
en | 1T |
en_esl | 1T |
es_ancora | 16G |
de | 8G |
pt_br | 6.6G |
id | 4.4G |
it | 3.6G |
ru_syntagrus | 1.5G |
ru | 1.5G |
fa | 1.5G |
ar | 1.4G |
sv | 1.3G |
ro | 993M |
grc_proiel | 832M |
grc | 832M |
ja_ktc | 674M |
no | 655M |
cs_cac | 543M |
cs | 543M |
fi_ftb | 402M |
zh | 317M (with bad tokenization) |
he | 300M |
ca | 275M |
la_proiel | 236M |
sl | 192M |
bg | 180M |
gl | 122M |
et | 107M |
hi | 100M |
eu | 70M |
Therefore, for 20 out of 30 treebanks we have 500M+ tokens, and less for the remaining 10. Maybe we can try gathering more for the <500M treebanks using the rest of the crawl-sets, which would (approximately, just multiplying by 20/6) get us to 26 of the 30 languages having 500M+ tokens, with the rest at circa 400M, 350M, 300M and 200M.
All in all, I think this (CommonCrawl+CLD2/CLD3?) can provide "enough" data for our purposes :-)
Great. Hopefully, in UD2.0 there will be more treebanks with 10k+ test set, so perhaps we should send an email to the UD mailing list with a call for "possibly 10k+ treebanks" before we restrict the set of crawled languages (if such restriction is needed at all).
Ha ha, hope we are not dropping Finnish because our test set happens to be 9,140 tokens at this moment. :D There are a number of treebanks that can be re-split to fit the 10K goal. Now, is this something we want to do officially for the UD release, or is it something we do for CoNLL only? I think any treebank with, say, 20K words or above should qualify here.
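For anyone who wants to check their own treebank against the threshold, counting test-set tokens is only a few lines (a quick sketch; it counts syntactic words, skipping comments, multiword-token ranges and empty nodes, and the file name is just an example):

```python
# Quick check against the 10K test-set rule: count syntactic words in a CoNLL-U
# file, skipping comment lines, multiword-token ranges (IDs like "1-2") and
# empty nodes (IDs like "1.1"). The file name is only an example.
def count_tokens(conllu_path):
    n = 0
    with open(conllu_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            token_id = line.split("\t", 1)[0]
            if "-" in token_id or "." in token_id:
                continue
            n += 1
    return n

print(count_tokens("fi-ud-test.conllu"))
```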
UD releases do not have any restriction on size and I hope they never will. So the 10K+ test set is for CoNLL only. However, if any treebank requires resplitting in order to fit in CoNLL, the same split should be used in the subsequent UD releases. We could justify the resplitting also by switching to v2 now, but in general I believe we want to avoid resplitting between UD releases as much as possible.
As for the 10k+ rule, I am using it just because it was listed in the Berlin notes; personally, I am not convinced it is the right one (but I was not there, so I cannot object, as I do not know the reasons behind it).
I do not think we need to fix the list of treebanks included in CoNLL in advance; I believe we can get additional resources reasonably fast.
I completely agree with Dan. The coincidence of CoNLL and v2 provides a unique opportunity for resplitting. Ideally, no resplitting should occur after (or before) that. New treebanks that are created afterwards should adhere to the 10K rule.
:+1:
I think this issue can be closed now (which is not to say that there is no work to do :)). To summarize, we promise to provide 500M+ words for most languages, and as much as possible for the others, together with pre-computed embeddings. We also say that there will be a "call for resources" so that other freely available corpora can be made part of the task if there is demand.
Hello, I am currently working on a side project that needs Yoruba sentences; n-grams would also be helpful. I tried downloading the Yoruba n-grams and words from https://ufal.mff.cuni.cz/~majlis/w2c/download.html. I used the unarchiver to unzip them, but I always get an error that the file is corrupt. Am I doing something wrong? If so, please describe to me how you access the data. The link to my project is here -> https://github.com/Timilehin/Yoruba-Intonator/ Thanks!
I tried downloading wiki.yor.txt.gz from https://ufal.mff.cuni.cz/~majlis/w2c/download.html and decompressing it with gunzip. It gives me the message `gzip: wiki.yor.txt.gz: decompression OK, trailing garbage ignored`, but the content seems to be OK. Then I tried web.yor.txt.gz, and there most of the sentences look strange (with words like `òåõíîëîãèÿäà`, which seems like an encoding problem).
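If the unarchiver keeps reporting corruption, the file can also be read directly from Python while ignoring the trailing bytes, which is essentially what gunzip's warning means (a small sketch, assuming a single gzip member followed by junk):

```python
# Decompress one gzip member and ignore anything after it, mimicking
# gunzip's "decompression OK, trailing garbage ignored" behaviour.
import zlib

def read_gzip_ignoring_trailing_garbage(path, encoding="utf-8"):
    d = zlib.decompressobj(zlib.MAX_WBITS | 16)   # +16 => expect a gzip header
    chunks = []
    with open(path, "rb") as f:
        while True:
            block = f.read(1 << 20)
            if not block:
                break
            chunks.append(d.decompress(block))
            if d.eof:                             # end of the gzip member reached
                break
    chunks.append(d.flush())
    return b"".join(chunks).decode(encoding, errors="replace")

print(read_gzip_ignoring_trailing_garbage("wiki.yor.txt.gz")[:200])
```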
Note that W2C was crawled from web/wikipedia and the languages were detected automatically, so the data contains a lot of noise.
W2C is also available for download from the LINDAT repository.
Word embeddings are increasingly popular and provide nice accuracy gains in most parsers nowadays. I agree that we want to keep things simple (and hence there is a cost to allowing additional resources), but how about providing precomputed word2vec embeddings? If we don't do it, we will end up with some artificially impoverished parsers... I/We (=Google) can probably help generate these word embeddings if needed.