ufal / conll2017

CoNLL 2017 Shared Task Proposal: UD End-to-End parsing

Additional Resources? #4

Closed slavpetrov closed 7 years ago

slavpetrov commented 7 years ago

Word embeddings are increasingly popular and provide nice accuracy gains in most parsers nowadays. I agree that we want to keep things simple (and hence there is a cost to allowing additional resources), but how about providing precomputed word2vec embeddings? If we don't do it, we will end up with some artificially impoverished parsers... I/We (=Google) can probably help generate these word embeddings if needed.

jnivre commented 7 years ago

I support this suggestion if feasible.

foxik commented 7 years ago

We thought about word embeddings as well -- that is why the raw corpora are part of the available resources (and any other texts are disallowed).

The problem with precomputed word embeddings is that it is not obvious which ones should be used. Skip-gram with negative sampling embeddings have been used for quite some time and seem to work well. But recently Structured Skip-gram embeddings have been used too (wang2vec can compute them, for example), some people use GloVe, and others might want multilingual embeddings (i.e., projecting words of different languages into the same space). A similar issue arises with the dimension -- we are talking about tuning on ~40 corpora, so providing word embeddings with too large a dimension would require a lot of computational power. Therefore, we decided to give the participants only the raw texts and leave the word embedding computation to them -- that way, any advantage gained from word embeddings comes from the method, not from the amount of data.

I added an explicit note to the proposal that word embeddings can be computed only on the given raw texts.

However, if you think that we should also provide precomputed word embeddings, we can easily do so (we can compute them without any problems). If so, which method/dimension do you think we should use? What about something like skip-gram with negative sampling, dimension 100 [the dimension is the same as in the original Stack-LSTM paper]?
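For concreteness, a minimal sketch of how such a baseline could be computed with gensim; the tool choice, the file names and all parameters other than the dimension of 100 are illustrative assumptions, not part of the proposal:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Skip-gram with negative sampling, dimension 100, trained on one language's
# raw corpus with one tokenized sentence per line ("cs.raw.txt" is hypothetical).
sentences = LineSentence("cs.raw.txt")
model = Word2Vec(sentences, vector_size=100, sg=1, negative=5,
                 window=5, min_count=5, workers=8)  # vector_size was "size" in gensim < 4
model.wv.save_word2vec_format("cs.vectors.txt")     # plain word2vec text format
```

Any sufficiently fast word2vec implementation would do equally well here; the point is only that the baseline can be produced with off-the-shelf defaults.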

slavpetrov commented 7 years ago

How big are the raw corpora that are going to be made available? If they are sufficiently large to train good word embeddings on them, then everybody can use their preferred method and we don't need to supply pretrained embeddings...

foxik commented 7 years ago

The corpora we currently have are not that big (I should say quite small from the perspective of word embeddings, see https://ufal.mff.cuni.cz/~majlis/w2c/download.html for sizes).

However, there was a project to gather much larger corpora (~billions of tokens) for UD languages.

@fginter, are you still working on the "gather corpora for all UD languages" project? Do you think that the corpora could be used in the proposed UD end-to-end parsing CoNLL-17 shared task? The hard deadline for the data to be available would be March 2017. (BTW, we are only proposing the shared task at this moment.)

fginter commented 7 years ago

Hi

Yes, this is in the works. Our current pipeline uses CommonCrawl data and the CLD2 language recognition library. CLD2 was chosen for its speed. Currently we have a complete run of the latest CommonCrawl with CLD2 done, so we can gather stats of which UD languages are covered and to what extent. @mjluot runs the pipeline and can gather the stats.
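As an illustration of the per-document language identification step, here is a minimal sketch using the pycld2 Python bindings; the binding choice and the way documents are fed in are assumptions, not a description of the actual pipeline:

```python
import pycld2 as cld2

def detect_language(text):
    """Return CLD2's best language code for one document, or None if unreliable."""
    is_reliable, _bytes_found, details = cld2.detect(text)
    if not is_reliable:
        return None
    _name, code, _percent, _score = details[0]  # best guess is listed first
    return code
```

In practice one would call this on every extracted document and bucket the documents by the returned code before counting tokens.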

It is clear that some of the more exotic UD languages will not be covered by CLD2, and many of these won't have that much data on the Web to begin with anyway.

On a more general note: there are people who will have corpora for some of the rarer languages. Maybe we could make a "call for resources" prior to the shared task, where people could contribute the resources they'd like to use, making them officially available to others as well.

-Filip

foxik commented 7 years ago

Our (small) W2C corpus is also web-based and contains all UD 1.2 languages except for Gothic, Old Church Slavonic, and Ancient Greek, so the language coverage on the web is reasonably good.

As for the "call for resources" -- that is definitely a good idea. The proposal should probably contain only the data which we ourselves can deliver, but if it gets accepted, we could ask people to chip in.

So for the time being, I am interested in whether we can gather reasonably large (~gigaword) corpora for most (>=50%) of the languages -- if so, we can formulate the proposal so that we provide the raw data and word embeddings may be computed only from those data. Then, hopefully, other people could help us by providing data for the languages we could not get from CommonCrawl.

fginter commented 7 years ago

@mjluot will give you an answer shortly on how many UD languages we may be able to get ~gigaword corpora for.

jnivre commented 7 years ago

All this sounds very good, but I would still like to make a plea for providing "baseline" word embeddings as well as the raw text data. One of the hallmarks of the CoNLL shared tasks has always been to make it easy for a wide range of researchers to participate. I fear that having to compute the embeddings for all these languages from scratch (even given the data) will be too high a threshold for some of the less advanced groups (and their results will suffer accordingly). The more advanced groups will probably want to use their own favorite embeddings to get an extra edge, so providing both the text data and the embeddings, although in some sense superfluous, will facilitate broad participation. Of course, all this has to be weighed against the extra burden put on the organisers.

fginter commented 7 years ago

:+1: Running word2vec (or any of its sufficiently fast alternatives) with some reasonable default parameters on this data will not be hard.

foxik commented 7 years ago

Yes, we will also provide baseline word embeddings.

From my point of view, we are trying to decide between:

Hopefully we will be able to go with the first option. I have updated the proposal to reflect this.

mjluot commented 7 years ago

I'm working on some preliminary stats on the CommonCrawl data but hit a small delay. Will get them soon.

slavpetrov commented 7 years ago

Sounds good. BTW, we recently released CLD3, an updated version of CLD2: https://github.com/google/cld3

martinpopel commented 7 years ago

We should decide

foxik commented 7 years ago

Additional questions:

martinpopel commented 7 years ago

Just a note on multilingual embeddings: the main question is whether we will provide some parallel data (at least a small amount as a seed dictionary, each language paired with English; it does not have to overlap with the treebank data). Without such data it is very difficult for the participants to compute multilingual embeddings themselves.
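To illustrate why a seed dictionary matters, here is a sketch of the simplest kind of mapping it enables: a least-squares linear projection between embedding spaces in the spirit of Mikolov et al. (2013). The function and array names are purely illustrative:

```python
import numpy as np

def learn_mapping(X, Y):
    """Learn a linear map W minimizing ||XW - Y||_F.

    X: (n, d) embeddings of source-language words from the seed dictionary,
    Y: (n, d) embeddings of their English translations.
    """
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

# A source word vector x can then be compared against English vectors via x @ W,
# giving a shared cross-lingual space -- without the seed pairs there is nothing
# to fit W on.
```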

dan-zeman commented 7 years ago

We don't provide parallel data. We don't have it for all languages, it is not always easy to obtain, and we don't want to exclude a language just because it does not have parallel data.

fginter commented 7 years ago

We should not exclude a language because it does not have parallel data. Some languages are better resourced than others, and that's fine. But I don't agree that we should provide no parallel data, dictionaries, or anything else that would let us bind the languages lexically. Forcing only delexicalized methods is not a realistic setting, and I think it could be very exciting to see some actual lexicalized transfer happening in the shared task. Anyone is still free and welcome to default to a single-language mode in their submissions, but I think we should support those who truly want to aim for lexicalized transfer.

mjluot commented 7 years ago

So, I ran a language recognizer (CLD2, because of its nice Python bindings) over 6 of these CommonCrawl crawl sets; there are 20 of them altogether. I also ran a very simple, naive deduplication based on payload hash and URLs, so the final product, with better deduplication, would have fewer tokens, maybe even half. The first (2012) crawl, which should be the largest, is not included in these counts. The surprisingly low token count for Chinese is explained by whitespace tokenization.
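For illustration, a minimal sketch of the kind of URL- and payload-hash deduplication described above; the (url, payload) record format is a simplified assumption, not the actual pipeline's input:

```python
import hashlib

def deduplicate(records):
    """Keep the first occurrence of each URL and of each payload hash.

    `records` is an iterable of (url, payload_bytes) pairs -- a simplified
    stand-in for whatever the real crawl reader yields.
    """
    seen_urls, seen_hashes = set(), set()
    for url, payload in records:
        digest = hashlib.sha1(payload).hexdigest()
        if url in seen_urls or digest in seen_hashes:
            continue
        seen_urls.add(url)
        seen_hashes.add(digest)
        yield url, payload
```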

What can be said about this? I suppose it's fairly safe to say that CommonCrawl data would provide billion-token-range corpora for a good number of languages.

6 crawls

all_docs 6,009,146,472 uniq_docs 2,298,220,862 uniq_urls 1,771,487,131

Language URL_dedup_tokens Doc_dedup_tokens  all_tokens
ENGLISH 1,023,587,042,618 1,677,367,510,883 3,526,675,699,180
SPANISH 16,189,502,598 34,289,896,881 63,946,174,729
FRENCH 11,087,090,556 21,428,125,601 40,592,352,282
GERMAN 8,172,422,225 16,695,836,957 31,054,854,843
PORTUGUESE 6,593,084,831 15,215,699,728 26,825,085,303
INDONESIAN 4,430,657,740 9,160,371,849 16,990,145,585
ITALIAN 3,593,687,086 7,358,890,918 14,158,495,943
VIETNAMESE 2,927,943,557 7,185,918,017 11,852,237,633
POLISH 2,822,902,691 6,665,657,235 11,713,111,394
TURKISH 2,512,646,647 5,869,505,132 10,066,737,245
MALAY 1,463,837,283 3,899,122,319 6,590,447,461
DUTCH 1,778,600,632 3,637,081,424 6,562,328,786
PERSIAN 1,547,816,326 3,476,375,556 6,525,711,329
RUSSIAN 1,490,584,813 3,417,309,389 5,879,070,412
ARABIC 1,415,924,103 3,107,635,883 5,458,046,426
SWEDISH 1,295,428,057 2,857,131,419 5,024,027,916
ROMANIAN 993,157,281 2,468,215,333 4,221,697,965
DANISH 920,113,934 1,855,634,725 3,327,735,988
GREEK 832,142,246 1,812,193,771 3,042,777,999
Japanese 674,098,452 1,418,113,948 2,497,732,129
HUNGARIAN 635,528,977 1,265,579,426 2,353,087,104
NORWEGIAN 655,426,515 1,235,519,646 2,304,746,446
CZECH 543,504,852 1,211,681,792 2,171,958,288
THAI 569,657,722 1,214,394,268 2,128,109,874
Korean 593,515,036 1,089,819,776 1,857,861,812
FINNISH 402,311,440 892,884,513 1,666,577,230
CROATIAN 316,928,672 728,650,346 1,361,129,515
ChineseT 363,401,223 790,161,390 1,321,295,490
HEBREW 300,918,111 748,869,522 1,260,833,512
SERBIAN 285,582,740 679,099,543 1,218,452,896
Chinese 317,795,368 706,205,585 1,205,332,261
CATALAN 274,061,869 468,498,516 964,881,957
SLOVAK 221,595,288 522,150,294 910,951,683
BULGARIAN 179,716,234 403,235,289 735,670,548
LATIN 236,923,977 228,414,965 701,559,746
SLOVENIAN 192,844,991 318,985,990 658,547,084
LITHUANIAN 153,997,785 325,959,450 586,431,636
UKRAINIAN 147,190,310 306,086,561 576,959,192
ALBANIAN 126,806,325 292,637,296 530,843,270
LATVIAN 110,076,820 284,247,190 492,956,121
ESTONIAN 107,684,809 257,102,186 441,159,792
GALICIAN 122,783,063 222,267,183 413,174,643
HINDI 100,144,675 237,366,680 410,405,979
BOSNIAN 82,725,643 170,657,852 336,884,587
TAMIL 63,337,522 149,605,986 261,001,034
GEORGIAN 67,323,770 151,769,031 258,893,994
ICELANDIC 58,779,458 121,762,169 223,594,013
BASQUE 70,805,720 113,421,567 217,033,427
TAGALOG 56,632,156 98,790,374 212,354,568
ARMENIAN 35,145,729 86,541,608 147,074,685
URDU 27,292,318 81,393,839 145,885,525
NORWEGIAN_N 37,081,693 72,097,929 138,440,038
BENGALI 34,246,855 61,951,573 131,278,090
AZERBAIJANI 33,696,525 58,673,140 131,042,668
AFRIKAANS 30,242,337 54,159,904 116,173,569
MONGOLIAN 25,616,930 62,619,843 107,520,953
SWAHILI 24,327,593 52,391,639 91,247,328
WELSH 23,932,322 40,186,666 87,543,649
SOMALI 20,633,818 43,929,624 81,849,218
BURMESE 18,911,637 46,694,348 80,055,830
ESPERANTO 24,162,475 42,917,975 78,914,761
MACEDONIAN 18,768,365 36,165,866 63,768,547
IRISH 16,827,065 28,201,390 60,390,813
FRISIAN 19,449,482 29,811,283 59,983,194
MARATHI 16,650,742 34,063,315 59,246,376
SINHALESE 12,747,858 32,637,833 56,462,372
SANSKRIT 11,624,233 13,411,901 46,844,573
WARAY_PHILIPPINES 13,194,221 25,139,977 45,747,693
TELUGU 14,225,578 25,020,984 44,710,886
KAZAKH 12,590,688 25,479,548 44,640,443
GUJARATI 10,973,045 27,066,607 43,886,069
JAVANESE 11,460,132 20,862,987 43,235,936
MALAYALAM 9,224,931 20,462,045 38,999,862
MALTESE 11,017,805 19,728,654 38,366,818
BELARUSIAN 11,352,295 21,358,337 37,962,691
NEPALI 8,835,408 20,167,955 34,837,644
OCCITAN 10,256,307 16,948,299 32,040,961
KANNADA 6,008,742 12,068,448 28,600,499
UZBEK 7,761,951 15,411,522 28,323,167
MALAGASY 7,729,772 13,618,366 24,807,825
KURDISH 7,022,299 13,569,508 24,707,962
KINYARWANDA 5,499,790 12,123,757 21,965,484
SUNDANESE 5,796,456 11,211,181 21,323,741
X_PIG_LATIN 4,499,844 12,948,606 21,140,731
BRETON 6,438,596 11,776,062 21,094,628
LUXEMBOURGISH 5,381,089 10,861,358 20,103,225
TATAR 4,879,638 11,371,707 18,595,881
CEBUANO 4,843,152 7,635,779 17,571,279
CORSICAN 4,608,694 8,128,011 15,785,012
KHMER 4,369,065 8,725,173 15,776,727
VOLAPUK 5,012,606 8,166,113 15,733,739
SCOTS 4,117,582 7,048,981 14,076,360
PASHTO 3,532,258 6,462,471 13,701,722
PUNJABI 2,951,249 7,032,168 13,419,918
SCOTS_GAELIC 3,533,904 5,563,327 12,664,666
ZULU 2,605,589 7,183,059 11,326,880
MAORI 2,840,226 2,797,603 11,278,285
HAITIAN_CREOLE 2,890,479 4,377,633 10,465,738
AMHARIC 2,207,741 5,575,966 10,195,315
GUARANI 2,933,467 5,241,671 9,698,104
FAROESE 2,295,231 4,398,177 9,097,236
YIDDISH 2,501,206 4,658,451 9,058,367
INTERLINGUA 2,173,527 4,503,699 8,952,794
INTERLINGUE 2,083,903 4,160,768 8,917,449
MANX 2,508,853 3,908,367 8,370,474
HMONG 1,765,436 4,295,345 7,860,904
YORUBA 2,175,707 3,980,305 7,785,787
QUECHUA 2,169,211 3,642,825 7,125,954
XHOSA 1,421,423 1,821,204 6,002,756
X_Inherited 1,347,331 2,342,758 5,470,161
TIGRINYA 1,358,925 2,810,672 5,338,706
WOLOF 1,482,639 2,338,403 4,918,829
NYANJA 1,841,031 2,527,580 4,870,942
OROMO 995,928 2,367,024 4,717,752
TAJIK 1,436,856 2,626,026 4,640,416
LINGALA 1,847,238 1,427,856 4,497,394
LAOTHIAN 1,125,083 1,875,476 4,341,474
SYRIAC 1,035,611 1,720,316 4,189,837
RHAETO_ROMANCE 1,021,279 2,006,881 4,184,265
TURKMEN 1,052,116 2,092,793 4,072,380
DHIVEHI 993,251 1,982,284 4,056,358
SAMOAN 1,102,124 2,197,801 4,050,670
KYRGYZ 1,072,226 2,266,320 3,863,021
HAWAIIAN 1,086,648 1,572,114 3,765,937
BISLAMA 863,895 1,265,793 3,493,457
SHONA 743,142 1,446,468 3,301,851
KHASI 839,609 1,589,877 3,045,983
HAUSA 665,354 1,308,644 2,863,878
UIGHUR 637,088 1,495,763 2,704,179
ORIYA 703,281 1,174,859 2,614,934
AFAR 717,168 1,393,736 2,595,820
RUNDI 489,116 1,116,006 2,302,872
BASHKIR 675,280 1,342,257 2,272,566
FIJIAN 479,524 864,224 2,148,683
ASSAMESE 608,311 1,154,094 2,124,276
TIBETAN 460,126 1,133,416 1,931,755
TONGA 657,653 982,036 1,865,090
MAURITIAN_CREOLE 600,557 912,447 1,751,372
X_KLINGON 470,153 794,508 1,644,442
SISWANT 384,381 859,781 1,637,067
DZONGKHA 308,242 703,081 1,555,403
BIHARI 444,950 757,440 1,518,590
SESELWA 364,897 629,434 1,502,753
GREENLANDIC 366,761 688,240 1,488,818
TSWANA 467,311 668,936 1,382,196
X_Coptic 258,132 165,055 1,309,616
X_Nko 208,081 476,882 1,238,082
TSONGA 299,096 714,360 1,227,968
SESOTHO 347,874 513,630 1,219,544
GANDA 254,317 453,709 1,115,844
SINDHI 257,203 419,481 1,012,752
AYMARA 271,265 551,513 967,027
INUKTITUT 217,161 356,068 771,216
IGBO 240,115 306,835 767,040
AKAN 191,696 364,848 740,564
SANGO 128,484 340,026 707,073
NAURU 160,588 311,511 600,695
CHEROKEE 165,970 343,437 575,695
PEDI 124,824 271,650 518,495
ZHUANG 156,777 145,582 391,411
INUPIAK 87,988 156,547 326,705
X_Samaritan 104,462 169,288 324,546
VENDA 69,312 91,228 276,483
ABKHAZIAN 60,591 106,647 228,396
X_Gothic 43,287 74,845 151,736
X_Tifinagh 12,172 48,538 79,492
X_Yi 9,496 25,274 37,509
KASHMIRI 6,306 6,767 26,029
X_Vai 6,171 3,480 25,191
X_Syloti_Nagri 3,836 9,637 18,339
X_Shavian 5,042 5,042 12,404
X_Bopomofo 1,147 4,054 8,600
X_Deseret 520 818 7,438
X_Javanese 3,589 3,630 6,331
X_Buginese 1,589 2,880 5,455
NDEBELE 1,509 1,500 5,445
LIMBU 160 2,526 5,098
X_Egyptian_Hieroglyphs 1,230 1,230 4,878
X_Old_Turkic 1,849 1,820 1,820
X_Tai_Tham 461 598 1,752
X_Glagolitic 246 127 1,093
X_Rejang 179 140 1,017
X_Saurashtra 134 176 996
X_Meetei_Mayek 180 90 720
foxik commented 7 years ago

@mjluot Thank you very much for all the numbers.

For UD 1.3 treebanks with a 10k+ token test set, the URL_dedup_tokens figures are:

Language URL_dedup_tokens
en 1T
en_esl 1T
es_ancora 16G
de 8G
pt_br 6.6G
id 4.4G
it 3.6G
ru_syntagrus 1.5G
ru 1.5G
fa 1.5G
ar 1.4G
sv 1.3G
ro 993M
grc_proiel 832M
grc 832M
ja_ktc 674M
no 655M
cs_cac 543M
cs 543M
fi_ftb 402M
zh 317M (with bad tokenization)
he 300M
ca 275M
la_proiel 236M
sl 192M
bg 180M
gl 122M
et 107M
hi 100M
eu 70M

Therefore, for 20 out of the 30 treebanks we have 500M+ tokens, and less for the remaining 10. Maybe we can try gathering more for the <500M treebanks using the rest of the crawl sets, which would get us (approximately, just by scaling by 20/6) to 26/30 languages with 500M+ tokens, with the rest at circa 400M, 350M, 300M and 200M.

All in all, I think this (CommonCrawl+CLD2/CLD3?) can provide "enough" data for our purposes :-)

martinpopel commented 7 years ago

Great. Hopefully, in UD 2.0 there will be more treebanks with a 10k+ test set, so perhaps we should send an email to the UD mailing list with a call for "possibly 10k+ treebanks" before we restrict the set of crawled languages (if such a restriction is needed at all).

fginter commented 7 years ago

Ha ha, I hope we are not dropping Finnish because our test set happens to be 9,140 tokens at this moment. :D There are a number of treebanks that can be re-split to fit the 10K goal. Now, is this something we want to do officially for the UD release, or something we do for CoNLL only? I think any treebank with, say, 20K words or above should qualify here.

dan-zeman commented 7 years ago

UD releases do not have any restriction on size and I hope they never will. So the 10K+ test set is for CoNLL only. However, if any treebank requires resplitting in order to fit in CoNLL, the same split should be used in the subsequent UD releases. We could justify the resplitting also by switching to v2 now, but in general I believe we want to avoid resplitting between UD releases as much as possible.

foxik commented 7 years ago

As for the 10k+ rule, I am using it just because it was listed in the Berlin notes; personally, I am not convinced it is the right one (but I was not there, so I cannot object, as I do not know the reasons behind it).

I do not think we need the list of treebanks that will be in CoNLL in advance; I believe we can gather additional resources reasonably fast.

jnivre commented 7 years ago

I completely agree with Dan. The coincidence of CoNLL and v2 provides a unique opportunity for resplitting. Ideally, no resplitting should occur after (or before) that. New treebanks that are created afterwards should adhere to the 10K rule.

fginter commented 7 years ago

:+1:

dan-zeman commented 7 years ago

I think this issue can be closed now (which is not to say that there is no work to do :)). To summarize, we promise to provide 500M+ words for most languages, and as much as possible for the others, together with pre-computed embeddings. We also say that there will be a "call for resources" so that other freely available corpora can be made part of the task if there is demand.

Timilehin commented 6 years ago

Hello, I am currently working on a side project that needs Yoruba sentences; n-grams would also be helpful. I tried downloading the Yoruba n-grams and words from https://ufal.mff.cuni.cz/~majlis/w2c/download.html. I used the unarchiver to unzip them, but I always get an error that the file is corrupt. Am I doing something wrong? If so, please describe how you access the data. The link to my project is here -> https://github.com/Timilehin/Yoruba-Intonator/ Thanks!

martinpopel commented 6 years ago

I tried downloading wiki.yor.txt.gz from https://ufal.mff.cuni.cz/~majlis/w2c/download.html and decompressing it with gunzip. It gives me the message "gzip: wiki.yor.txt.gz: decompression OK, trailing garbage ignored", but the content seems to be OK. Then I tried web.yor.txt.gz, and there most of the sentences look strange (with words like òåõíîëîãèÿäà, which looks like an encoding problem). Note that W2C was crawled from the web/Wikipedia and the languages were detected automatically, so the data contains a lot of noise. W2C is also available for download from the LINDAT repository.