ufal / conll2017

CoNLL 2017 Shared Task Proposal: UD End-to-End parsing

Additional Resources? #4

Closed slavpetrov closed 7 years ago

slavpetrov commented 7 years ago

Word embeddings are increasingly popular and provide nice accuracy gains in most parsers nowadays. I agree that we want to keep things simple (and hence there is a cost to allowing additional resources), but how about providing precomputed word2vec embeddings? If we don't do it, we will end up with some artificially impoverished parsers... I/We (=Google) can probably help generate these word embeddings if needed.

jnivre commented 7 years ago

I support this suggestion if feasible.

foxik commented 7 years ago

We thought about word embeddings as well -- that is why the raw corpora are part of the available resources (and any other texts are disallowed).

The problem with precomputed word embeddings is that it is not obvious which ones should be used. Skip-gram with negative sampling embeddings have been used for quite some time and seem to work well. But recently Structured Skip-gram embeddings have been used too (wang2vec can compute them, for example), some people use GloVe, and others might want multilingual embeddings (i.e., projecting words of different languages into the same space). A similar issue arises with the dimension -- we are talking about tuning on ~40 corpora, so providing word embeddings with too large a dimension would require a lot of computational power. Therefore, we decided to give the participants only the raw texts and leave the word embedding computation to them -- that way, any advantage gained from word embeddings comes from the method, not from the amount of data.

I added an explicit note to the proposal that word embeddings can be computed only on the given raw texts.

However, if you think that we should also provide precomputed word embeddings, we can easily do so (we can compute them without any problems). If so, which method/dimension do you think we should use? What about something like skip-gram with negative sampling, dimension 100 [the dimension is the same as in the original Stack-LSTM paper]?
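For concreteness, a minimal sketch of how such a baseline could be computed with gensim; the tool choice, the file names and all parameters other than the dimension of 100 are illustrative assumptions, not part of the proposal:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Skip-gram with negative sampling, dimension 100, trained on one language's
# raw corpus with one tokenized sentence per line ("cs.raw.txt" is hypothetical).
sentences = LineSentence("cs.raw.txt")
model = Word2Vec(sentences, vector_size=100, sg=1, negative=5,
                 window=5, min_count=5, workers=8)  # vector_size was "size" in gensim < 4
model.wv.save_word2vec_format("cs.vectors.txt")     # plain word2vec text format
```

Any sufficiently fast word2vec implementation would do equally well here; the point is only that the baseline can be produced with off-the-shelf defaults.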

slavpetrov commented 7 years ago

How big are the raw corpora that are going to be made available? If they are sufficiently large to train good word embeddings on them, then everybody can use their preferred method and we don't need to supply pretrained embeddings...

foxik commented 7 years ago

The corpora we currently have are not that big (I should say quite small from the perspective of word embeddings, see https://ufal.mff.cuni.cz/~majlis/w2c/download.html for sizes).

However, there was a project to gather much larger corpora (~billions of tokens) for UD languages.

@fginter, are you still working on the "gather corpora for all UD languages" project? Do you think that the corpora could be used in the proposed UD end-to-end parsing CoNLL-17 shared task? The hard deadline for the data to be available would be March 2017. (BTW, we are only proposing the shared task at this moment.)

fginter commented 7 years ago

Hi

Yes, this is in the works. Our current pipeline uses CommonCrawl data and the CLD2 language recognition library. CLD2 was chosen for its speed. Currently we have a complete run of the latest CommonCrawl with CLD2 done, so we can gather stats of which UD languages are covered and to what extent. @mjluot runs the pipeline and can gather the stats.
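As an illustration of the per-document language identification step, here is a minimal sketch using the pycld2 Python bindings; the binding choice and the way documents are fed in are assumptions, not a description of the actual pipeline:

```python
import pycld2 as cld2

def detect_language(text):
    """Return CLD2's best language code for one document, or None if unreliable."""
    is_reliable, _bytes_found, details = cld2.detect(text)
    if not is_reliable:
        return None
    _name, code, _percent, _score = details[0]  # best guess is listed first
    return code
```

In practice one would call this on every extracted document and bucket the documents by the returned code before counting tokens.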

It is clear that some of the more exotic UD languages will not be covered by CLD2, and many of these won't have that much data on the Web to begin with anyway.

On a more general note: there are people who will have corpora for some of the rarer languages. Maybe we could make a "call for resources" prior to the shared task, where people could contribute the resources they'd like to use, making them officially available to others as well.

-Filip

foxik commented 7 years ago

Our (small) W2C corpus is also web-based and contains all UD 1.2 languages except for Gothic, Old Church Slavonic, and Ancient Greek, so the language coverage on the web is reasonably good.

As for the "call for resources" -- that is definitely a good idea. The proposal should probably contain only the data which we ourselves can deliver, but if it gets accepted, we could ask people to chip in.

So for the time being, I am interested in whether we can gather reasonably large (~gigaword) corpora for most (>=50%) of the languages -- if so, we can formulate the proposal so that we provide the raw data and word embeddings may be computed only from those data. Then, hopefully, other people could help us by providing data for the languages we could not get from CommonCrawl.

fginter commented 7 years ago

@mjluot will give you an answer shortly on how many UD languages we may be able to get ~gigaword corpora for.

jnivre commented 7 years ago

All this sounds very good, but I would still like to make a plea for providing "baseline" word embeddings as well as the raw text data. One of the hallmarks of the CoNLL shared tasks has always been to make it easy for a wide range of researchers to participate. I fear that having to compute the embeddings for all these languages from scratch (even given the data) will be too high a threshold for some of the less advanced groups (and their results will suffer accordingly). The more advanced groups will probably want to use their own favorite embeddings to get an extra edge, so providing both the text data and the embeddings, although in some sense superfluous, will facilitate broad participation. Of course, all this has to be weighed against the extra burden put on the organisers.

fginter commented 7 years ago

:+1: Running word2vec (or any of its sufficiently fast alternatives) with some reasonable default parameters on this data will not be hard.

foxik commented 7 years ago

Yes, we will also provide baseline word embeddings.

From my point of view, we are trying to decide between:

Hopefully we will be able to go with the first option. I have updated the proposal to reflect this.

mjluot commented 7 years ago

I'm working on some preliminary stats on the CommonCrawl data but hit a small delay. Will get them soon.

slavpetrov commented 7 years ago

Sounds good. BTW, we recently released CLD3, an updated version of CLD2: https://github.com/google/cld3

martinpopel commented 7 years ago

We should decide

foxik commented 7 years ago

Additional questions:

martinpopel commented 7 years ago

Just a note on multilingual embeddings: the main question is whether we will provide some parallel data (at least a small amount as a seed dictionary, each language paired with English; it does not have to overlap with the treebank data). Without such data it is very difficult for the participants to compute multilingual embeddings themselves.
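To illustrate why a seed dictionary matters, here is a sketch of the simplest kind of mapping it enables: a least-squares linear projection between embedding spaces in the spirit of Mikolov et al. (2013). The function and array names are purely illustrative:

```python
import numpy as np

def learn_mapping(X, Y):
    """Learn a linear map W minimizing ||XW - Y||_F.

    X: (n, d) embeddings of source-language words from the seed dictionary,
    Y: (n, d) embeddings of their English translations.
    """
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

# A source word vector x can then be compared against English vectors via x @ W,
# giving a shared cross-lingual space -- without the seed pairs there is nothing
# to fit W on.
```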

dan-zeman commented 7 years ago

We don't provide parallel data. We don't have it for all languages, it is not always easy to obtain, and we don't want to exclude a language just because it does not have parallel data.

fginter commented 7 years ago

We should not exclude a language because it does not have parallel data. Some languages are better resourced than others, and that's fine. But I don't agree that we should provide no parallel data, dictionaries, or anything else that would let us bind the languages lexically. Forcing only delexicalized methods is not a realistic setting, and I think it could be very exciting to see some actual lexicalized transfer happening in the shared task. Anyone is still free and welcome to default to a single-language mode in their submissions, but I think we should support those who truly want to aim for lexicalized transfer.

mjluot commented 7 years ago

So, I ran a language recognizer (CLD2, because of its nice Python bindings) over 6 of these CommonCrawl crawl sets; there are 20 of them altogether. I also ran a very simple, naive deduplication based on payload hash and URLs, so the final product, with better deduplication, would have fewer tokens, maybe even half. The first (2012) crawl, which should be the largest, is not included in these counts. The surprisingly low token count for Chinese is explained by whitespace tokenization.
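For illustration, a minimal sketch of the kind of URL- and payload-hash deduplication described above; the (url, payload) record format is a simplified assumption, not the actual pipeline's input:

```python
import hashlib

def deduplicate(records):
    """Keep the first occurrence of each URL and of each payload hash.

    `records` is an iterable of (url, payload_bytes) pairs -- a simplified
    stand-in for whatever the real crawl reader yields.
    """
    seen_urls, seen_hashes = set(), set()
    for url, payload in records:
        digest = hashlib.sha1(payload).hexdigest()
        if url in seen_urls or digest in seen_hashes:
            continue
        seen_urls.add(url)
        seen_hashes.add(digest)
        yield url, payload
```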

What can be said about this? I suppose it's fairly safe to say that CommonCrawl data would provide billion-token-range corpora for a good number of languages.

6 crawls

all_docs 6,009,146,472 uniq_docs 2,298,220,862 uniq_urls 1,771,487,131

Language URL_dedup_tokens Doc_dedup_tokens  all_tokens
ENGLISH 1,023,587,042,618 1,677,367,510,883 3,526,675,699,180
SPANISH 16,189,502,598 34,289,896,881 63,946,174,729
FRENCH 11,087,090,556 21,428,125,601 40,592,352,282
GERMAN 8,172,422,225 16,695,836,957 31,054,854,843
PORTUGUESE 6,593,084,831 15,215,699,728 26,825,085,303
INDONESIAN 4,430,657,740 9,160,371,849 16,990,145,585
ITALIAN 3,593,687,086 7,358,890,918 14,158,495,943
VIETNAMESE 2,927,943,557 7,185,918,017 11,852,237,633
POLISH 2,822,902,691 6,665,657,235 11,713,111,394
TURKISH 2,512,646,647 5,869,505,132 10,066,737,245
MALAY 1,463,837,283 3,899,122,319 6,590,447,461
DUTCH 1,778,600,632 3,637,081,424 6,562,328,786
PERSIAN 1,547,816,326 3,476,375,556 6,525,711,329
RUSSIAN 1,490,584,813 3,417,309,389 5,879,070,412
ARABIC 1,415,924,103 3,107,635,883 5,458,046,426
SWEDISH 1,295,428,057 2,857,131,419 5,024,027,916
ROMANIAN 993,157,281 2,468,215,333 4,221,697,965
DANISH 920,113,934 1,855,634,725 3,327,735,988
GREEK 832,142,246 1,812,193,771 3,042,777,999
Japanese 674,098,452 1,418,113,948 2,497,732,129
HUNGARIAN 635,528,977 1,265,579,426 2,353,087,104
NORWEGIAN 655,426,515 1,235,519,646 2,304,746,446
CZECH 543,504,852 1,211,681,792 2,171,958,288
THAI 569,657,722 1,214,394,268 2,128,109,874
Korean 593,515,036 1,089,819,776 1,857,861,812
FINNISH 402,311,440 892,884,513 1,666,577,230
CROATIAN 316,928,672 728,650,346 1,361,129,515
ChineseT 363,401,223 790,161,390 1,321,295,490
HEBREW 300,918,111 748,869,522 1,260,833,512
SERBIAN 285,582,740 679,099,543 1,218,452,896
Chinese 317,795,368 706,205,585 1,205,332,261
CATALAN 274,061,869 468,498,516 964,881,957
SLOVAK 221,595,288 522,150,294 910,951,683
BULGARIAN 179,716,234 403,235,289 735,670,548
LATIN 236,923,977 228,414,965 701,559,746
SLOVENIAN 192,844,991 318,985,990 658,547,084
LITHUANIAN 153,997,785 325,959,450 586,431,636
UKRAINIAN 147,190,310 306,086,561 576,959,192
ALBANIAN 126,806,325 292,637,296 530,843,270
LATVIAN 110,076,820 284,247,190 492,956,121
ESTONIAN 107,684,809 257,102,186 441,159,792
GALICIAN 122,783,063 222,267,183 413,174,643
HINDI 100,144,675 237,366,680 410,405,979
BOSNIAN 82,725,643 170,657,852 336,884,587
TAMIL 63,337,522 149,605,986 261,001,034
GEORGIAN 67,323,770 151,769,031 258,893,994
ICELANDIC 58,779,458 121,762,169 223,594,013
BASQUE 70,805,720 113,421,567 217,033,427
TAGALOG 56,632,156 98,790,374 212,354,568
ARMENIAN 35,145,729 86,541,608 147,074,685
URDU 27,292,318 81,393,839 145,885,525
NORWEGIAN_N 37,081,693 72,097,929 138,440,038
BENGALI 34,246,855 61,951,573 131,278,090
AZERBAIJANI 33,696,525 58,673,140 131,042,668
AFRIKAANS 30,242,337 54,159,904 116,173,569
MONGOLIAN 25,616,930 62,619,843 107,520,953
SWAHILI 24,327,593 52,391,639 91,247,328
WELSH 23,932,322 40,186,666 87,543,649
SOMALI 20,633,818 43,929,624 81,849,218
BURMESE 18,911,637 46,694,348 80,055,830
ESPERANTO 24,162,475 42,917,975 78,914,761
MACEDONIAN 18,768,365 36,165,866 63,768,547
IRISH 16,827,065 28,201,390 60,390,813
FRISIAN 19,449,482 29,811,283 59,983,194
MARATHI 16,650,742 34,063,315 59,246,376
SINHALESE 12,747,858 32,637,833 56,462,372
SANSKRIT 11,624,233 13,411,901 46,844,573
WARAY_PHILIPPINES 13,194,221 25,139,977 45,747,693
TELUGU 14,225,578 25,020,984 44,710,886
KAZAKH 12,590,688 25,479,548 44,640,443
GUJARATI 10,973,045 27,066,607 43,886,069
JAVANESE 11,460,132 20,862,987 43,235,936
MALAYALAM 9,224,931 20,462,045 38,999,862
MALTESE 11,017,805 19,728,654 38,366,818
BELARUSIAN 11,352,295 21,358,337 37,962,691
NEPALI 8,835,408 20,167,955 34,837,644
OCCITAN 10,256,307 16,948,299 32,040,961
KANNADA 6,008,742 12,068,448 28,600,499
UZBEK 7,761,951 15,411,522 28,323,167
MALAGASY 7,729,772 13,618,366 24,807,825
KURDISH 7,022,299 13,569,508 24,707,962
KINYARWANDA 5,499,790 12,123,757 21,965,484
SUNDANESE 5,796,456 11,211,181 21,323,741
X_PIG_LATIN 4,499,844 12,948,606 21,140,731
BRETON 6,438,596 11,776,062 21,094,628
LUXEMBOURGISH 5,381,089 10,861,358 20,103,225
TATAR 4,879,638 11,371,707 18,595,881
CEBUANO 4,843,152 7,635,779 17,571,279
CORSICAN 4,608,694 8,128,011 15,785,012
KHMER 4,369,065 8,725,173 15,776,727
VOLAPUK 5,012,606 8,166,113 15,733,739
SCOTS 4,117,582 7,048,981 14,076,360
PASHTO 3,532,258 6,462,471 13,701,722
PUNJABI 2,951,249 7,032,168 13,419,918
SCOTS_GAELIC 3,533,904 5,563,327 12,664,666
ZULU 2,605,589 7,183,059 11,326,880
MAORI 2,840,226 2,797,603 11,278,285
HAITIAN_CREOLE 2,890,479 4,377,633 10,465,738
AMHARIC 2,207,741 5,575,966 10,195,315
GUARANI 2,933,467 5,241,671 9,698,104
FAROESE 2,295,231 4,398,177 9,097,236
YIDDISH 2,501,206 4,658,451 9,058,367
INTERLINGUA 2,173,527 4,503,699 8,952,794
INTERLINGUE 2,083,903 4,160,768 8,917,449
MANX 2,508,853 3,908,367 8,370,474
HMONG 1,765,436 4,295,345 7,860,904
YORUBA 2,175,707 3,980,305 7,785,787
QUECHUA 2,169,211 3,642,825 7,125,954
XHOSA 1,421,423 1,821,204 6,002,756
X_Inherited 1,347,331 2,342,758 5,470,161
TIGRINYA 1,358,925 2,810,672 5,338,706
WOLOF 1,482,639 2,338,403 4,918,829
NYANJA 1,841,031 2,527,580 4,870,942
OROMO 995,928 2,367,024 4,717,752
TAJIK 1,436,856 2,626,026 4,640,416
LINGALA 1,847,238 1,427,856 4,497,394
LAOTHIAN 1,125,083 1,875,476 4,341,474
SYRIAC 1,035,611 1,720,316 4,189,837
RHAETO_ROMANCE 1,021,279 2,006,881 4,184,265
TURKMEN 1,052,116 2,092,793 4,072,380
DHIVEHI 993,251 1,982,284 4,056,358
SAMOAN 1,102,124 2,197,801 4,050,670
KYRGYZ 1,072,226 2,266,320 3,863,021
HAWAIIAN 1,086,648 1,572,114 3,765,937
BISLAMA 863,895 1,265,793 3,493,457
SHONA 743,142 1,446,468 3,301,851
KHASI 839,609 1,589,877 3,045,983
HAUSA 665,354 1,308,644 2,863,878
UIGHUR 637,088 1,495,763 2,704,179
ORIYA 703,281 1,174,859 2,614,934
AFAR 717,168 1,393,736 2,595,820
RUNDI 489,116 1,116,006 2,302,872
BASHKIR 675,280 1,342,257 2,272,566
FIJIAN 479,524 864,224 2,148,683
ASSAMESE 608,311 1,154,094 2,124,276
TIBETAN 460,126 1,133,416 1,931,755
TONGA 657,653 982,036 1,865,090
MAURITIAN_CREOLE 600,557 912,447 1,751,372
X_KLINGON 470,153 794,508 1,644,442
SISWANT 384,381 859,781 1,637,067
DZONGKHA 308,242 703,081 1,555,403
BIHARI 444,950 757,440 1,518,590
SESELWA 364,897 629,434 1,502,753
GREENLANDIC 366,761 688,240 1,488,818
TSWANA 467,311 668,936 1,382,196
X_Coptic 258,132 165,055 1,309,616
X_Nko 208,081 476,882 1,238,082
TSONGA 299,096 714,360 1,227,968
SESOTHO 347,874 513,630 1,219,544
GANDA 254,317 453,709 1,115,844
SINDHI 257,203 419,481 1,012,752
AYMARA 271,265 551,513 967,027
INUKTITUT 217,161 356,068 771,216
IGBO 240,115 306,835 767,040
AKAN 191,696 364,848 740,564
SANGO 128,484 340,026 707,073
NAURU 160,588 311,511 600,695
CHEROKEE 165,970 343,437 575,695
PEDI 124,824 271,650 518,495
ZHUANG 156,777 145,582 391,411
INUPIAK 87,988 156,547 326,705
X_Samaritan 104,462 169,288 324,546
VENDA 69,312 91,228 276,483
ABKHAZIAN 60,591 106,647 228,396
X_Gothic 43,287 74,845 151,736
X_Tifinagh 12,172 48,538 79,492
X_Yi 9,496 25,274 37,509
KASHMIRI 6,306 6,767 26,029
X_Vai 6,171 3,480 25,191
X_Syloti_Nagri 3,836 9,637 18,339
X_Shavian 5,042 5,042 12,404
X_Bopomofo 1,147 4,054 8,600
X_Deseret 520 818 7,438
X_Javanese 3,589 3,630 6,331
X_Buginese 1,589 2,880 5,455
NDEBELE 1,509 1,500 5,445
LIMBU 160 2,526 5,098
X_Egyptian_Hieroglyphs 1,230 1,230 4,878
X_Old_Turkic 1,849 1,820 1,820
X_Tai_Tham 461 598 1,752
X_Glagolitic 246 127 1,093
X_Rejang 179 140 1,017
X_Saurashtra 134 176 996
X_Meetei_Mayek 180 90 720
foxik commented 7 years ago

@mjluot Thank you very much for all the numbers.

For UD 1.3 treebanks with a 10k+ token test set, the URL_dedup_tokens figures are:

Language URL_dedup_tokens
en 1T
en_esl 1T
es_ancora 16G
de 8G
pt_br 6.6G
id 4.4G
it 3.6G
ru_syntagrus 1.5G
ru 1.5G
fa 1.5G
ar 1.4G
sv 1.3G
ro 993M
grc_proiel 832M
grc 832M
ja_ktc 674M
no 655M
cs_cac 543M
cs 543M
fi_ftb 402M
zh 317M (with bad tokenization)
he 300M
ca 275M
la_proiel 236M
sl 192M
bg 180M
gl 122M
et 107M
hi 100M
eu 70M

Therefore, for 20 out of the 30 treebanks we have 500M+ tokens, and less for the remaining 10. Maybe we can try gathering more for the <500M treebanks using the rest of the crawl sets, which would get us (approximately, just by scaling by 20/6) to 26/30 languages with 500M+ tokens, with the rest at circa 400M, 350M, 300M and 200M.

All in all, I think this (CommonCrawl+CLD2/CLD3?) can provide "enough" data for our purposes :-)

martinpopel commented 7 years ago

Great. Hopefully, in UD 2.0 there will be more treebanks with a 10k+ test set, so perhaps we should send an email to the UD mailing list with a call for "possibly 10k+ treebanks" before we restrict the set of crawled languages (if such a restriction is needed at all).

fginter commented 7 years ago

Ha ha, I hope we are not dropping Finnish because our test set happens to be 9,140 tokens at this moment. :D There are a number of treebanks that can be re-split to fit the 10K goal. Now, is this something we want to do officially for the UD release, or something we do for CoNLL only? I think any treebank with, say, 20K words or above should qualify here.

dan-zeman commented 7 years ago

UD releases do not have any restriction on size and I hope they never will. So the 10K+ test set is for CoNLL only. However, if any treebank requires resplitting in order to fit in CoNLL, the same split should be used in the subsequent UD releases. We could justify the resplitting also by switching to v2 now, but in general I believe we want to avoid resplitting between UD releases as much as possible.

foxik commented 7 years ago

As for the 10k+ rule, I am using it just because it was listed in the Berlin notes; personally, I am not convinced it is the right one (but I was not there, so I cannot object, as I do not know the reasons behind it).

I do not think we need the list of treebanks that will be in CoNLL in advance; I believe we can gather additional resources reasonably fast.

jnivre commented 7 years ago

I completely agree with Dan. The coincidence of CoNLL and v2 provides a unique opportunity for resplitting. Ideally, no resplitting should occur after (or before) that. New treebanks that are created afterwards should adhere to the 10K rule.

fginter commented 7 years ago

:+1:

dan-zeman commented 7 years ago

I think this issue can be closed now (which is not to say that there is no work to do :)). To summarize, we promise to provide 500M+ words for most languages, and as much as possible for the others, together with pre-computed embeddings. We also say that there will be a "call for resources" so that other freely available corpora can be made part of the task if there is demand.

Timilehin commented 6 years ago

Hello, I am currently working on a side project that needs Yoruba sentences; n-grams would also be helpful. I tried downloading the Yoruba n-grams and words from https://ufal.mff.cuni.cz/~majlis/w2c/download.html. I used the unarchiver to unzip them, but I always get an error that the file is corrupt. Am I doing something wrong? If so, please describe how you access the data. The link to my project is here -> https://github.com/Timilehin/Yoruba-Intonator/ Thanks!

martinpopel commented 6 years ago

I tried downloading wiki.yor.txt.gz from https://ufal.mff.cuni.cz/~majlis/w2c/download.html and decompressing it with gunzip. It gives me the message "gzip: wiki.yor.txt.gz: decompression OK, trailing garbage ignored", but the content seems to be OK. Then I tried web.yor.txt.gz, and there most of the sentences look strange (with words like òåõíîëîãèÿäà, which looks like an encoding problem). Note that W2C was crawled from the web/Wikipedia and the languages were detected automatically, so the data contains a lot of noise. W2C is also available for download from the LINDAT repository.