ufal / udpipe

UDPipe: Trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files

issue with lemma which has too many word forms #130

Closed jwijffels closed 4 years ago

jwijffels commented 4 years ago

Today I was building a model on a Dutch corpus from the 17th-19th century, available at https://ivdnt.org/taalmaterialen/2282-pp-brievenalsbuit-j.

I trained it to build a lemmatiser using the following parameters:

models=1;iterations=1;templates_1=lemmatizer;guesser_suffix_rules_1=6;guesser_enrich_dictionary_1=4;guesser_prefixes_max_1=4;use_lemma_1=1;use_xpostag_1=0;use_feats_1=0;provide_lemma_1=1;provide_xpostag_1=0;provide_feats_1=0;prune_features_1=0

That failed with the following training_failure.

Message was: Should encode value 309 in one byte!

After some debugging and inspection, it appeared that my corpus contained a lemma with 309 distinct word forms. This resulted in a training failure in the following MorphoDiTa call:

dictionary<LemmaAddinfo>::encode(binary_encoder& enc) {
...
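// Note: add_1B writes exactly one byte, so a form count above 255
// (here 309) cannot be encoded and training aborts.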
enc.add_1B(lemma.forms.size());
}

It's probably unexpected behaviour that a lemma has so many word forms. Just putting this note here in case someone stumbles on the same issue.
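In case it helps someone who hits the same error: a small standalone scanner like the sketch below (not part of UDPipe; the 255 threshold and the way forms are counted per lemma are assumptions based on the error above, and only approximate what the UDPipe/MorphoDiTa dictionary actually stores) can report which lemmas in a CoNLL-U training file have too many distinct forms.

#include <fstream>
#include <iostream>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Count distinct word forms per lemma in a CoNLL-U file and print the
// lemmas whose form count would not fit into one byte.
int main(int argc, char* argv[]) {
  if (argc < 2) {
    std::cerr << "usage: " << argv[0] << " train.conllu" << std::endl;
    return 1;
  }
  std::ifstream in(argv[1]);
  std::map<std::string, std::set<std::string>> forms_per_lemma;
  std::string line;
  while (std::getline(in, line)) {
    if (line.empty() || line[0] == '#') continue;  // skip blank lines and comments
    std::istringstream fields(line);
    std::string id, form, lemma;
    std::getline(fields, id, '\t');
    std::getline(fields, form, '\t');
    std::getline(fields, lemma, '\t');
    // Skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
    if (id.find('-') != std::string::npos || id.find('.') != std::string::npos) continue;
    forms_per_lemma[lemma].insert(form);
  }
  for (const auto& entry : forms_per_lemma)
    if (entry.second.size() > 255)  // the one-byte limit discussed in this issue
      std::cout << entry.first << '\t' << entry.second.size() << std::endl;
  return 0;
}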

foxik commented 4 years ago

Yeah, there are several unfortunate limits like this in UDPipe -- I originally implemented the tagger for Czech, and some hyperparameters are not general enough for other languages. Another unfortunate limit is the maximum length of a lemma (255 bytes), and another is the limit on the number of distinct tags (65535). But the limit you reported (unique forms of a single lemma) is the most problematic -- for example in German, where there are compound words whose lemma is only one of the components, we hit it quite frequently.

In the new rewrite there will be no such limits (I plan to use variable-length ints all the time), but there is no real time frame for it, as is obvious from the last 1-2 years... :-(
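For illustration, a LEB128-style variable-length encoding is one common way to implement such variable-length ints: small counts still cost one byte, while larger counts simply use more bytes. The sketch below shows the general technique only; it is not the actual UDPipe/MorphoDiTa encoder.

#include <cassert>
#include <cstdint>
#include <vector>

// Append 'value' as a variable-length integer: 7 data bits per byte,
// with the high bit set on every byte except the last.
void add_varint(std::vector<uint8_t>& out, uint64_t value) {
  while (value >= 0x80) {
    out.push_back(static_cast<uint8_t>(value) | 0x80);
    value >>= 7;
  }
  out.push_back(static_cast<uint8_t>(value));
}

// Read one variable-length integer starting at *p and advance p past it.
uint64_t read_varint(const uint8_t*& p) {
  uint64_t value = 0;
  for (int shift = 0; ; shift += 7) {
    uint8_t byte = *p++;
    value |= uint64_t(byte & 0x7F) << shift;
    if (!(byte & 0x80)) return value;
  }
}

int main() {
  std::vector<uint8_t> encoded;
  add_varint(encoded, 309);         // the form count that broke the one-byte field
  const uint8_t* p = encoded.data();
  assert(read_varint(p) == 309);    // round-trips in two bytes instead of failing
  return 0;
}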

DanielCoutoVale commented 4 years ago

@foxik, the issue related to German compounds can be solved by a tokenizer. For instance, the wordings "meinen Vertrag" (my contract) and "mein Vertragsende" (the end of my contract) could be tokenized as [meinen, Vertrag] and [mein, Vertrags, ende]. This would have two effects: 1. on the one hand, the number of word forms for the same lemma can be drastically reduced; 2. on the other hand, the parser can create more reasonable dependency structures such as [[meinen 2], [Vertrag 0]] and [[mein 2], [Vertrags 3], [ende 0]].
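To make the two tokenizations concrete, a hypothetical CoNLL-U fragment might look as follows (columns are tab-separated; the heads follow the bracketing above, while the lemmas, POS tags and relation labels are illustrative guesses, not HDT annotation):

# coarse tokenization: "meinen Vertrag"
1	meinen	mein	DET	_	_	2	det	_	_
2	Vertrag	Vertrag	NOUN	_	_	0	root	_	_

# finer tokenization: "mein Vertragsende" split into three tokens
1	mein	mein	DET	_	_	2	det	_	_
2	Vertrags	Vertrag	NOUN	_	_	3	compound	_	_
3	ende	Ende	NOUN	_	_	0	root	_	_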

I think I am also facing a similar issue with the forms-per-lemma limit for Latin and Ancient Greek. I guess I could create a different "lemma" for each verb aspect, equating the lemma to the stem (dic o, dix i, dict u). This would divide each current Latin verb lemma into three subgroups and each Ancient Greek one into four, thus reducing the number of forms per lemma. Maybe that is the way to proceed.

foxik commented 4 years ago

The more fine-grained tokenization would be a workaround, but we would need training data with the same tokenization -- and since German HDT uses coarse tokenization, we need to keep it.

The way to proceed is to rewrite the internals not to have any limits on the number of forms for a lemma -- it is there for technical reasons only, but the new version will be without it...

foxik commented 4 years ago

BTW, there is an undocumented tagger option dictionary_flat_lemmas, which is set to

dictionary_flat_lemmas=-,Aktie,Anbieter,Angebot,Bereich,Chef,Chip,Dienst,Firma,Funktion,Gerät,Geschäft,Gruppe,Hersteller,Karte,Konzern

for German_HDT. It allows ignoring lemmas with a lot of forms (usually because of compounds), so it can be used in an emergency; but a better UDPipe is definitely the way to go...
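A hedged usage sketch (assumed, not stated above): since it is a tagger option, it would presumably be appended to the same semicolon-separated tagger option string as the lemmatizer settings earlier in this thread, for example:

models=1;iterations=1;templates_1=lemmatizer;...;dictionary_flat_lemmas=-,Aktie,Anbieter,Angebot

It is not stated above whether a per-model suffix such as _1 is needed, so treat the exact spelling as an assumption.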

DanielCoutoVale commented 3 years ago

@foxik, you said: «Since German HDT uses coarse tokenization, we need to keep it. The way to proceed is to rewrite the internals not to have any limits on the number of forms for a lemma.»

From a linguist's perspective and from a computer scientist's perspective, an approach whereby compounds of lexical items are treated as if they were a single lexical item is not easy to defend. Lexical items are the most delicate variation in language; they are usually the way we name the phenomena around us. So if we allow combinations of lexical items to be enumerated as a single item in a list, rather than as two items in the same list, we will end up with a list of grammatical features that is as large as the list of lexical items in the language. This is horrible from a linguist's perspective because we will have "DuckTales", "DogTales", "PigTales", "ChickenTales"... as lexical items, even though only "DuckTales" is actually used as a name for something in English; the others are just compounds whose meaning can be composed by combining the meanings of two items. From a computer scientist's perspective, this is horrible because we will have to catalogue lexical items as if they were grammatical features, leading to a model that is unnecessarily large. We are talking about the difference between a few hundred grammatical features in a well-thought-out model versus millions of grammatical features in a poorly thought-out one.

A better version of UDPipe is necessary, allowing people to have more than 256 grammatical features (1 byte), but this does not mean that we need millions of grammatical features, or that we need or should accommodate compounds as if they were themselves lexical items. That is exactly what a compound is not, and that is exactly why we call them compounds.

foxik commented 3 years ago

@DanielCoutoVale I agree with you completely, sorry if I did not make myself clear :-)

However, I only have the HDT data, which are tokenized this way -- if someone can modify it to contain a different tokenization and suitably change the POS tags, morphological features, lemmas and labeled dependency trees, I will gladly train a different model -- but I do not expect it, as it would probably require a lot of manual work. That is what I mean by "we need to keep it" (we do not have gold fine-grained tokenization, and even if we did, we would also need the rest of the morphosyntactic features on it).

Cheers, Milan