ufal / udpipe

UDPipe: Trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files
Mozilla Public License 2.0

Tagging of words that end in a digit, e.g. Boeing777 #101

Open anatoleg opened 5 years ago

anatoleg commented 5 years ago

The tagger treats words that end in a digit as numbers and assigns them the UPOS NUM. That causes incorrect tagging of other words in the phrase and incorrect parsing, especially in languages with cases, such as Russian. Is there any way to fix this and make the tagger tag such words as NOUN? Just fixing the tagger's output for this one word does not change the incorrect case features on the other words.

foxik commented 5 years ago

Overriding some tags is unfortunately not easy currently. One possibility is to add such words to training data, but that is usually infeasible. The other possibility is to explicitly allow some list of UPOSes for every input word -- you could allow only NOUN UPOS for Boeing777, but that is currently not implemented in UDPipe.
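To make the second possibility concrete, here is a minimal sketch of restricting a word to an allowed set of UPOS tags. It is not UDPipe code: `pick_analysis`, the candidate list, and the scores are all hypothetical, made up purely for illustration.

```python
def pick_analysis(candidates, allowed_upos=None):
    """Pick the best-scoring (upos, score) pair, optionally restricted.

    candidates   -- list of (upos, score) pairs from some tagger (hypothetical)
    allowed_upos -- set of UPOS tags permitted for this word, or None
    """
    if allowed_upos:
        permitted = [c for c in candidates if c[0] in allowed_upos]
        # If nothing survives the filter, fall back to the unrestricted list.
        if permitted:
            candidates = permitted
    return max(candidates, key=lambda c: c[1])


# Made-up scores for "Boeing777"; a real tagger would supply its own.
scored = [("NUM", 0.9), ("NOUN", 0.1)]
unrestricted = pick_analysis(scored)
restricted = pick_analysis(scored, allowed_upos={"NOUN"})
```

Filtering one word after the fact, as above, would not repair the neighbouring words. For that, the restriction would have to be applied while the tagger decodes the whole sentence, which is why it needs support inside the tagger itself.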

A very hacky solution which you can do currently is to modify the input (e.g., replace Boeing777 with Boeing or Boeing###).
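A minimal sketch of that workaround, assuming the input is already tokenized and the original forms are put back after tagging and parsing. The helper names `mask_trailing_digits` and `restore_forms` are hypothetical, and the call into UDPipe itself is left out.

```python
import re

# A letter immediately followed by one or more trailing digits, e.g. "Boeing777".
# [^\W\d_] matches any Unicode letter, so Cyrillic forms are covered too.
_TRAILING_DIGITS = re.compile(r"^(?P<stem>.*[^\W\d_])(?P<digits>\d+)$")


def mask_trailing_digits(tokens, placeholder="#"):
    """Replace trailing digits with `placeholder` (use "" to drop them).

    Returns the masked tokens and a {index: original form} map.
    """
    masked, originals = [], {}
    for index, token in enumerate(tokens):
        match = _TRAILING_DIGITS.match(token)
        if match:
            digits = match.group("digits")
            masked.append(match.group("stem") + placeholder * len(digits))
            originals[index] = token
        else:
            masked.append(token)
    return masked, originals


def restore_forms(tokens, originals):
    """Put the original word forms back after tagging and parsing."""
    return [originals.get(index, token) for index, token in enumerate(tokens)]


tokens = ["Boeing777", "departs", "from", "Prague", "at", "777"]
masked, originals = mask_trailing_digits(tokens)
# masked is what would be sent to the tagger; pure numbers such as "777" are untouched.
restored = restore_forms(masked, originals)
```

Keeping the index map separate means the tagger only ever sees the masked forms, while the final output can still carry the original spelling.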

BTW, we are preparing UDPipe 2.0 with considerably better results, which a) should solve this kind of problem automatically (the current tagger guesses unknown words from prefixes and suffixes -- concentrating on the 777 at the end; the new one will consider the whole word), b) will allow specifying possible analyses for every input word.
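A toy illustration of why suffix-based guessing latches onto the trailing 777. This is not the actual UDPipe guesser, only a longest-matching-suffix lookup over a tiny made-up word list; `build_suffix_table` and `guess_upos` are hypothetical names.

```python
from collections import Counter, defaultdict


def build_suffix_table(tagged_words, max_suffix_len=3):
    """Count which UPOS tags were seen with each word-final suffix."""
    table = defaultdict(Counter)
    for form, upos in tagged_words:
        for length in range(1, min(max_suffix_len, len(form)) + 1):
            table[form[-length:]][upos] += 1
    return table


def guess_upos(form, table, max_suffix_len=3):
    """Guess the UPOS of an unknown word from its longest known suffix."""
    for length in range(min(max_suffix_len, len(form)), 0, -1):
        counts = table.get(form[-length:])
        if counts:
            return counts.most_common(1)[0][0]
    return None  # no suffix of the word was ever seen


# Tiny made-up "training data", for illustration only.
toy_training = [
    ("777", "NUM"),
    ("1977", "NUM"),
    ("airplane", "NOUN"),
    ("Boeing", "PROPN"),
]
table = build_suffix_table(toy_training)
guess = guess_upos("Boeing777", table)
```

In this toy, the only evidence consulted for Boeing777 is its last three characters, which were seen only in numbers, so the letters at the front never get a say.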

anatoleg commented 5 years ago

Thank you very much for the response. We are using a similar hack and are eagerly waiting for the next release when we will be able to discard it. The (b) point in your response is particularly intriguing since it can potentially fix a number of current problems. For example, “departs” in “airplane departs from Prague” is tagged as NOUN, which, needless to say, causes wrong parses. If we could specify for the tagger that “departs” is a VERB, it should solve this problem. This facility should be extended to the features as well as the UPOS. For example, in Russian, the word “сбит” (shot down) in “самолет сбит ракетой” (airplane shot down by a missile) is correctly tagged as VERB but in the wrong voice: active instead of passive. This makes the airplane “nsubj” instead of an object during parsing. In general, a statistical system will inevitably make mistakes, and a facility to correct them without creating a new training set would be very welcome. When can we expect UDPipe 2.0?

foxik commented 5 years ago

As for the release, I unfortunately cannot make any promises -- I am teaching a lot and doing research, so the software work is currently not high-priority for me. I hope to have an inference-only prototype in summer, but without changing the API. Then I also want to support training and change the API to support the b) point -- but we are talking about Q4 of 2019.

AleksandrsBerdicevskis commented 4 years ago

Just a comment on this issue: such words are being tagged as NUM even if I am using my own model, trained on a not-exactly-UD input that does not have the NUM tag at all.

foxik commented 4 years ago

Yeah, the NUM is hardcoded, together with PUNCT and SYM: https://github.com/ufal/udpipe/blob/31f0b8c77a7a68017e60726ed385227db673315b/src/trainer/trainer_morphodita_parsito.cpp#L629-L631

Should be improved with the (still not released) next version...