ufal / udpipe

UDPipe: Trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files
Mozilla Public License 2.0

OOV Word Forms #90

Open vladob54 opened 5 years ago

vladob54 commented 5 years ago

I would really appreciate it if UDPipe somehow indicated that a given word form was not present in the morphological lexicon, i.e., that its lemma, PoS and features have been guessed. This type of information is provided, e.g., by TreeTagger, and we make use of it while post-processing the tagger output. We also provide it to corpus users so that they can incorporate the respective attribute into their CQL queries...

Best, Vlado B, 10:45

http://unesco.uniba.sk/guest/

arademaker commented 5 years ago

Indeed, nice suggestion.

foxik commented 5 years ago

Currently it is not straightforward to implement this, because the current UDPipe does not distinguish between a "real" morphological lexicon and guesser rules derived from the training data. (Our MorphoDiTa tool can do it; there we keep this distinction.)

BTW, if you have a morphological dictionary, you can perform the required operation manually after running UDPipe.
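A minimal sketch of such a post-processing step, assuming the UDPipe output is CoNLL-U text and the dictionary is available as a plain collection of word forms. The function name `mark_oov`, the `OOV=Yes` attribute in the MISC column, and the case-insensitive lookup are illustrative assumptions, not part of UDPipe:

```python
def mark_oov(conllu_text, lexicon_forms, attr="OOV=Yes"):
    """Append `attr` to MISC for every token whose form is not in the lexicon.

    `attr` is a hypothetical attribute name chosen for this sketch.
    """
    # Assumption: forms are compared case-insensitively.
    known = {form.lower() for form in lexicon_forms}
    out = []
    for line in conllu_text.split("\n"):
        cols = line.split("\t")
        # Pass through comments, blank lines, and non-token lines unchanged.
        if line.startswith("#") or len(cols) != 10:
            out.append(line)
            continue
        # Leave multiword-token ranges ("1-2") and empty nodes ("1.1") alone.
        if "-" in cols[0] or "." in cols[0]:
            out.append(line)
            continue
        if cols[1].lower() not in known:
            # MISC is the 10th column; "_" means it is empty.
            cols[9] = attr if cols[9] == "_" else cols[9] + "|" + attr
        out.append("\t".join(cols))
    return "\n".join(out)


# Toy example: "blorf" is missing from the (made-up) lexicon.
sample = "\n".join([
    "# text = The blorf sleeps",
    "1\tThe\tthe\tDET\t_\t_\t3\tdet\t_\t_",
    "2\tblorf\tblorf\tNOUN\t_\t_\t3\tnsubj\t_\t_",
    "3\tsleeps\tsleep\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "",
])
print(mark_oov(sample, {"the", "sleeps"}))
```

Note that this only tells you whether a form is in your own dictionary; it cannot tell you whether UDPipe itself had seen the form in its training data.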

Also, the future UDPipe 2.0 will allow explicitly passing a morphological dictionary (during inference, not just during training), so it will then be possible to indicate which words were processed just by a "guesser".

Leaving the issue open as a reminder.

ftyers commented 5 years ago

This is relevant to #50 too.