ufal / morphodita

MorphoDiTa: Morphologic Dictionary and Tagger
Mozilla Public License 2.0

MorphoDiTa returns empty lemma #10

Closed Karryanna closed 4 years ago

Karryanna commented 5 years ago

I've found out that morphological analysis may return an empty lemma for some forms. Is this a weird feature, or is it a bug? For me, it was completely unexpected, and I could not find any note on this even a posteriori.

An example set of forms that demonstrates this, using Czech Morfflex PDT from 15th Nov 2016, is ['Řekni', 'I.', 'ovi', '.']. The resulting lemmata are ['říci', 'i.', '', '.']. All tokens have a reasonable tag, though.

I use the Python bindings, with MorphoDiTa installed via pip in version 1.9.2.1. When using MorphoDiTa as a Lindat service, it gets more interesting. Given 'Řekni I. ovi.' as plain text, it splits 'I.' into two tokens, but 'ovi' again gets no lemma. Giving the same input in vertical format yields a lemma even for 'ovi', though. However, if the sentence is changed to 'Daří se I. ovi.' and provided in vertical format, the empty lemma problem arises again.

(Actually, I would agree that the text does not seem to be tokenized correctly and that the middle two forms should be joined. However, I have hundreds of similar sentences in my corpus (ČNK syn v4, if interested), and as the corpus comes pre-tokenized, it seems better to go with the existing tokenization, no matter how imperfect it might be. Anyway, as I've already said, the behaviour was completely unexpected for me and introduced some bugs into my data.)
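For anyone hitting the same problem on pre-tokenized data, here is a minimal sketch of a guard, assuming the tagger output has already been copied into plain Python lists of forms and lemmas. The function name `fill_empty_lemmas` and the fallback to the lowercased form are illustrative assumptions of this sketch, not anything MorphoDiTa does itself.

```python
def fill_empty_lemmas(forms, lemmas):
    """Return (patched_lemmas, empty_indices).

    Hypothetical helper: every empty lemma is replaced by the lowercased
    form, and the positions of the affected tokens are reported so they
    can be inspected later.
    """
    if len(forms) != len(lemmas):
        raise ValueError("forms and lemmas must have the same length")

    patched = []
    empty_indices = []
    for index, (form, lemma) in enumerate(zip(forms, lemmas)):
        if lemma == "":
            # Assumed fallback: use the lowercased surface form as the lemma.
            empty_indices.append(index)
            patched.append(form.lower())
        else:
            patched.append(lemma)
    return patched, empty_indices


# The forms and lemmata reported above.
forms = ['Řekni', 'I.', 'ovi', '.']
lemmas = ['říci', 'i.', '', '.']

patched, empty_indices = fill_empty_lemmas(forms, lemmas)
print(empty_indices)
print(patched)
```

Reporting the indices alongside the patched list makes it possible to log the affected tokens instead of silently rewriting them.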

foxik commented 4 years ago

This was fixed in 2dd99cd5bd in 2017, but there has not been an official release since, which is why the official binaries and the service still suffer from it. A release is coming, though (it will happen next week), so closing.