ufal / udpipe

UDPipe: Trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files
Mozilla Public License 2.0

Method to plug in a morphological guesser #34

Closed nittti closed 6 years ago

nittti commented 7 years ago

UDPipe 1.1 made it possible to use a morphological dictionary, which dramatically improves accuracy. To improve accuracy further, is there an "official" way to plug in a custom rule-based morphological guesser, one that itself uses the dictionary internally to provide possible interpretations at runtime? Both external and compiled-in solutions are of interest.

foxik commented 7 years ago

There is currently no "official" way to plug in a guesser. Internally, a rule-based guesser is computed automatically, even if you provide a dictionary. It uses suffixes of length 1-4 and optionally a fixed set of prefixes (at most guesser_prefixes_max of them, each occurring at least guesser_prefix_min_count times), and it is trained on the UD training data to return at most guesser_suffix_rules most frequent analyses.
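To make the suffix idea concrete, here is a minimal Python sketch of a suffix-based guesser. It is not UDPipe's actual implementation: it ignores prefixes and lemma generation, and the names `train_suffix_guesser`, `guess`, `max_suffix_len` and `max_rules` are hypothetical (`max_rules` only loosely plays the role of guesser_suffix_rules).

```python
from collections import Counter, defaultdict


def train_suffix_guesser(tagged_words, max_suffix_len=4, max_rules=8):
    """Collect, for every suffix of length 1..max_suffix_len, its most frequent tags.

    Simplified illustration only; not the guesser UDPipe builds internally.
    """
    counts = defaultdict(Counter)
    for form, tag in tagged_words:
        for n in range(1, min(max_suffix_len, len(form)) + 1):
            counts[form[-n:]][tag] += 1
    # Keep at most max_rules analyses per suffix, most frequent first.
    return {
        suffix: [tag for tag, _ in tag_counts.most_common(max_rules)]
        for suffix, tag_counts in counts.items()
    }


def guess(rules, form, max_suffix_len=4):
    """Return the analyses of the longest known suffix, or [] if none matches."""
    for n in range(min(max_suffix_len, len(form)), 0, -1):
        analyses = rules.get(form[-n:])
        if analyses:
            return analyses
    return []


# Toy training data standing in for a UD treebank (hypothetical).
training = [
    ("walking", "VERB"), ("talking", "VERB"), ("building", "NOUN"),
    ("quickly", "ADV"), ("slowly", "ADV"),
]
rules = train_suffix_guesser(training)
print(guess(rules, "jumping"))  # analyses found via the suffix "ing"
print(guess(rules, "softly"))   # analyses found via the suffix "ly"
```

The guesser only proposes candidate analyses for unknown forms; choosing among them is still left to the tagger.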

One way to specify a custom guesser would be to supply the rules for the guesser described above, but that would not allow it to use the dictionary internally. As for an external solution, I imagine the UDPipe model could have no morphological analyzer at all, with the analyses passed in on input; this is how MorphoDiTa (which is used internally to perform the tagging and lemmatization) does it.

Frankly speaking, I am not going to add this for the time being, because I have other issues I consider more pressing. But you are welcome to implement this, of course :-)

arademaker commented 7 years ago

Hi,

I am getting poor performance on a corpus from a technical domain that contains many domain-specific names and MWEs. Is it possible to supply a dictionary of names and MWEs to improve the tagger and the parser?

Can the tagger options described in http://ufal.mff.cuni.cz/udpipe/users-manual#model_training_tagger also be used at parsing time, or only at training time?

foxik commented 7 years ago

Currently the options can be used only at training time. You are right that in theory the morphological dictionary could be extended at parsing time, but the current implementation does not allow that.

arademaker commented 7 years ago

Hi @foxik, is the option you gave in https://github.com/ufal/udpipe/issues/7#issuecomment-243210062 still possible? I mean the two-step call, rewriting the POS tags before parsing.

foxik commented 7 years ago

The mentioned comment was just a discussion of ideas; it does not work automatically. The only way it would currently work is to tag the sentence (either through the API or with the udpipe binary using --tag), manually fix the problematic tags and lemmas, and then parse the sentence.
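The middle step of that workflow, fixing tags and lemmas in the tagger's output, could be scripted rather than done by hand. The following is a minimal Python sketch under stated assumptions: the tagged sentence is available as CoNLL-U text, and the fixes are a simple form-to-(lemma, UPOS) mapping. The function `fix_conllu` and the `overrides` dictionary are hypothetical helpers, not part of UDPipe; tagging and parsing themselves would still be run separately with UDPipe before and after this step.

```python
def fix_conllu(conllu_text, overrides):
    """Rewrite LEMMA and UPOS for word forms listed in `overrides`.

    `overrides` maps a word form to a (lemma, upos) pair.
    CoNLL-U token lines have 10 tab-separated columns:
    ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
    """
    fixed_lines = []
    for line in conllu_text.split("\n"):
        columns = line.split("\t")
        is_word_line = (
            not line.startswith("#")     # skip comment lines
            and len(columns) == 10       # skip blank or malformed lines
            and columns[0].isdigit()     # skip multiword ranges and empty nodes
        )
        if is_word_line and columns[1] in overrides:
            columns[2], columns[3] = overrides[columns[1]]
            line = "\t".join(columns)
        fixed_lines.append(line)
    return "\n".join(fixed_lines)


# Hypothetical tagger output in which a domain name was tagged as a common noun.
tagged = (
    "# text = UDPipe parses text\n"
    "1\tUDPipe\tudpipe\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "2\tparses\tparse\tVERB\t_\t_\t_\t_\t_\t_\n"
    "3\ttext\ttext\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "\n"
)
fixed = fix_conllu(tagged, {"UDPipe": ("UDPipe", "PROPN")})
print(fixed)
```

The corrected CoNLL-U can then be passed to the parser alone, so that the parser sees the fixed tags and lemmas instead of the tagger's original output.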

foxik commented 6 years ago

Closing. I doubt we will have an official API for morphological guesser any time soon; the possibility of passing a morphological dictionary at inference time is tracked by #50.