ufal / udpipe

UDPipe: Trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files
Mozilla Public License 2.0

Morphological dictionary and multi-word tokens #99

Open jeanm opened 5 years ago

jeanm commented 5 years ago

(First of all, congrats on UDPipe, it's a pleasure to use!)

I've built a morphological generator for an endangered language, and I'm having it save its output in the tab-separated FORM,LEMMA,UPOS,XPOS,FEATS format so that I can also use it with UDPipe.

Is there any way of supporting multi-word tokens in that format, such that UDPipe will take them into account? For example, in French, I would like a way to specify that au should be split and tagged like so:

1-2 au _ _ _ _
1 à à ADP ADP _
2 le le DET DET Definite=Def|Gender=Masc|Number=Sing|PronType=Art

If not, I will probably need to extend this format on my own. Do you have any suggestions for a way to do this which could be backwards compatible with the format used by UDPipe?

I was thinking of something like the following:

au   _   _   _   _   SplitForm=à/le|SplitLemma=à/le|SplitUPos=ADP/DET|SplitFeats=_/Definite=Def,Gender=Masc,Number=Sing,PronType=Art

It does seem awfully verbose though...
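To make the proposal concrete, here is a minimal sketch of how a consumer could expand such a line back into the CoNLL-U-style rows shown earlier. It is not part of UDPipe: the function name `expand_split_entry` is hypothetical, the Split* attributes are assumed to sit in the last tab-separated column, and a `SplitXPos` key (not in the proposal above) is read if present and defaults to `_` otherwise.

```python
def expand_split_entry(line):
    """Expand one dictionary line in the proposed Split* format into rows.

    Returns (ID, FORM, LEMMA, UPOS, XPOS, FEATS) tuples: first the
    multi-word token range, then one row per component word.
    """
    fields = line.rstrip("\n").split("\t")
    form, split_column = fields[0], fields[-1]

    # Parse "Key=a/b|Key2=c/d" into {"Key": ["a", "b"], ...}.
    # partition() splits on the first "=" only, so feature strings such as
    # "Definite=Def,Gender=Masc" survive inside the value.
    attrs = {}
    for item in split_column.split("|"):
        key, _, value = item.partition("=")
        attrs[key] = value.split("/")

    forms = attrs["SplitForm"]
    n = len(forms)
    blank = ["_"] * n
    lemmas = attrs.get("SplitLemma", blank)
    upos = attrs.get("SplitUPos", blank)
    xpos = attrs.get("SplitXPos", blank)  # assumed key, not in the proposal
    # Assumption: commas only separate features here, so they can be mapped
    # back to "|". Multi-valued features would need a different escape.
    feats = [f.replace(",", "|") for f in attrs.get("SplitFeats", blank)]

    rows = [(f"1-{n}", form, "_", "_", "_", "_")]
    for i in range(n):
        rows.append((str(i + 1), forms[i], lemmas[i], upos[i], xpos[i], feats[i]))
    return rows


if __name__ == "__main__":
    entry = ("au\t_\t_\t_\t_\tSplitForm=à/le|SplitLemma=à/le|SplitUPos=ADP/DET|"
             "SplitFeats=_/Definite=Def,Gender=Masc,Number=Sing,PronType=Art")
    for row in expand_split_entry(entry):
        print("\t".join(row))
```

Because the extra information lives in one trailing column, a reader that only looks at the first five columns would still see a plain entry for au, which is the backwards-compatibility property the proposal is after.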

foxik commented 5 years ago

Currently that would be non-trivial to do (just because of how the implementation works).

There are two parts of the mentioned problem:

  1. The au multi-word token must be split into two words, à and le. Currently UDPipe does that in a very old-fashioned way, by having a dictionary with rules for how the multi-word tokens are split. It would not be difficult to allow adding additional rules, either during training or during inference.
  2. Run morphological analysis on the resulting words. UDPipe currently does not distinguish between tokens and multi-word tokens, so the analyses for à are the same regardless of whether it was a token or a part of a multi-word token, but of course that could be changed.
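The two parts can be illustrated with a small sketch. This is not UDPipe's actual implementation or data structures: `SPLIT_RULES`, `ANALYSES`, `split_tokens` and `analyze` are hypothetical names, and the rule table is a plain exact-match dictionary chosen only to show the idea.

```python
# Hypothetical rule table: surface form of a multi-word token -> its words.
SPLIT_RULES = {
    "au": ["à", "le"],
    "du": ["de", "le"],
}

# Hypothetical analyses keyed by word form only: (LEMMA, UPOS, FEATS).
ANALYSES = {
    "à": [("à", "ADP", "_")],
    "le": [("le", "DET", "Definite=Def|Gender=Masc|Number=Sing|PronType=Art")],
}


def split_tokens(tokens, rules=SPLIT_RULES):
    """Part 1: replace each token that has a rule with its component words."""
    words = []
    for token in tokens:
        words.extend(rules.get(token, [token]))
    return words


def analyze(words, analyses=ANALYSES):
    """Part 2: look up analyses by form alone.

    The lookup has no idea whether a word came from a multi-word token,
    so "à" gets the same analyses in both cases.
    """
    return [(word, analyses.get(word, [])) for word in words]


if __name__ == "__main__":
    # Extra rules could be layered on top of the built-in ones.
    rules = dict(SPLIT_RULES, aux=["à", "les"])
    words = split_tokens(["Je", "vais", "au", "marché"], rules)
    print(words)
    print(analyze(words))
```

Making the analyses depend on context would mean keying the second lookup on more than the word form, for example on the form together with the multi-word token it came from.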

I have no suggestions for what the dictionary should look like. In the future, I would prefer to allow specifying morphological analyses for the words themselves (so that any morphological analysis system can be used, not just a flat plain-text file), so I am not planning to extend the flat morphological file myself.

ftyers commented 4 years ago

There are some other issues that relate to this: #63 and #50