Open jeanm opened 5 years ago
Currently that would be non-trivial to do, simply because of how the implementation works.
There are two parts to the problem:

- The `au` multi-word token must be split into the two words `à` and `le`. Currently UDPipe does that in a very old-fashioned way, using a dictionary of rules for how multi-word tokens are split. It would not be difficult to allow adding additional rules, either during training or during inference.
- The analyses of `à` are the same regardless of whether it was a token on its own or part of a multi-word token, but of course that could be modified.

I have no suggestions for what the dictionary should look like. In the future, I would prefer to allow specifying morphological analyses for the words themselves (so that any morphological analysis system can be used, not just a flat plain-text file), so I am not planning to extend the flat morphological file myself.
Some other issues relate to this: #63 and #50.
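To make the idea of rule-based splitting concrete, here is a minimal Python sketch of a dictionary-driven splitter. It is not UDPipe's implementation: the `MWT_RULES` table and the `split_multiword_tokens` function are hypothetical names invented for this illustration, and the output simply follows the CoNLL-U convention of an ID range line (such as `3-4`) followed by one line per word.

```python
# Hypothetical rule table: surface form -> the words it expands to.
# Illustration only; this is not UDPipe's internal data structure.
MWT_RULES = {
    "au": ["à", "le"],
    "aux": ["à", "les"],
}


def split_multiword_tokens(tokens, rules=MWT_RULES):
    """Return CoNLL-U style (ID, FORM) pairs, expanding multi-word tokens.

    Lookup is case-insensitive; the expanded words are emitted exactly as
    written in the rule, so the original capitalization is not restored.
    """
    rows = []
    next_id = 1
    for token in tokens:
        words = rules.get(token.lower())
        if words is None:
            # Ordinary token: one row with a plain integer ID.
            rows.append((str(next_id), token))
            next_id += 1
            continue
        # Multi-word token: a range row, then one row per word.
        last_id = next_id + len(words) - 1
        rows.append((f"{next_id}-{last_id}", token))
        for word in words:
            rows.append((str(next_id), word))
            next_id += 1
    return rows


if __name__ == "__main__":
    # Extra rules can be layered on top of the defaults at call time.
    extended = dict(MWT_RULES, du=["de", "le"])
    for row_id, form in split_multiword_tokens(
        ["Je", "vais", "au", "marché"], rules=extended
    ):
        print(f"{row_id}\t{form}")
```

Passing the rule table as an argument is one way to model "adding additional rules" without touching the defaults. Whether UDPipe would expose something similar is an open question in this thread.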
(First of all, congrats on UDPipe, it's a pleasure to use!)
I've built a morphological generator for an endangered language, and I'm having it save its output in the tab-separated `FORM,LEMMA,UPOS,XPOS,FEATS` format so that I can also use it with UDPipe.

Is there any way of supporting multi-word tokens in that format, such that UDPipe will take them into account? For French, for example, I mean a way to specify that `au` should be split and tagged like so:
If not, I will probably need to extend this format on my own. Do you have any suggestions for a way to do this which could be backwards compatible with the format used by UDPipe?
I was thinking of something like the following:
It does seem awfully verbose though...
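For reference, the five-column tab-separated format mentioned in the question can be loaded with a few lines of Python. This is only a sketch under stated assumptions: `Analysis` and `parse_dictionary` are made-up names, the sample entries are illustrative rather than taken from a real dictionary, and nothing here reflects how UDPipe itself reads the file.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    """One line of the dictionary: FORM, LEMMA, UPOS, XPOS, FEATS."""
    form: str
    lemma: str
    upos: str
    xpos: str
    feats: str


def parse_dictionary(lines):
    """Group analyses by FORM from tab-separated five-column lines."""
    entries = {}
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(
                f"line {line_no}: expected 5 fields, got {len(fields)}"
            )
        analysis = Analysis(*fields)
        entries.setdefault(analysis.form, []).append(analysis)
    return entries


if __name__ == "__main__":
    # Illustrative entries only.
    sample = [
        "à\tà\tADP\t_\t_\n",
        "le\tle\tDET\t_\tDefinite=Def|Gender=Masc|Number=Sing|PronType=Art\n",
    ]
    for form, analyses in parse_dictionary(sample).items():
        print(form, [a.upos for a in analyses])
```

This reader is deliberately strict about the column count, so any extension that adds columns for multi-word tokens would need the check relaxed or a separate file.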