Open msklvsk opened 6 years ago
That is an interesting idea. Currently UDPipe can utilize only some columns in the CoNLL-U file, so using FEATS is probably the only possibility for now. But as you say, it is suboptimal, especially since we consider FEATS as a whole instead of being able to look at individual features.
So either we could implement utilizing individual features from FEATS (which we should do anyway), or support explicit "external" knowledge, i.e., a mapping from FORM (or maybe any other column) to a value, which is passed to the tagger/parser/etc.
I will be improving support for the morphological dictionary in several months (because currently it needs to be specified during training and is embedded in the model; we want to be able to utilize any given dictionary during inference, and I wanted to add support for providing only some of the columns). Maybe during the rewrite I could generalize the dictionary to also provide "additional" columns (like valency), which would be passed to the tagger/lemmatizer/parser. I will think about it, and I am leaving this open as a reminder.
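To make the "FEATS as a whole" versus "individual features" distinction concrete, here is a minimal Python sketch. It is not UDPipe code; `split_feats` is a hypothetical helper, and it relies only on the CoNLL-U convention that FEATS is a `|`-separated list of `Name=Value` pairs, with `_` meaning empty.

```python
def split_feats(feats):
    """Split a CoNLL-U FEATS string into a {name: value} dict.

    Hypothetical helper for illustration only; not part of UDPipe.
    """
    if feats == "_":  # "_" marks an empty FEATS column
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


a = "Case=Nom|Number=Sing"
b = "Case=Nom|Number=Plur"

# Treated as a whole, the two FEATS values are simply different symbols.
same_as_whole = (a == b)

# Treated feature by feature, they visibly share Case=Nom.
fa, fb = split_feats(a), split_feats(b)
shared = {name for name in fa if fb.get(name) == fa[name]}
```

Compared as whole strings, the two values have nothing in common, while the per-feature view shows that they agree on `Case`. That is the information lost when FEATS is used as a single opaque value.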
You can provide a dictionary of average lengths of objects. The parser will deep-learn that bigger objects are rarely inside smaller ones, which should help to disambiguate e.g. the classic "Alice drove down the street in her car."
If there is (for example) a valency dictionary, one can tag each verb in the gold standard with valency, train the parser using that additional annotation, and then provide the dictionary at the inference stage so that the parser can make better, more informed decisions, like UDPipe already does with a morphological dictionary. I wonder if putting everything into the FEATS column isn't suboptimal. Should there be a dedicated way to aid the parser with additional non-morphological annotation, or should using FEATS suffice? What if one does not have a morpho dict but has a valency dict?
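For reference, the FEATS workaround described here could be sketched roughly as follows in Python. This is only an illustration under assumptions: the feature name `Valency`, its values, the lemma-keyed dictionary, and both helper functions are hypothetical and not part of UDPipe or the UD feature inventory. The only thing taken as given is the standard 10-column, tab-separated CoNLL-U layout (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC).

```python
def add_feature(feats, name, value):
    """Return a FEATS string with name=value added, sorted by feature name."""
    pairs = {} if feats == "_" else dict(p.split("=", 1) for p in feats.split("|"))
    pairs[name] = value
    ordered = sorted(pairs.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{k}={v}" for k, v in ordered)


def annotate_conllu(lines, valency, feature="Valency"):
    """Add a (hypothetical) valency feature to the FEATS column of verbs.

    `valency` maps a lemma to a label; lemmas missing from it are left as-is.
    """
    out = []
    for line in lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        # Leave comments, blank lines, and anything not 10 columns wide untouched.
        if line.startswith("#") or len(cols) != 10:
            out.append(line)
            continue
        lemma, upos = cols[2], cols[3]
        if upos == "VERB" and lemma in valency:
            cols[5] = add_feature(cols[5], feature, valency[lemma])
        out.append("\t".join(cols))
    return out


sentence = [
    "# text = Alice drove",
    "1\tAlice\tAlice\tPROPN\tNNP\tNumber=Sing\t2\tnsubj\t_\t_",
    "2\tdrove\tdrive\tVERB\tVBD\tMood=Ind|Tense=Past\t0\troot\t_\t_",
]
annotated = annotate_conllu(sentence, {"drive": "Trans"})
```

The same function would have to be applied to the training data and to the input at inference time, which is exactly the awkward part: the annotation is squeezed into a morphological column rather than supplied through a dedicated channel.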