ufal / udpipe

UDPipe: Trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files
Mozilla Public License 2.0

Universal dictionary with information about the world #80

Open msklvsk opened 6 years ago

msklvsk commented 6 years ago

If there is (for example) a valency dictionary, one can tag each verb in the gold standard with its valency, train the parser using that additional annotation, and then provide the dictionary at the inference stage so that the parser can make better, more informed decisions — like UDPipe already does with a morphological dictionary. I wonder whether putting everything into the FEATS column isn't suboptimal. Should there be a dedicated way to aid the parser with additional non-morphological annotation, or should using FEATS suffice? What if one does not have a morpho dict but has a valency dict?
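To make the FEATS approach concrete, here is a minimal sketch (not part of UDPipe; the valency dictionary, its class labels, and the file names are made up) of how a valency dictionary could be merged into the FEATS column of a CoNLL-U file before training:

```python
# Sketch: inject valency classes from a hypothetical dictionary into FEATS.
# CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC

# Hypothetical valency dictionary: lemma -> valency class label.
VALENCY = {
    "give": "Ditransitive",
    "sleep": "Intransitive",
    "see": "Transitive",
}

def add_valency_feature(conllu_line: str) -> str:
    if not conllu_line.strip() or conllu_line.startswith("#"):
        return conllu_line                 # comments and blank lines pass through
    cols = conllu_line.rstrip("\n").split("\t")
    if len(cols) != 10 or not cols[0].isdigit():
        return conllu_line                 # skip multi-word tokens and empty nodes
    upos, feats = cols[3], cols[5]
    valency = VALENCY.get(cols[2])         # look up by LEMMA
    if upos == "VERB" and valency:
        extra = f"Valency={valency}"
        cols[5] = extra if feats == "_" else "|".join(sorted(feats.split("|") + [extra]))
    return "\t".join(cols) + "\n"

with open("train.conllu") as src, open("train.valency.conllu", "w") as dst:
    for line in src:
        dst.write(add_valency_feature(line))
```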

foxik commented 6 years ago

That is an interesting idea. Currently UDPipe can utilize only some columns of the CoNLL-U file, so using FEATS is probably the only possibility for now. But as you say, it is suboptimal, especially since we consider FEATS as a whole instead of being able to look at individual features.

So we could either implement utilizing individual features from FEATS (which we should do anyway), or support explicit "external" knowledge (i.e., a mapping from FORM, or possibly any other column, to a value, which is passed to the tagger/parser/...).
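For the first option, the difference is just between treating FEATS as one atomic string and decomposing it into individual feature–value pairs; a tiny illustrative snippet (nothing UDPipe-specific, the helper name is invented):

```python
# Split a FEATS string into individual feature=value pairs so that a
# tagger/parser could condition on each feature separately instead of
# on the whole string as a single atomic value.
def split_feats(feats: str) -> dict:
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

print(split_feats("Case=Nom|Gender=Fem|Number=Sing"))
# {'Case': 'Nom', 'Gender': 'Fem', 'Number': 'Sing'}
```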

I will be improving support for the morphological dictionary in the coming months (currently it has to be specified during training and is embedded in the model; we want to be able to utilize any given dictionary during inference, and I also want to add support for providing only some of the columns). Maybe during the rewrite I could generalize the dictionary to also provide "additional" columns (like valency), which would be passed to the tagger/lemmatizer/parser. I will think about it, and I am leaving this open as a reminder.
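A hedged sketch of what such a generalized dictionary could look like — the file layout, the example row, and the loader below are purely hypothetical, not a format UDPipe reads today:

```python
# Hypothetical generalized dictionary: each tab-separated row maps a FORM to
# its morphology plus any number of "additional" columns such as valency, e.g.
#
#   dává	dávat	VERB	Aspect=Imp|Mood=Ind|Number=Sing	Valency=Ditransitive
#
# A loader could then expose the extra columns to the tagger/lemmatizer/parser.
from collections import defaultdict

def load_generalized_dictionary(path: str):
    """Return form -> list of (lemma, upos, feats, extras) analyses."""
    entries = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 4:
                continue
            form, lemma, upos, feats, *extras = cols
            entries[form].append((lemma, upos, feats, extras))
    return entries
```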

msklvsk commented 6 years ago

A fun example

You could provide a dictionary of the average lengths of objects. The parser would deep-learn that bigger objects are rarely inside smaller ones, which should help to disambiguate e.g. the classic *Alice drove down the street in her car*.
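Purely for illustration, such a lexicon could be as simple as a lemma → size-class mapping whose values get serialized into FEATS or MISC (the labels and helper below are invented):

```python
# Toy object-size lexicon for PP-attachment disambiguation: a lemma -> size
# class that could be injected as an extra per-token attribute, so a parser
# can learn that large containers rarely attach inside small ones.
SIZE_CLASS = {
    "car": "Large",
    "street": "Huge",
    "pocket": "Small",
    "coin": "Tiny",
}

def size_feature(lemma: str) -> str:
    # Would become e.g. "Size=Large" appended to the token's FEATS or MISC.
    cls = SIZE_CLASS.get(lemma)
    return f"Size={cls}" if cls else "_"

print(size_feature("car"))     # Size=Large
print(size_feature("street"))  # Size=Huge
```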