ufal / treex

Treex NLP framework

Fused words in Universal Dependencies #17

Open dan-zeman opened 9 years ago

dan-zeman commented 9 years ago

From https://github.com/ufal/lindat-corpora-conversions/issues/3#issuecomment-136326528 :

I think we need a better representation of fused tokens in Treex. For now it is only sketched using the wild attributes, but it will probably be needed in the future, as it is part of the UD guidelines. So we need a less wild solution. Once we have it, we could try to implement directly in Treex the heuristics that will collapse fused words whenever desirable. And once we have that, we should probably apply it before exporting data for KonText, because the surface form matters here.

martinpopel commented 9 years ago

I agree we need a better (less wild) API for fused (aka multi-word) tokens in Treex.

I am not sure how it will solve the problem in KonText, which can probably display either only tokens or only words. There are scripts distributed with UD (e.g. conllu-w2t.py) for converting the word-indexed CoNLL-U format to other formats.
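To make the word/token distinction concrete, here is a minimal Python sketch of how the surface tokens of one sentence could be recovered from CoNLL-U. It relies only on the ID conventions of the format: a range ID such as `1-2` marks a multi-word token, and an ID containing a dot marks an empty node. The function name `surface_tokens` and the `SAMPLE` data are made up for illustration; this is not what conllu-w2t.py does internally, and it is not a Treex API.

```python
def surface_tokens(conllu_lines):
    """Return the surface token forms of ONE sentence given as CoNLL-U lines.

    Hypothetical helper for illustration only.
    """
    tokens = []
    covered_until = 0  # highest word ID covered by the last range line seen
    for line in conllu_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        word_id, form = cols[0], cols[1]
        if "-" in word_id:
            # Multi-word token line, e.g. "1-2": keep its surface form
            # and remember which syntactic words it covers.
            _, end = word_id.split("-")
            covered_until = int(end)
            tokens.append(form)
        elif "." in word_id:
            continue  # empty node (e.g. "1.1"); it has no surface form
        elif int(word_id) > covered_until:
            tokens.append(form)  # ordinary word outside any fused token
    return tokens


# Illustrative sentence "vámonos al mar"; columns other than ID, FORM and
# LEMMA are left unspecified ("_") on purpose.
SAMPLE = [
    "# text = vámonos al mar",
    "\t".join(["1-2", "vámonos"] + ["_"] * 8),
    "\t".join(["1", "vamos", "ir"] + ["_"] * 7),
    "\t".join(["2", "nos", "nosotros"] + ["_"] * 7),
    "\t".join(["3-4", "al"] + ["_"] * 8),
    "\t".join(["3", "a", "a"] + ["_"] * 7),
    "\t".join(["4", "el", "el"] + ["_"] * 7),
    "\t".join(["5", "mar", "mar"] + ["_"] * 7),
]

print(surface_tokens(SAMPLE))
```

A tool that can show only one level would get either these three tokens or the five syntactic words, which is why the choice has to be made at export time.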

See also
http://universaldependencies.github.io/docs/cs/overview/tokenization.html
http://universaldependencies.github.io/docs/u/overview/tokenization.html
http://universaldependencies.github.io/docs/format.html#words-and-tokens