ufal / lindat-kontext

An alternative web front-end for the Manatee corpus search engine
GNU General Public License v2.0

syntax view: adjust for fused tokens that exist in the UD corpora #238

Open Ansa211 opened 5 years ago

Ansa211 commented 5 years ago

Fused tokens are a feature of UD data. An example from the Spanish GSD data (https://lindat.mff.cuni.cz/services/kontext/view?q=~rMfUdbfRKQK6):

Original UD data:

2-4 Nótese  _   _   _   _   _   _   _   _
2   Nó  nó  VERB    _   Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin   0   root    _   _
3   te  tú  PRON    _   Case=Acc,Dat|Number=Sing|Person=2|PrepCase=Npr|PronType=Prs 2   iobj    _   _
4   se  él  PRON    _   Case=Acc,Dat|Person=3|PrepCase=Npr|PronType=Prs|Reflex=Yes  2   iobj    _   _

which is converted to the following single row (apart from the order of attributes, additional attributes being copied from parents to children, etc.):

2-3-4   Nótese  nó|tú|él    VERB|PRON   _   Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin|Case=Acc,Dat|Person=2|PrepCase=Npr|PronType=Prs|Reflex=Yes    0|2 root|iobj   _   _
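The merge shown above can be approximated with a short sketch. This is not the actual conversion script (which is not part of this issue); it assumes that each column is simply the `|`-joined list of the distinct values of the syntactic words, in order of first occurrence. The function names and the dictionary layout are hypothetical.

```python
def join_unique(values):
    """Join values with '|', keeping only the first occurrence of each."""
    return "|".join(dict.fromkeys(values))


def merge_fused(surface_form, words):
    """Collapse the syntactic words of one fused token into a single row."""
    # Flatten all features; "_" means "no features" in CoNLL-U.
    feats = [f for w in words for f in w["feats"].split("|") if f != "_"]
    return {
        "id": "-".join(str(w["id"]) for w in words),
        "form": surface_form,
        "lemma": join_unique(w["lemma"] for w in words),
        "upos": join_unique(w["upos"] for w in words),
        "feats": join_unique(feats) or "_",
        "head": join_unique(str(w["head"]) for w in words),
        "deprel": join_unique(w["deprel"] for w in words),
    }


# The three syntactic words of "Nótese" from the Spanish GSD example.
words = [
    {"id": 2, "lemma": "nó", "upos": "VERB", "head": 0, "deprel": "root",
     "feats": "Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin"},
    {"id": 3, "lemma": "tú", "upos": "PRON", "head": 2, "deprel": "iobj",
     "feats": "Case=Acc,Dat|Number=Sing|Person=2|PrepCase=Npr|PronType=Prs"},
    {"id": 4, "lemma": "él", "upos": "PRON", "head": 2, "deprel": "iobj",
     "feats": "Case=Acc,Dat|Person=3|PrepCase=Npr|PronType=Prs|Reflex=Yes"},
]
merged = merge_fused("Nótese", words)
```

Under these assumptions, `merged["head"]` is `0|2` and `merged["deprel"]` is `root|iobj`, which is exactly the multi-valued dependency information the syntax viewer currently cannot display.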

This issue asks for an update of the syntax viewer so that it can handle the fact that fused tokens have multiple dependencies.

ÚČNK has developed (or is developing) its own solution, because it will be using UDPipe as the default parser for InterCorp, so perhaps we could rely on that.

Another option would be to add a computed attribute that extracts a single parent for the whole fused token. Often there is a single "external" parent, and the remaining parts of the fused token depend on the head of the token's internal structure, as is the case in the example above.
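A minimal sketch of how such a computed attribute might pick the parent, assuming the IDs and heads of the syntactic words are available at conversion time. The function names are hypothetical; the sketch returns `None` when the parts of the fused token attach to more than one word outside it.

```python
def external_heads(word_ids, heads):
    """Return (word_id, head) pairs whose head lies outside the fused token."""
    inside = set(word_ids)
    return [(w, h) for w, h in zip(word_ids, heads) if h not in inside]


def single_parent(word_ids, heads):
    """Return the one external parent of a fused token, or None if ambiguous."""
    parents = {h for _, h in external_heads(word_ids, heads)}
    return parents.pop() if len(parents) == 1 else None


# "Nótese": words 2, 3, 4 with heads 0, 2, 2; only word 2 points outside.
notese_parent = single_parent([2, 3, 4], [0, 2, 2])
```

For the example above this yields `0` (the root), because words 3 and 4 both depend on word 2 inside the token. A token whose parts have different external heads would need another strategy.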

Ansa211 commented 5 years ago

Here is an example of a token with two external dependencies (thanks to A. Rosen). In their approach, the syntax tree is constructed from the syntactic words (the individual parts that are fused together to form the orthographic word).

<s id="2.10" text="Ale měla by sis uspořádat život.">
1   Ale Ale Ale ale CCONJ   J^------------- _   2   cc  _   _
2   měla    měla    měla    mít VERB    VpFS----R-AA--- Gender=Fem|Number=Sing|Polarity=Pos|Tense=Past|VerbForm=Part|Voice=Act  0   root    _   _
3   by  by  by  být AUX Vc------------- Mood=Cnd|VerbForm=Fin   2   aux _   _
4-5 sis 1:|si|2:|s  1:|si|2:|jsi    1:|se|2:|být    1:|PRON|2:|AUX  1:|P7-S3--2-------|2:|VB-S---2P-AA---   1:|Case=Dat|Number=Sing|PronType=Prs|Reflex=Yes|Variant=Short|2:|Mood=Ind|Number=Sing|Person=2|Polarity=Pos|Tense=Pres|VerbForm=Fin|Voice=Act   1:|6|2:|2   1:|obl|2:|aux   _   _
6   uspořádat   uspořádat   uspořádat   uspořádat   VERB    Vf--------A---- Polarity=Pos|VerbForm=Inf   2   xcomp   _   _
7   život   život   život   život   NOUN    NNIS4-----A---- Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|Polarity=Pos  6   obj _   SpaceAfter=No
8   .   .   .   .   PUNCT   Z:------------- _   2   punct   _   SpacesAfter=\r\n
</s>
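To build a tree from the syntactic words, a viewer would first have to split values such as `1:|si|2:|s` back into per-word values. The sketch below assumes, based only on the single example above, that each part is introduced by an item of the form `N:` and that items are separated by `|`; the function name is hypothetical and the real ÚČNK format may differ.

```python
import re

# An item such as "1:" or "2:" that opens the values of one syntactic word.
MARKER = re.compile(r"^(\d+):$")


def split_fused_value(value):
    """Map each syntactic-word index to its list of values."""
    items = value.split("|")
    if not MARKER.match(items[0]):
        # Ordinary (non-fused) token: everything belongs to word 1.
        return {1: items}
    parts = {}
    current = None
    for item in items:
        m = MARKER.match(item)
        if m:
            current = int(m.group(1))
            parts[current] = []
        else:
            parts[current].append(item)
    return parts


# Head and relation columns of the fused token "sis" (words 4-5).
sis_heads = split_fused_value("1:|6|2:|2")
sis_deprels = split_fused_value("1:|obl|2:|aux")
```

Here `sis_heads` maps part 1 to head `6` and part 2 to head `2`, so the two parts of `sis` attach to different words outside the token, and no single parent can be computed for it.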