Open Ansa211 opened 5 years ago
A token with two external dependencies (thanks to A. Rosen). In their approach, the syntax tree is constructed from the syntactic words (the individual parts that are fused together to form the orthographic word).
<s id="2.10" text="Ale měla by sis uspořádat život.">
1 Ale Ale Ale ale CCONJ J^------------- _ 2 cc _ _
2 měla měla měla mít VERB VpFS----R-AA--- Gender=Fem|Number=Sing|Polarity=Pos|Tense=Past|VerbForm=Part|Voice=Act 0 root _ _
3 by by by být AUX Vc------------- Mood=Cnd|VerbForm=Fin 2 aux _ _
4-5 sis 1:|si|2:|s 1:|si|2:|jsi 1:|se|2:|být 1:|PRON|2:|AUX 1:|P7-S3--2-------|2:|VB-S---2P-AA--- 1:|Case=Dat|Number=Sing|PronType=Prs|Reflex=Yes|Variant=Short|2:|Mood=Ind|Number=Sing|Person=2|Polarity=Pos|Tense=Pres|VerbForm=Fin|Voice=Act 1:|6|2:|2 1:|obl|2:|aux _ _
6 uspořádat uspořádat uspořádat uspořádat VERB Vf--------A---- Polarity=Pos|VerbForm=Inf 2 xcomp _ _
7 život život život život NOUN NNIS4-----A---- Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|Polarity=Pos 6 obj _ SpaceAfter=No
8 . . . . PUNCT Z:------------- _ 2 punct _ SpacesAfter=\r\n
</s>
Fused tokens are a feature of UD data. An example from the Spanish GSD data (https://lindat.mff.cuni.cz/services/kontext/view?q=~rMfUdbfRKQK6):
Original UD data:
which is converted to (except for order of attributes; additional attributes are copied from parents to children etc.):
This issue asks for an update of the syntax viewer so that it can handle the fact that fused tokens have multiple dependencies.
ÚČNK has developed (or is developing) their own solution, because they will be using UDPipe as the default parser for InterCorp, so maybe we could rely on it.
Another option would be to add a computed attribute that extracts a single parent for the whole fused token. Often there is a single "external" parent, and the remaining parts of the fused token depend on the head of the fused token's internal structure, as is the case in the above example.
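To make the computed-attribute idea more concrete, here is a minimal Python sketch. It assumes the `1:|value|2:|value` encoding seen in the Czech sample above and that the parts of a fused token have consecutive ids. The function names (`split_fused`, `external_parents`, `single_parent`) are hypothetical and not part of any existing tool. The second usage case uses made-up head values for illustration only.

```python
from typing import List, Optional


def split_fused(value: str) -> List[str]:
    """Split a fused-token attribute such as '1:|6|2:|2' into ['6', '2'].

    Assumes the 'N:|value' layout from the sample above; a part's own
    value may itself contain '|' (e.g. morphological features).
    """
    parts: List[List[str]] = []
    for chunk in value.split("|"):
        if chunk.endswith(":") and chunk[:-1].isdigit():
            # A marker like '1:' or '2:' opens the next part.
            parts.append([])
        elif parts:
            parts[-1].append(chunk)
        else:
            # No part markers at all: treat it as an ordinary token value.
            return [value]
    return ["|".join(p) for p in parts]


def external_parents(first_id: int, heads: List[int]) -> List[int]:
    """Return the distinct heads that lie outside the fused token."""
    inside = range(first_id, first_id + len(heads))
    result: List[int] = []
    for head in heads:
        if head not in inside and head not in result:
            result.append(head)
    return result


def single_parent(first_id: int, heads: List[int]) -> Optional[int]:
    """Return the external parent if there is exactly one, else None."""
    external = external_parents(first_id, heads)
    return external[0] if len(external) == 1 else None


# Token 4-5 "sis" from the sample: its parts attach to tokens 6 and 2.
czech_heads = [int(h) for h in split_fused("1:|6|2:|2")]
czech_parent = single_parent(4, czech_heads)

# Hypothetical fused token 3-4 whose second part depends on its first part.
other_parent = single_parent(3, [1, 3])
```

With this sketch, `czech_parent` is `None` because "sis" has two external parents, so the viewer would still need a fallback for such tokens (for example, picking the first part's head). `other_parent` is `1`, the single external parent.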