multi-word tokens

Ansa211 commented 6 years ago

They are present in the Spanish and French data:

https://lindat.mff.cuni.cz/services/kontext-staging/ansa/view?ctxattrs=word&attr_vmode=visible&pagesize=40&refs=%3Ddoc.id&q=~bPclKIrP&viewmode=kwic&attrs=word&corpname=parseme_es_a&structs=p%2Cg%2Cerr%2Ccorr&attr_allpos=kw

Ansa211 commented 6 years ago

Also, the German data contains beauties like this:

languagerecipes commented 6 years ago

What is the problem, could you please elaborate in the title?

Ansa211 commented 6 years ago

A single multi-word token (like Spanish del=de+el) is listed as three tokens (del de el); the conversion script is not aware that this type of item exists in the .conllu format.

languagerecipes commented 6 years ago

I am a little confused; I can only guess that you mean there is a bug in the conversion code. @natalink, are you checking for ids that contain "-", e.g.,

3-4 del _   _
3   de  _   _
4   el  _   _

The line 3-4 del _ _ should not appear in the vertical file.
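A minimal sketch of the check asked about above, assuming the first whitespace-separated column holds the token id. The names `is_multiword_range` and `drop_range_lines` are hypothetical and are not taken from the actual conversion script:

```python
import re

# A range id looks like "3-4": two integers joined by a hyphen.
RANGE_ID = re.compile(r"^\d+-\d+$")


def is_multiword_range(line: str) -> bool:
    """Return True when the first column is a range id such as '3-4'."""
    if not line.strip():
        return False
    first_column = line.split(None, 1)[0]
    return bool(RANGE_ID.match(first_column))


def drop_range_lines(lines):
    """Keep every line except the multi-word token range lines."""
    return [line for line in lines if not is_multiword_range(line)]
```

Comment lines starting with "#" and blank sentence separators pass through unchanged, since their first column never matches the range pattern.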

Ansa211 commented 6 years ago

Some observations that are relevant to the necessary changes in the code:

In the Spanish data, a few multi-word tokens are part of an MWE; the annotation may be marked either on the multi-word token or on its subparts, so that in ES/train.parsemetsv, there is

4-5   al      _       1
4     a       _       _
5     el      _       _

and also

15-16 pronunciarse    _       _
15    pronunciar      _       1:IReflV
16    se      _       1

In the .conllu data, the multi-word token itself has only the word form, all other fields are filled with "_", except for a few cases of SpaceAfter=No in the ES/test.conllu file (which should be addressed in code solving #14):

7-8      hacerlo _       _       _       _       _       _       _       SpaceAfter=No
7        hacer   hacer   VERB    _       VerbForm=Inf    10      acl     _       _
8        lo      él      PRON    _       Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs     7       dobj    _       _

The implemented solution should be the same as that applied to converting the UD data (https://github.com/ufal/lindat-corpora-conversions/issues/3): a single word will be included in the output vertical format, with its form equal to the form of the multi-word token; all remaining attributes will be multivalue attributes taking the values of both parts of the multi-word token. In the case of the mwe attributes, the most specific value will be taken (in other words, the whole multi-word token will be annotated as an MWE whenever at least one of its parts is annotated as such).
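One possible reading of "the most specific value" for the mwe attribute, as a hedged sketch: it assumes each field holds at most one MWE code, and that a code with a category (1:IReflV) is more specific than a bare number (1), which in turn beats the empty value "_". The function names are hypothetical:

```python
def mwe_specificity(value: str) -> int:
    """Rank an mwe field: 0 for empty, 1 for a bare id, 2 for id:category."""
    if value in ("", "_"):
        return 0
    return 2 if ":" in value else 1


def merge_mwe(values):
    """Pick the most specific mwe value among the range line and its parts."""
    # max() returns the first of equally ranked values, so order breaks ties.
    return max(values, key=mwe_specificity, default="_")
```

With the two Spanish examples above, the values of "al" ("1", "_", "_") would merge to "1", and those of "pronunciarse" ("_", "1:IReflV", "1") would merge to "1:IReflV".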

natalink commented 6 years ago

I tried to tackle this problem some time ago, but because I was working on it alone and received no feedback on whether I was doing it right, I just left it to the "future generation" (Ansa). Here is a copy-paste from that code to illustrate what it WAS doing:

###########input###########
# testfile sent_id  2
#1       Gas     gas     NOUN    S       Gender=Masc     0       root    _       _
#2-3     dalla   _       _       _       _       _       _       _       _
#2       da      da      ADP     E       _       4       case    _       _
#3       la      il      DET     RD      Definite=Def|Gender=Fem|Number=Sing|PronType=Art        4       det     _    _
#4       statua  statua  NOUN    S       Gender=Fem|Number=Sing  1       nmod    _       _
#5       .       .       PUNCT   FS      _       1       punct   _       _

#If the corpus is compiled in Manatee as above, the (2-3) token will be ignored by the indexer, and the user therefore will not be able to find
# the lemma "dalla" itself. 

###########output#############
# testfile sent_id  2
#1       Gas     gas     NOUN    S       Gender=Masc     0       root    _       _
#2       dalla   da|il   ADP+DET E+RD    _+Definite=Def|Gender=Fem|Number=Sing|PronType=Art      3       case+det     _       _+_
#3       statua  statua  NOUN    S       Gender=Fem|Number=Sing  1       nmod    _       _
#4       .       .       PUNCT   FS      _       1       punct   _       _

# The query for the whole word: [word="dalla"], lemmas will be processed as multivalues, so both the queries [lemma="da"] and
#[lemma="il"] are valid. Still this solution is not full as we have to process ufeats, deprels and other attributes in a better manner. How?

+150000 for Ansa's idea to represent the other attributes as multivalue. As far as I remember, I made it more complicated because, e.g., in the case of hacerlo, the POS should be VERB, and with this approach it will be VERB|PRON. But then we would really have to work out some rules for deciding which word is the 'main' one, so it is easier to keep it as simple as Ansa suggested. Also beware that ufeats already use the separator '|', so the result will be just a bag of features for, e.g., a verb and a pronoun.

Ansa211 commented 6 years ago

I also use multivalue ids, because that way, I do not even have to solve the issue of renumbering the tokens (which is one of the things Natalia's original solution had to deal with); the multi-word token simply has multi-value "id" 2-3, and all the remaining tokens keep their "head" value unchanged. Also the multi-word token itself may have a multi-value "head" (one of its parts usually depends on the other of its parts). But wherever the value is the same for both subwords, I keep it just once; and if one of the subwords has an empty value of an attribute and the other one not, I keep just the filled in value:

#2-3       dalla   da|il   ADP+DET E+RD    Definite=Def|Gender=Fem|Number=Sing|PronType=Art      4       case+det     _       _
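The merging rule described here can be sketched roughly as follows. This is an illustration under assumptions, not the actual code: rows are lists of column values, "_" marks an empty value, a single "+" separator is used for every column (the example line above mixes "|" and "+"), and the names `merge_attr` and `merge_multiword` are made up:

```python
def merge_attr(values, sep="+", empty="_"):
    """Join distinct non-empty values in order; return `empty` if none remain."""
    kept = []
    for value in values:
        if value != empty and value not in kept:
            kept.append(value)
    return sep.join(kept) if kept else empty


def merge_multiword(range_row, part_rows, sep="+"):
    """Collapse a range row (e.g. id '2-3') and its parts into a single row."""
    # The id and form come from the range row, so no renumbering is needed.
    merged = list(range_row[:2])
    for col in range(2, len(range_row)):
        # The range row's own value is included so that something like
        # SpaceAfter=No on the multi-word token line is not lost.
        values = [range_row[col]] + [part[col] for part in part_rows]
        merged.append(merge_attr(values, sep))
    return merged
```

Choosing "+" rather than "|" as the separator keeps the merged feature bundles distinguishable from the "|" that already separates individual features.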

In cases such as hacerlo, we know which word is the head of the multi-word token (if one word depends on the other, the dependent one is the child), and we could keep the POS, head, deprel and maybe some other attributes of the head only; but dalla above shows that sometimes there is no head. Also, I think that if someone searches for PRON, it is good that the match includes the PRONs that are glued to verbs, so that the user is reminded that they have to consider them too.
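To illustrate the head question, here is a small sketch that looks for a single part whose head lies outside the multi-word token. It assumes each part is a dict with string "id" and "head" values; `find_internal_head` is a hypothetical name:

```python
def find_internal_head(parts):
    """Return the only part whose head is outside the token, else None."""
    ids = {part["id"] for part in parts}
    external = [part for part in parts if part["head"] not in ids]
    # Exactly one externally attached part means the others depend on it.
    return external[0] if len(external) == 1 else None
```

For hacerlo (hacer with head 10, lo with head 7) this would pick hacer; for dalla (da and la both with head 4) it returns None, which is the "no head" case mentioned above.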

natalink / mwe_noske

multi-word tokens #13