ufal / lindat-corpora-conversions

LINDAT Corpora Conversions

attributes for corpora in conllu format #11

Closed natalink closed 5 years ago

natalink commented 8 years ago

The question for discussion: which attributes should we use to get the most out of the parsed corpora in LINDAT? Should we try to be more consistent with the Czech National Corpus (CNC), or should we keep the attributes closer to the CoNLL-U format? So far, the CNC uses the following attributes:

word,lc,lemma,lemma_lc,tag,pos,case,proc,afun,parent,eparent,prep,p_lemma,p_tag,p_pos,p_case,p_afun,ep_lemma,ep_tag,ep_afun

Eparent (effective parent) is not in CoNLL-U, so we can ignore it. I suggest the following attributes for LINDAT corpora (UD and corpora parsed with UDPipe); I tried to stay as close as possible both to the CNC attribute ordering and to CoNLL-U:

word,lc,lemma,lemma_lc,pos,ufeat,deprel,ord,p_word,p_lemma,p_pos,p_ufeat,p_deprel,p_ord,parent,children

Several attributes that are specific to the CNC are missing from our schema (case and proc), and the following attributes are added: ord, p_ord (underscore if there is no parent, e.g. for the root), and children (the ord values of the child nodes, e.g. 2|4|7; underscore if there are no children). These added attributes make it possible to query at least three levels of the tree: node, children and parent.
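To make the three added attributes concrete, here is a minimal Python sketch of how ord, p_ord and children could be derived from the ID and HEAD columns of one CoNLL-U sentence. The function name `add_tree_attributes` and the sample sentence are hypothetical and only for illustration; this is not the conversion script used in this repository.

```python
def add_tree_attributes(conllu_sentence: str) -> list[dict]:
    """Derive ord, p_ord and children for every token of one CoNLL-U sentence."""
    tokens = []
    for line in conllu_sentence.strip().splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword ranges ("1-2") and empty nodes ("1.1")
        tokens.append({"ord": cols[0], "word": cols[1],
                       "head": cols[6], "deprel": cols[7]})

    # Map each head ID to the IDs of its dependents, in sentence order.
    children = {}
    for tok in tokens:
        children.setdefault(tok["head"], []).append(tok["ord"])

    rows = []
    for tok in tokens:
        rows.append({
            "word": tok["word"],
            "deprel": tok["deprel"],
            "ord": tok["ord"],
            # HEAD 0 marks the root, which has no parent: use an underscore.
            "p_ord": "_" if tok["head"] == "0" else tok["head"],
            # Child IDs joined with "|", or an underscore for leaf nodes.
            "children": "|".join(children.get(tok["ord"], [])) or "_",
        })
    return rows


# Hypothetical sample sentence, built with explicit tabs.
SAMPLE = "\n".join("\t".join(cols) for cols in [
    ["1", "She", "she", "PRON", "_", "_", "2", "nsubj", "_", "_"],
    ["2", "runs", "run", "VERB", "_", "_", "0", "root", "_", "_"],
    ["3", "a", "a", "DET", "_", "_", "4", "det", "_", "_"],
    ["4", "company", "company", "NOUN", "_", "_", "2", "dobj", "_", "_"],
])

for row in add_tree_attributes(SAMPLE):
    print(row["word"], row["ord"], row["p_ord"], row["children"], sep="\t")
```

In this sample the root "runs" gets p_ord "_" and children "1|4", while the leaf "She" gets p_ord "2" and children "_".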

Example query 1: Search for verb + direct object pairs in UD 1.3. The query [pos="VERB"] [deprel="dobj"] within <s/> gives 3,416 hits, including some we do not want, because the dobj token might be the object of a different verb. By referring to the order of the nodes we can make the query more precise: 1:[pos="VERB"] 2:[deprel="dobj"] & 1.ord=2.p_ord within <s/>
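The effect of the extra condition can be imitated outside the corpus manager. The following Python sketch is only a rough simulation of the two queries on hand-made token lists; the helper names and both toy sentences are assumptions for illustration, not data from UD 1.3.

```python
def adjacent_pairs(sentence):
    """Mimic [pos="VERB"] [deprel="dobj"]: a VERB directly followed by a dobj token."""
    return [(a, b) for a, b in zip(sentence, sentence[1:])
            if a["pos"] == "VERB" and b["deprel"] == "dobj"]


def linked_pairs(sentence):
    """Same pattern plus the global condition 1.ord = 2.p_ord."""
    return [(a, b) for a, b in adjacent_pairs(sentence)
            if a["ord"] == b["p_ord"]]


# "She runs marathons": the dobj really depends on the preceding verb.
good = [
    {"word": "She", "pos": "PRON", "deprel": "nsubj", "ord": "1", "p_ord": "2"},
    {"word": "runs", "pos": "VERB", "deprel": "root", "ord": "2", "p_ord": "_"},
    {"word": "marathons", "pos": "NOUN", "deprel": "dobj", "ord": "3", "p_ord": "2"},
]

# "Tell the people waiting everything": "everything" is the object of "Tell",
# not of the adjacent verb "waiting".
misleading = [
    {"word": "Tell", "pos": "VERB", "deprel": "root", "ord": "1", "p_ord": "_"},
    {"word": "the", "pos": "DET", "deprel": "det", "ord": "2", "p_ord": "3"},
    {"word": "people", "pos": "NOUN", "deprel": "iobj", "ord": "3", "p_ord": "1"},
    {"word": "waiting", "pos": "VERB", "deprel": "acl", "ord": "4", "p_ord": "3"},
    {"word": "everything", "pos": "PRON", "deprel": "dobj", "ord": "5", "p_ord": "1"},
]

for name, sent in [("good", good), ("misleading", misleading)]:
    print(name, len(adjacent_pairs(sent)), len(linked_pairs(sent)))
```

The simple adjacency pattern matches both toy sentences, while the version with the ord/p_ord check keeps only the first. Note that both variants still require the object to follow the verb immediately, as the original queries do.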

Example query 2: Search for the objects that the verb run can take and sort them by frequency: [p_lemma="run" & deprel="dobj"]. In UD 1.3 (204,586 positions) this gives 17 hits; in the parsed W2C (864,536,656 positions), 85,439 hits.
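As a rough illustration of what this query plus a frequency sort computes, here is a small Python sketch over a list of token records carrying the proposed p_lemma and deprel attributes. The function name `object_frequencies` and the toy tokens are hypothetical; the counts are not taken from either corpus.

```python
from collections import Counter


def object_frequencies(tokens, verb_lemma):
    """Count lemmas of tokens that are direct objects of the given verb lemma."""
    counts = Counter(
        tok["lemma"]
        for tok in tokens
        if tok["p_lemma"] == verb_lemma and tok["deprel"] == "dobj"
    )
    # most_common() returns (lemma, count) pairs sorted by descending count.
    return counts.most_common()


# Toy data: each record is one corpus position with its parent's lemma.
toy_tokens = [
    {"lemma": "business", "deprel": "dobj", "p_lemma": "run"},
    {"lemma": "marathon", "deprel": "dobj", "p_lemma": "run"},
    {"lemma": "business", "deprel": "dobj", "p_lemma": "run"},
    {"lemma": "she", "deprel": "nsubj", "p_lemma": "run"},    # not an object
    {"lemma": "book", "deprel": "dobj", "p_lemma": "read"},   # other verb
]

print(object_frequencies(toy_tokens, "run"))
```

Because p_lemma is stored on the object token itself, a single-position query is enough here and no global condition is needed.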

natalink commented 8 years ago

It takes much longer to execute queries with global conditions:

| query | UD 1.3 | Web corpus |
| --- | --- | --- |
| [p_lemma="run" & deprel="dobj"] | 2.86s | 8.36s |
| 1:[pos="VERB"] 2:[deprel="dobj"] & 1.ord=2.p_ord | 4.58s | 1m59.24s |
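For a quick sense of the slowdown, the timings above can be turned into ratios. This is a small Python sketch; the helper `to_seconds` is a hypothetical name, and the only inputs are the four timings reported in this comment.

```python
import re


def to_seconds(timing: str) -> float:
    """Convert a timing such as '2.86s' or '1m59.24s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", timing)
    if match is None:
        raise ValueError(f"unrecognised timing: {timing!r}")
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + float(seconds)


# (simple query, query with global condition) per corpus, as reported above.
timings = {
    "UD 1.3": ("2.86s", "4.58s"),
    "Web corpus": ("8.36s", "1m59.24s"),
}

slowdown = {
    corpus: round(to_seconds(with_global) / to_seconds(simple), 1)
    for corpus, (simple, with_global) in timings.items()
}
print(slowdown)
```

This gives a factor of about 1.6 on UD 1.3 and about 14.3 on the Web corpus. The two queries are different, so the ratios only indicate the order of magnitude of the cost of the global condition.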