The question for discussion: which attributes should we use to make the most of the parsed corpora in LINDAT?
Should we try to be more consistent with the Czech National Corpus (CNC), or should our attributes be closer to the CoNLL-U format?
So far, the CNC uses the following attributes:
Eparent (effective parent) is not in CoNLL-U, so we can ignore it. I suggest the following attributes for the LINDAT corpora (UD and corpora parsed with UDPipe); I tried to stay as close as possible both to the CNC attribute ordering and to CoNLL-U:
Several attributes that are specific to the CNC are missing from our schema (case and proc), and the following attributes are added: ord, p_ord (underscore if there is no parent, e.g. for the root), and children (the ord values of the child nodes, e.g. 2|4|7; underscore if there are no children).
These added attributes make it possible to query at least three levels of the tree: node, children and parent.
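To make the three added attributes concrete, here is a minimal Python sketch of how they could be derived from one CoNLL-U sentence. It assumes that ord corresponds to the CoNLL-U ID column and p_ord to the HEAD column; the function name `tree_attributes` and the sample sentence are made up for illustration and are not part of any existing conversion script.

```python
def tree_attributes(conllu_sentence):
    """Return one dict per token with ord, p_ord and children (hypothetical helper)."""
    rows = []
    for line in conllu_sentence.strip().splitlines():
        if line.startswith("#"):
            continue  # sentence-level comments
        cols = line.split("\t")
        # Skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if not cols[0].isdigit():
            continue
        rows.append((cols[0], cols[1], cols[6]))  # ID, FORM, HEAD

    # Collect the ord values of the children of every head.
    children = {}
    for ord_, _form, head in rows:
        children.setdefault(head, []).append(ord_)

    result = []
    for ord_, form, head in rows:
        result.append({
            "word": form,
            "ord": ord_,
            # Underscore if there is no parent (the root has HEAD 0).
            "p_ord": "_" if head == "0" else head,
            # Underscore if there are no children.
            "children": "|".join(children.get(ord_, [])) or "_",
        })
    return result


# Invented sample sentence in CoNLL-U column order.
SAMPLE = "\n".join("\t".join(row) for row in [
    ("1", "She", "she", "PRON", "_", "_", "2", "nsubj", "_", "_"),
    ("2", "runs", "run", "VERB", "_", "_", "0", "root", "_", "_"),
    ("3", "marathons", "marathon", "NOUN", "_", "_", "2", "dobj", "_", "_"),
    ("4", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"),
])

for token in tree_attributes(SAMPLE):
    print(token["word"], token["ord"], token["p_ord"], token["children"])
```

For the root token "runs" this gives p_ord "_" and children "1|3|4"; the leaf tokens get "_" as their children value.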
Example query 1: search for verb + direct object pairs in UD 1.3. The query
[pos="VERB"] [deprel="dobj"] within <s/>
gives 3,416 hits, including some that we do not want, because the dobj token may be the object of a different verb. Referring to the order of the nodes makes the query more precise:
1:[pos="VERB"] 2:[deprel="dobj"] & 1.ord=2.p_ord within <s/>
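The difference between the two queries can be illustrated on toy data. The sketch below only mimics the two conditions in plain Python (it is not how the corpus manager evaluates CQL), and the two sentences and their annotation are invented for the example. In "I know what you mean", the verb "know" is immediately followed by "what", which is a dobj of "mean", so only the first condition matches.

```python
def tok(ord_, word, pos, deprel, p_ord):
    """Build a toy token carrying only the attributes the queries use."""
    return {"ord": ord_, "word": word, "pos": pos, "deprel": deprel, "p_ord": p_ord}


def adjacent_pairs(sentence):
    """Mimic [pos="VERB"] [deprel="dobj"] within one sentence."""
    return [(a, b) for a, b in zip(sentence, sentence[1:])
            if a["pos"] == "VERB" and b["deprel"] == "dobj"]


def linked_pairs(sentence):
    """Additionally require 1.ord = 2.p_ord: the verb is the parent of the object."""
    return [(a, b) for a, b in adjacent_pairs(sentence) if a["ord"] == b["p_ord"]]


# Invented annotation: "what" depends on "mean", not on "know".
S1 = [tok("1", "I", "PRON", "nsubj", "2"), tok("2", "know", "VERB", "root", "_"),
      tok("3", "what", "PRON", "dobj", "5"), tok("4", "you", "PRON", "nsubj", "5"),
      tok("5", "mean", "VERB", "ccomp", "2")]
S2 = [tok("1", "She", "PRON", "nsubj", "2"), tok("2", "reads", "VERB", "root", "_"),
      tok("3", "books", "NOUN", "dobj", "2")]

for sentence in (S1, S2):
    print([(a["word"], b["word"]) for a, b in adjacent_pairs(sentence)],
          [(a["word"], b["word"]) for a, b in linked_pairs(sentence)])
```

Note that the refined condition still requires the object to follow the verb directly; it filters out wrong pairs but does not find objects that are further away.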
Example query 2: list the objects that the verb "run" can take and sort them by frequency:
[p_lemma="run" & deprel="dobj"]
In UD 1.3 (204,586 positions) this gives 17 hits; in the parsed W2C (864,536,656 positions) it gives 85,439 hits.
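The same selection and frequency sort can be sketched offline in Python. This assumes that p_lemma holds the lemma of the parent node, as the query suggests; the function name `object_frequencies` and the token list are invented for illustration and do not come from UD 1.3 or W2C.

```python
from collections import Counter


def object_frequencies(tokens, verb_lemma):
    """Count the lemmas of dobj tokens whose parent lemma is verb_lemma."""
    hits = [t["lemma"] for t in tokens
            if t["p_lemma"] == verb_lemma and t["deprel"] == "dobj"]
    # most_common() returns (lemma, count) pairs sorted by descending count.
    return Counter(hits).most_common()


# Invented toy tokens carrying only the attributes the query refers to.
TOY = [
    {"lemma": "marathon", "deprel": "dobj", "p_lemma": "run"},
    {"lemma": "business", "deprel": "dobj", "p_lemma": "run"},
    {"lemma": "marathon", "deprel": "dobj", "p_lemma": "run"},
    {"lemma": "marathon", "deprel": "dobj", "p_lemma": "watch"},
    {"lemma": "he", "deprel": "nsubj", "p_lemma": "run"},
]

print(object_frequencies(TOY, "run"))
```

Because the parent's lemma is stored on the object token itself, a single-token condition is enough here and the object does not have to be adjacent to the verb.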