what's the spec of a conll representation?

vcvpaiva commented 7 years ago

A sentence that works in SICK-SANE is

# text = People are walking
1	People	people	NOUN	NNS	_	3	nsubj	_	NNS|07942152-n|GroupOfPeople=
2	are	be	VERB	VBP	_	3	aux	_	VBP|02604760-v|Entity+
3	walking	walk	VERB	VBG	_	0	ROOT	_	VBG|01904930-v|Walking=

Using this as a model, we could say that a CoNLL representation is a pair of a sentence (in the example, "People are walking") and a matrix M. M always has as many rows as there are tokens/words in the sentence (only 3 in the example), and it always has ten columns.

column 1 is just a numbering of the tokens.
column 2 is the word as originally given
column 3 is the word of the sentence lowercased and lemmatized (walking ==> walk)
column 4 is the POS of the word in universal POS?
column 5 is the POS in Freeling's POS? (what's the set of labels? VBG=verb gerund, VBP aux?)
column 6 is ALWAYS empty?
column 7 is an encoding of the dependency tree?
column 8 is the collection of labels of the dependency tree
column 9 is always EMPTY?
column 10 is the processing produced by Freeling, including the mapping into PWN and SUMO.

is this correct?

fcbr commented 7 years ago

The CoNLL-U format is documented here.

Notes:

Column 5 is what is called a "language-specific POS". In practice, however, corpora usually put there the legacy POS tag that the corpus was originally annotated with. Since we are using Parsey McParseface, we need to figure out the legacy tagset used in the corpus it was trained on. My guess would be the Penn Treebank POS tags.
We are not using enhanced dependencies (column 9).
It looks like the corpus that Parsey was trained on doesn't have morphological features explicitly defined, so column 6 will always be empty. If we used the UD corpus, this column would not be empty.
Column 10 is the MISC column, which is the catch-all for this format, and yes, we are putting everything extra there. This is a big mess, but it is a limitation of the format. I wish we had a better solution to this.
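
To make the content of the MISC column concrete, here is a minimal sketch of how it could be unpacked. It assumes that MISC always holds exactly three `|`-separated parts, as in the examples of this thread: the Freeling tag, the PWN synset id, and the SUMO concept followed by a mapping suffix (`=` or `+`), with `?` standing for a missing mapping. The names `MiscInfo` and `split_misc` are hypothetical and not part of any existing tool.

```python
from typing import NamedTuple, Optional


class MiscInfo(NamedTuple):
    freeling_tag: str
    pwn_synset: Optional[str]     # e.g. "01904930-v", None when "?"
    sumo_concept: Optional[str]   # e.g. "Walking", None when "?"
    sumo_relation: Optional[str]  # trailing "=" or "+", None when absent


def split_misc(misc: str) -> MiscInfo:
    """Split a MISC value such as 'VBG|01904930-v|Walking='.

    Assumption: exactly three '|'-separated parts, as in the examples
    of this thread; '?' marks a missing mapping.
    """
    tag, synset, sumo = misc.split("|")
    pwn = None if synset == "?" else synset
    if sumo == "?":
        return MiscInfo(tag, pwn, None, None)
    if sumo.endswith(("=", "+")):
        # Separate the SUMO concept name from its mapping suffix.
        return MiscInfo(tag, pwn, sumo[:-1], sumo[-1])
    return MiscInfo(tag, pwn, sumo, None)


print(split_misc("VBG|01904930-v|Walking="))
print(split_misc("PRP|?|?"))
```

Keeping the three parts in a named tuple makes it explicit which of them came from Freeling and which are missing, which is the case for "Someone" below.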

vcvpaiva commented 7 years ago

excellent, thanks a lot!

vcvpaiva commented 7 years ago

@fcbr many thanks! excellent.

I am trying to understand why in a simple example like:

# text = Someone is writing
1	Someone	someone	NOUN	NN	_	3	nsubj	_	PRP|?|?
2	is	be	VERB	VBZ	_	3	aux	_	VBZ|02604760-v|Entity+
3	writing	write	VERB	VBG	_	0	ROOT	_	VBG|00993014-v|WrittenCommunication=

"Someone", which should be a pronoun (PRP), gets tagged as a noun (NN and NOUN) and then doesn't get a mapping. If it were a noun it should get the mapping, since the noun is in PWN.

Finally, can you also shed some light on the need to 'disable' Freeling's compound module, and on what you know about the module itself?

arademaker commented 7 years ago

@vcvpaiva I agree that we would expect the mapping to http://wnpt.brlcloud.com/wn/synset?id=00007846-n . But Freeling tagged it as PRP (pronoun, see the MISC column), so it would not look it up as a noun in WN.Pr. Column 4 is the SyntaxNet tag.

About understanding the errors, we have already talked about this. The parser and Freeling's tagger are trained. In particular, the short sentences of SICK are not helping the parser, which was trained on a corpus of longer sentences of a quite different nature.

arademaker commented 7 years ago

@vcvpaiva documentation of the compound module at

https://talp-upc.gitbooks.io/freeling-user-manual/content/modules/dictionary.html

vcvpaiva commented 7 years ago

@fcbr

We are not using enhanced dependencies (column 9).

Why? Because we cannot, or because we prefer not to?

@arademaker just to make it abundantly clear,

About understanding the errors, we have already talked about this.

I do know how statistical systems work, no need to repeat it. The question here was whether the issue is caused by the disconnect between the POS tags of Parsey and Freeling, or just by Freeling assigning its tag for no apparent reason. It seems clear now that the mappings to PWN come straight from Freeling (which makes the discussion with @lluisp important, about having the full collection of PWN lemmas there, including things such as bored, animated, waterboard, jetski). But it is still not clear where/how to add the new mappings that one needs, like all the pronouns, the soon-to-be-created prepositional phrases, etc.

fcbr commented 7 years ago

Nothing stops us from using enhanced dependencies, but Parsey McParseface does not emit them, most likely because the corpus it was trained on doesn't contain them (neither does the UD corpus).

arademaker commented 7 years ago

@vcvpaiva From what I now understand from the paper:

S. Schuster and C. D. Manning, “Enhanced English Universal Dependencies: An Improved Representation for Natural Language Understanding Tasks,” pp. 1–8, Mar. 2016.

We could indeed try to reproduce this paper (if it contains all the rules or points to where they are) to produce enhanced dependencies from the basic dependencies. It seems they used rules in Semgrex: http://nlp.stanford.edu/software/tregex.shtml

I still have not understood the next question.

vcvpaiva commented 7 years ago

if it contains all the rules or points to where they are

Yes, it would be very good to be able to use their "enhanced dependencies". Unfortunately the paper only says that the rules are in CoreNLP, which is a monster of code... but the paper is very good!

Which is the next question? The one below?

but still not clear where/how to add new mappings that one needs, like all the pronouns, the soon to be created prepositional phrases, etc.

We know that we will have to remove pieces of the mappings and also that we will have to add things to the CoNLL representation, right? For example, for the sentence:

# text = People are walking
1	People	people	NOUN	NNS	_	3	nsubj	_	NNS|07942152-n|GroupOfPeople=
2	are	be	VERB	VBP	_	3	aux	_	VBP|02604760-v|Entity+
3	walking	walk	VERB	VBG	_	0	ROOT	_	VBG|01904930-v|Walking=

we want to "read out" of this dependency the following representation:

subconcept-of(people2, GroupOfPeople)
subconcept-of(walking1, Walking)
subj(walking1, people2)

and for the second sentence

# text = Someone is writing
1	Someone	someone	NOUN	NN	_	3	nsubj	_	PRP|?|?
2	is	be	VERB	VBZ	_	3	aux	_	VBZ|02604760-v|Entity+
3	writing	write	VERB	VBG	_	0	ROOT	_	VBG|00993014-v|WrittenCommunication=

we want to "read out":

subconcept-of(writing1, WrittenCommunication)
subconcept-of(someone2, Person)
subj(writing1, someone2)

Hence we need a mechanism to invent mappings for pronouns, since the Freeling mechanism only works for nouns, verbs, adjectives and adverbs. We will need this mechanism for all the prepositional phrases that we decide to use. And I proposed that we use at least those of the Enhanced Dependencies, repeated in #65.
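
The read-out described above could be sketched roughly as follows. This is only an illustration under several assumptions that are not settled in this thread: concept names are the lowercased word form plus a counter that starts at the root, `aux` tokens are skipped, `nsubj` is renamed to `subj`, and missing SUMO mappings come from a hand-written table. `EXTRA_MAPPINGS`, `RELATION_NAMES`, `SKIPPED_DEPRELS` and `read_out` are hypothetical names.

```python
# Hypothetical hand-written fallback for lemmas that Freeling does not map.
EXTRA_MAPPINGS = {"someone": "Person"}

# Relation names used in the read-out, keyed by dependency label (assumed).
RELATION_NAMES = {"nsubj": "subj"}

# Dependency labels whose tokens are left out of the read-out (assumed).
SKIPPED_DEPRELS = {"aux"}


def read_out(rows):
    """rows: one list of ten strings per token, in CoNLL-U column order."""
    kept = [r for r in rows if r[7] not in SKIPPED_DEPRELS]
    # Root first (HEAD == "0"), then the remaining tokens in sentence order.
    kept.sort(key=lambda r: (r[6] != "0", int(r[0])))
    names = {r[0]: f"{r[1].lower()}{i}" for i, r in enumerate(kept, start=1)}

    facts = []
    for r in kept:
        sumo = r[9].split("|")[-1]  # last MISC part, e.g. "Walking="
        concept = sumo.rstrip("=+") if sumo != "?" else EXTRA_MAPPINGS.get(r[2])
        if concept:
            facts.append(f"subconcept-of({names[r[0]]}, {concept})")
    for r in kept:
        relation = RELATION_NAMES.get(r[7])
        if relation and r[6] in names:
            facts.append(f"{relation}({names[r[6]]}, {names[r[0]]})")
    return facts


someone_is_writing = [
    ["1", "Someone", "someone", "NOUN", "NN", "_", "3", "nsubj", "_", "PRP|?|?"],
    ["2", "is", "be", "VERB", "VBZ", "_", "3", "aux", "_", "VBZ|02604760-v|Entity+"],
    ["3", "writing", "write", "VERB", "VBG", "_", "0", "ROOT", "_",
     "VBG|00993014-v|WrittenCommunication="],
]
for fact in read_out(someone_is_writing):
    print(fact)
```

The fallback table is the piece that would have to grow to cover pronouns and prepositional phrases. Where it should live (in Freeling or in a post-processing step) is exactly the open question.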

vcvpaiva commented 7 years ago

The validator spec is implicit in the rules for the representations (in http://universaldependencies.org/format.html) and in the error signals that the validator emits (https://github.com/UniversalDependencies/tools/issues/20#issuecomment-284810923).

So the CoNLL-U format rules say:

Sentences consist of one or more word lines (no empty sentences), and word lines contain the following fields:

ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes.
FORM: Word form or punctuation symbol.
LEMMA: Lemma or stem of word form.
UPOSTAG: Universal part-of-speech tag.
XPOSTAG: Language-specific part-of-speech tag; underscore if not available.
FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
HEAD: Head of the current word, which is either a value of ID or zero (0).
DEPREL: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
MISC: Any other annotation.
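
As an illustration of the field list above, here is a minimal sketch that maps each tab-separated word line to the ten field names. The function names `parse_word_line` and `parse_sentence` are hypothetical; this is not the code of cl-conllu or of the UD validator.

```python
FIELDS = ("ID", "FORM", "LEMMA", "UPOSTAG", "XPOSTAG",
          "FEATS", "HEAD", "DEPREL", "DEPS", "MISC")


def parse_word_line(line):
    """Map one tab-separated word line to the ten CoNLL-U field names."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != len(FIELDS):
        raise ValueError(
            f"The line has {len(cols)} columns, but {len(FIELDS)} are expected."
        )
    return dict(zip(FIELDS, cols))


def parse_sentence(block):
    """Return (comments, rows) for the lines of a single sentence."""
    comments, rows = [], []
    for line in block.splitlines():
        if line.startswith("#"):
            comments.append(line)
        elif line.strip():
            rows.append(parse_word_line(line))
    return comments, rows


example = "\n".join([
    "# text = People are walking",
    "\t".join(["1", "People", "people", "NOUN", "NNS", "_", "3", "nsubj", "_",
               "NNS|07942152-n|GroupOfPeople="]),
    "\t".join(["2", "are", "be", "VERB", "VBP", "_", "3", "aux", "_",
               "VBP|02604760-v|Entity+"]),
    "\t".join(["3", "walking", "walk", "VERB", "VBG", "_", "0", "ROOT", "_",
               "VBG|01904930-v|Walking="]),
])
comments, rows = parse_sentence(example)
print(comments[0])
print(rows[2]["FORM"], rows[2]["HEAD"], rows[2]["DEPREL"])
```

The rows are the "matrix" from the first comment of this issue: one row per token, ten named columns each.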

First the validator checks whether there are spurious empty lines (non-spurious empty lines separate the representations of sentences) or spurious comment lines (# text), and it counts the number of columns in each word line, which needs to be ten.

The first 6 error codes in @martinp are about this:

"Spurious empty line.",u"Format"
"Spurious comment line.",u"Format"
"The line has %d columns, but %d (10) are expected."%(len(cols),COLCOUNT),u"Format"
"Spurious line: '\%s'. All non-empty lines should start with a digit or the # character."%(line),u"Format"
"Missing empty line after the last tree.",u"Format" (tree?)
"Spurious sent_id line: '%s' Should look like '# sent_id = xxxxxx' where xxxx is not whitespace. Forward slash reserved for special purposes." %c,u"Metadata"
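
The line-level checks behind the first five messages can be approximated as below. This is a simplified sketch written from the error messages alone, not the code of validator.py; the function name `format_errors` and the exact conditions are assumptions.

```python
COLCOUNT = 10


def format_errors(text):
    """Collect (line_number, message) pairs for basic format problems."""
    errors = []
    in_sentence = False  # True once a word line of the current sentence was seen
    for number, line in enumerate(text.splitlines(), start=1):
        if not line:
            if in_sentence:
                in_sentence = False  # a legitimate sentence separator
            else:
                errors.append((number, "Spurious empty line."))
        elif line.startswith("#"):
            # Comments are only expected before the word lines of a sentence.
            if in_sentence:
                errors.append((number, "Spurious comment line."))
        elif line[0].isdigit():
            cols = line.split("\t")
            if len(cols) != COLCOUNT:
                errors.append((number,
                               f"The line has {len(cols)} columns, "
                               f"but {COLCOUNT} are expected."))
            in_sentence = True
        else:
            errors.append((number, f"Spurious line: '{line}'."))
    if in_sentence:
        errors.append((None, "Missing empty line after the last tree."))
    return errors


good = "# text = Hi\n" + "\t".join(
    ["1", "Hi", "hi", "INTJ", "UH", "_", "0", "root", "_", "_"]) + "\n\n"
print(format_errors(good))
print(format_errors(good.rstrip("\n")))
```

A "tree" here is simply the block of word lines of one sentence, which answers the "(tree?)" above: every such block must be followed by an empty line, including the last one.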

vcvpaiva commented 7 years ago

the next 9 error codes are about metadata and I am not worrying about it.

fcbr commented 7 years ago

please notice that validator.py checks not only the CoNLL-U format, but also the UD tagset.

vcvpaiva commented 7 years ago

Well, it takes as parameters at least the UD tagset, the set of UD dependency relations (DEPREL) and the set of morphological features (whatever this may be, depending on the language), if I am reading the other error codes correctly, e.g.

"Morphological features must be sorted: '%s'"%feats,u"Morpho"
"Spurious morphological feature: '%s'. Should be of the form attribute=value and must start with [A-Z0-9] and only contain [A-Za-z0-9]."%f,u"Morpho"
"Repeated feature values are disallowed: %s"%feats,u"Morpho"
"If an attribute has multiple values, these must be sorted as well: '%s'"%f,u"Morpho"
"Incorrect value '%s' in '%s'. Must start with [A-Z0-9] and only contain [A-Za-z0-9]."%(v,f),u"Morpho"
"Unknown attribute-value pair %s=%s"%(attr,v),u"Morpho"
"Repeated features are disallowed: %s"%feats, u"Morpho"
"Unknown UPOS tag: %s"%cols[UPOSTAG],u"Morpho"
"Unknown XPOS tag: %s"%cols[XPOSTAG],u"Morpho"
"Unknown UD DEPREL: %s"%cols[DEPREL],u"Syntax"
"Malformed head:deprel pair '%s'"%head_deprel,u"Syntax"
"Unknown dependency relation '%s' in '%s'"%(deprel,head_deprel),u"Syntax"
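
The FEATS-related messages suggest checks along the following lines. This is a hedged sketch reconstructed from the messages quoted above, not the validator's own code: the function name `feats_errors` is hypothetical, and sorting is assumed to be case-insensitive alphabetical order. The character classes are taken literally from the messages.

```python
import re

# "must start with [A-Z0-9] and only contain [A-Za-z0-9]"
NAME_RE = re.compile(r"[A-Z0-9][A-Za-z0-9]*")


def feats_errors(feats):
    """Return a list of problems found in one FEATS value."""
    if feats == "_":
        return []
    errors = []
    pairs = feats.split("|")
    if pairs != sorted(pairs, key=str.lower):
        errors.append(f"Morphological features must be sorted: '{feats}'")
    seen = set()
    for pair in pairs:
        attribute, separator, values = pair.partition("=")
        if not separator or not NAME_RE.fullmatch(attribute):
            errors.append(f"Spurious morphological feature: '{pair}'.")
            continue
        if attribute in seen:
            errors.append(f"Repeated features are disallowed: {feats}")
        seen.add(attribute)
        value_list = values.split(",")
        if len(value_list) != len(set(value_list)):
            errors.append(f"Repeated feature values are disallowed: {feats}")
        if value_list != sorted(value_list, key=str.lower):
            errors.append(f"Multiple values must be sorted as well: '{pair}'")
        for value in value_list:
            if not NAME_RE.fullmatch(value):
                errors.append(f"Incorrect value '{value}' in '{pair}'.")
    return errors


print(feats_errors("Number=Plur|Tense=Pres"))
print(feats_errors("Tense=Pres|Number=Plur"))
```

Checks such as "Unknown attribute-value pair" or "Unknown UPOS tag" additionally need the tagset and feature inventory that the validator receives as parameters, so they are left out of this sketch.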

I had stopped analyzing the other error codes (at number 15), but I meant to continue checking them later tonight.

fcbr commented 7 years ago

Absolutely, good reminder. I was thinking of the default behavior.

vcvpaiva commented 7 years ago

What I wanted to know, though, was whether it would check more semantic things about the meaning of the dependencies, like saying that a ROOT can only be a main verb, or an adjective or a noun in case the sentence is copular, or that if a dependency is dobj it should relate a verb to a noun phrase (perhaps? I don't know).

but I believe that there are more things that it could be saying...

vcvpaiva commented 7 years ago

@fcbr I am very interested in understanding how they check

"Words do not form a sequence. Got: %s."%(u",".join(unicode(x) for x in words)),u"Format"

because most of our issues at the moment in the Bosque have to do with multiword tokens, and in SICK-SANE they have to do with creating appropriate multiword tokens.
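
One plausible reading of this check, given the ID rules of the format (integers starting at 1, ranges for multiword tokens, decimals for empty nodes), is sketched below. It is an assumption about the intent of the message, not the validator's implementation. The name `sequence_errors` and the wording of the range message are made up for the example.

```python
import re

WORD_ID_RE = re.compile(r"[1-9][0-9]*")
RANGE_ID_RE = re.compile(r"([1-9][0-9]*)-([1-9][0-9]*)")


def sequence_errors(ids):
    """Check the ID column (a list of strings) of one sentence."""
    # Plain integer IDs are the syntactic words; ranges such as "1-2" and
    # decimal IDs such as "2.1" are not counted as words.
    words = [int(i) for i in ids if WORD_ID_RE.fullmatch(i)]
    errors = []
    if words != list(range(1, len(words) + 1)):
        errors.append("Words do not form a sequence. Got: %s."
                      % ",".join(str(w) for w in words))
    for i in ids:
        match = RANGE_ID_RE.fullmatch(i)
        if match:
            start, end = int(match.group(1)), int(match.group(2))
            # Assumed rule: a range must cover existing word IDs, in order.
            if not (start <= end <= len(words)):
                errors.append(f"Multiword token range {i} does not match the words.")
    return errors


print(sequence_errors(["1-2", "1", "2", "3"]))
print(sequence_errors(["1", "3"]))
```

Under this reading, a multiword token such as Portuguese "do" would appear as a range line `1-2` followed by the word lines `1` ("de") and `2` ("o"), and only the word lines take part in the sequence check.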

vcvpaiva commented 7 years ago

@arademaker as you can see, the representation

subconcept-of(people2, GroupOfPeople)
subconcept-of(walking1, Walking)
subj(walking1, people2)

is very similar to TIL (over SUMO), instead of TIL over Unified Lexicon, as we have with XLE.

The only thing missing is the context "true", which is not necessary in this simple case.

But the representation is also very close to OWL, as we always have triples and we have pushed the difficult bits about quantifiers somewhere else, just like Enhanced Dependencies do.

vcvpaiva commented 7 years ago

@fcbr have you tried to download and run UDPipe (https://github.com/ufal/udpipe)?

Did you see the numbers in http://ufal.mff.cuni.cz/udpipe/users-manual#universal_dependencies_12_models ? They say they can do better in PT than in EN, which is crazy.

vcvpaiva commented 7 years ago

Closing this issue, as the representation is described in detail in https://github.com/own-pt/cl-conllu

own-pt / rte-sick

what's the spec of a conll representation? #72