Closed vcvpaiva closed 7 years ago
The CoNLL-U format is documented here.
Notes:
Column 5 is what is called a "language-specific POS". In practice, however, corpora usually put there the legacy POS tag that the corpus was originally annotated with. Since we are using Parsey McParseface, we need to figure out the legacy tagset of the corpus it was trained on. My guess would be the Penn Treebank POS tags.
We are not using enhanced dependencies (column 9).
Looks like the corpus that Parsey was trained on doesn't have morphological features explicitly defined, so column 6 will always be empty. If we used the UD corpus, this column would not be empty.
Column 10 is the MISC column, the catch-all for this format, and yes, we are putting everything extra there. This is a big mess, but it is a limitation of the format. I wish we had a better solution.
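As a rough illustration of the column notes above, a word line can be split into the ten CoNLL-U fields by position. The field names follow the CoNLL-U documentation; the helper name `parse_word_line` is made up for this sketch, and falling back to whitespace splitting is an assumption made only because the examples in this thread are shown space-separated (the format itself is tab-separated).

```python
from typing import Dict, List

# Field names in CoNLL-U column order (columns 1 to 10).
FIELDS: List[str] = [
    "ID", "FORM", "LEMMA", "UPOS", "XPOS",
    "FEATS", "HEAD", "DEPREL", "DEPS", "MISC",
]


def parse_word_line(line: str) -> Dict[str, str]:
    """Split one word line into a dict keyed by field name (hypothetical helper)."""
    line = line.rstrip("\n")
    # Assumption: use tabs when present, otherwise any whitespace.
    cols = line.split("\t") if "\t" in line else line.split()
    if len(cols) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
    return dict(zip(FIELDS, cols))


row = parse_word_line("2 is be VERB VBZ _ 3 aux _ VBZ|02604760-v|Entity+")
# Column 5 (XPOS) holds the legacy tag; columns 6 (FEATS) and 9 (DEPS) are "_".
print(row["XPOS"], row["FEATS"], row["DEPS"], row["MISC"])
```

With this layout, "column 6 is always empty" simply means `row["FEATS"] == "_"` for every word line.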
excellent, thanks a lot!
@fcbr many thanks! excellent.
I am trying to understand why in a simple example like:
# text = Someone is writing
1 Someone someone NOUN NN _ 3 nsubj _ PRP|?|?
2 is be VERB VBZ _ 3 aux _ VBZ|02604760-v|Entity+
3 writing write VERB VBG _ 0 ROOT _ VBG|00993014-v|WrittenCommunication=
"Someone", which should be a pronoun (PRP), gets tagged as a noun (NN and NOUN) and then doesn't get a mapping. If it were a noun it should get the mapping, as the noun is in PWN.
Finally, can you also shed some light on the need to 'disable' Freeling's compound module, and on what you know about the module itself?
@vcvpaiva I agree that we would expect the mapping to http://wnpt.brlcloud.com/wn/synset?id=00007846-n . But Freeling tagged it as PRP (pronoun, see the MISC column), so it would not look it up as a noun in WN.Pr. Column 4 is the SyntaxNet tag.
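To make the point about the MISC column concrete, the sketch below splits it into its three parts. The layout `tag|synset|concept` is inferred from the examples in this thread, not from a documented spec, and reading the trailing `=` or `+` as a marker on the concept mapping is likewise an assumption. `MiscInfo` and `parse_misc` are hypothetical names.

```python
from typing import NamedTuple, Optional


class MiscInfo(NamedTuple):
    freeling_tag: str           # e.g. "PRP", the tag assigned by Freeling
    synset: Optional[str]       # e.g. "02604760-v"; None when shown as "?"
    concept: Optional[str]      # e.g. "Entity"; None when shown as "?"
    relation: Optional[str]     # trailing "=" or "+"; None when absent


def parse_misc(misc: str) -> MiscInfo:
    """Split a MISC value such as 'VBZ|02604760-v|Entity+' (assumed layout)."""
    tag, synset, concept = misc.split("|")
    synset_value = None if synset == "?" else synset
    if concept == "?":
        return MiscInfo(tag, synset_value, None, None)
    # Peel off the trailing marker, if any, from the concept name.
    relation = concept[-1] if concept and concept[-1] in "=+" else None
    name = concept[:-1] if relation else concept
    return MiscInfo(tag, synset_value, name, relation)


print(parse_misc("PRP|?|?"))
print(parse_misc("VBZ|02604760-v|Entity+"))
```

For "Someone" this shows the mismatch directly: columns 4 and 5 say NOUN/NN, while the tag inside MISC is PRP and both the synset and the concept are missing.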
About understanding the errors, we have already talked about that. The parser and the Freeling tagger are trained. In particular, the short sentences of SICK are not helping the parser, which was trained on a corpus with longer sentences of a quite different nature.
@vcvpaiva documentation for the compound module is at
https://talp-upc.gitbooks.io/freeling-user-manual/content/modules/dictionary.html
@fcbr
We are not using enhanced dependencies (column 9).
Why? Because we cannot, or because we prefer not to?
@arademaker just to make it abundantly clear,
About understanding the errors, we have already talked about that.
I do know how statistical systems work, no need to repeat it. The question here was: is it the disconnect between the POS tags of Parsey and Freeling, or just Freeling's unexplained tag, that causes the issue? It seems clear now that mappings to PWN come straight from Freeling (which makes the discussion with @lluisp important, about having the full collection of PWN lemmas there, including stuff such as bored, animated, waterboard, jetski). But still not clear where/how to add new mappings that one needs, like all the pronouns, the soon to be created prepositional phrases, etc.
Nothing stops us from using enhanced dependencies, but Parsey McParseface does not emit them, most likely because the corpus it was trained on doesn't contain them (neither does the UD corpus).
@vcvpaiva From what I now understand from the paper:
S. Schuster and C. D. Manning, “Enhanced English Universal Dependencies: An Improved Representation for Natural Language Understanding Tasks,” pp. 1–8, Mar. 2016.
In fact we could try to reproduce this paper (if it contains all the rules or points to where they are) to produce enhanced dependencies from the basic dependencies. It seems they used rules in Semgrex: http://nlp.stanford.edu/software/tregex.shtml
I still have not understood the next question.
if it contains all the rules or points to where they are
Yes, it would be very good to be able to use their "enhanced dependencies". Unfortunately the paper only says that the rules are in CoreNLP, which is a monster of a codebase... but the paper is very good!
Which is the next question? The one below?
but still not clear where/how to add new mappings that one needs, like all the pronouns, the soon to be created prepositional phrases, etc.
We know that we will have to remove pieces of the mappings and also that we will have to add things to the CoNLL representation, right? For example, for the sentence:
# text = People are walking
1 People people NOUN NNS _ 3 nsubj _ NNS|07942152-n|GroupOfPeople=
2 are be VERB VBP _ 3 aux _ VBP|02604760-v|Entity+
3 walking walk VERB VBG _ 0 ROOT _ VBG|01904930-v|Walking=
we want to "read out" of this dependency the following representation:
subconcept-of (people2, GroupOfPeople)
subconcept-of (walking1, Walking)
subj(walking1,people2)
and for the second sentence
# text = Someone is writing
1 Someone someone NOUN NN _ 3 nsubj _ PRP|?|?
2 is be VERB VBZ _ 3 aux _ VBZ|02604760-v|Entity+
3 writing write VERB VBG _ 0 ROOT _ VBG|00993014-v|WrittenCommunication=
we want to "read out":
subconcept-of (writing1 , WrittenCommunication)
subconcept-of (someone2 ,Person)
subj(writing1, someone2)
So we need a mechanism to invent mappings for pronouns, since the Freeling mechanism only works for nouns, verbs, adjectives, and adverbs. We will need this mechanism for all the prepositional phrases we decide to use. And I proposed that we use at least those from the Enhanced Dependencies, repeated in #65.
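The read-out described above could be sketched roughly as follows. This is only an illustration under assumptions: the `PRONOUN_CONCEPTS` table is a hypothetical stand-in for the mapping mechanism that still has to be designed (its single entry comes from the "someone" example), the MISC layout `tag|synset|concept` is inferred from the examples in this thread, auxiliaries are skipped because they do not appear in the hand-written representations, and terms are numbered by CoNLL-U token id rather than by the numbering used in those representations.

```python
from typing import Dict, List

# Hypothetical fallback table for words that Freeling leaves unmapped.
PRONOUN_CONCEPTS: Dict[str, str] = {"someone": "Person"}


def read_out(rows: List[List[str]]) -> List[str]:
    """Turn ten-column rows into subconcept-of and subj facts (sketch)."""
    # Name each token as lowercased form + token id, e.g. "writing3".
    names = {r[0]: f"{r[1].lower()}{r[0]}" for r in rows}
    facts: List[str] = []
    for r in rows:
        tok_id, lemma, head, deprel, misc = r[0], r[2], r[6], r[7], r[9]
        if deprel == "aux":
            continue  # assumption: auxiliaries contribute no facts
        concept = misc.split("|")[2].rstrip("=+")
        if concept == "?":
            # No Freeling mapping: fall back to the hand-made table.
            concept = PRONOUN_CONCEPTS.get(lemma)
        if concept:
            facts.append(f"subconcept-of({names[tok_id]}, {concept})")
        if deprel == "nsubj":
            facts.append(f"subj({names[head]}, {names[tok_id]})")
    return facts


sentence = [
    "1 Someone someone NOUN NN _ 3 nsubj _ PRP|?|?".split(),
    "2 is be VERB VBZ _ 3 aux _ VBZ|02604760-v|Entity+".split(),
    "3 writing write VERB VBG _ 0 ROOT _ VBG|00993014-v|WrittenCommunication=".split(),
]
for fact in read_out(sentence):
    print(fact)
```

Keeping the extra mappings in a separate table, rather than patching the MISC column, leaves the Freeling output untouched and makes it easy to see which concepts were added by hand.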
The validator spec is implicit in the rules for the representation (at http://universaldependencies.org/format.html) and in the error signals the validator emits (https://github.com/UniversalDependencies/tools/issues/20#issuecomment-284810923).
So the CoNLL-U format rules say:
Sentences consist of one or more word lines (no empty sentences), and word lines contain the following fields:
First the validator checks whether there are spurious empty lines (non-spurious empty lines are between reps of sentences) or spurious comment lines (#text) and counts the number of columns for word-lines, which needs to be ten.
The first 6 error codes in @martinp are about this:
"Spurious empty line.",u"Format"
"Spurious comment line.",u"Format"
"The line has %d columns, but %d (10) are expected."%(len(cols),COLCOUNT),u"Format"
"Spurious line: '\%s'. All non-empty lines should start with a digit or the # character."%(line),u"Format"
"Missing empty line after the last tree.",u"Format" (tree?)
"Spurious sent_id line: '%s' Should look like '# sent_id = xxxxxx' where xxxx is not whitespace. Forward slash reserved for special purposes." %c,u"Metadata"
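The format-level checks described above can be approximated in a few lines. This is not the code of validator.py, only a sketch of the same idea under simplifying assumptions: columns are split on tabs, comment lines are accepted anywhere, and the hypothetical function `check_format` returns messages modelled on the ones quoted above.

```python
from typing import List

COLCOUNT = 10  # a word line must have exactly ten columns


def check_format(text: str) -> List[str]:
    """Return a list of format errors for a CoNLL-U string (sketch)."""
    errors: List[str] = []
    in_sentence = False  # True once a word line of the current sentence was seen
    for line in text.splitlines():
        if not line.strip():
            # An empty line is only legitimate right after a sentence.
            if not in_sentence:
                errors.append("Spurious empty line.")
            in_sentence = False
        elif line.startswith("#"):
            continue  # assumption: comments are not checked here
        elif line[0].isdigit():
            cols = line.split("\t")
            if len(cols) != COLCOUNT:
                errors.append(
                    "The line has %d columns, but %d are expected."
                    % (len(cols), COLCOUNT)
                )
            in_sentence = True
        else:
            errors.append("Spurious line: '%s'." % line)
    if in_sentence:
        errors.append("Missing empty line after the last tree.")
    return errors


good = "\t".join("1 People people NOUN NNS _ 0 ROOT _ _".split()) + "\n\n"
print(check_format(good))
print(check_format("1\tPeople\tpeople\n"))
```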
The next 9 error codes are about metadata and I am not worrying about them.
Please notice that validator.py checks not only the CoNLL-U format, but also the UD tagset.
Well, it takes as parameters the UD tagset, the set of UD dependency relations (DEPREL), and the set of morphological features (whatever this may be, depending on the language), at least if I am reading the other error codes correctly, e.g.
I had stopped analyzing the other error codes (at number 15), but I meant to continue checking them later tonight.
Absolutely, good reminder. I was thinking of the default behavior.
What I wanted to know, though, was whether it would also check more semantic things about the meaning of the dependencies, like saying that a ROOT can only be a main verb, or an adjective or a noun in case the sentence is copular, or that if a dependency is dobj it should relate a verb to a noun phrase (perhaps? I don't know).
but I believe that there are more things that it could be saying...
@fcbr I am very interested in understanding how they check
"Words do not form a sequence. Got: %s."%(u",".join(unicode(x) for x in words)),u"Format"
because most of our issues at the moment in the Bosque have to do with multiword tokens, and those in SICK-SANE have to do with creating appropriate multiword tokens.
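One plausible way such a sequence check could work is sketched below. This is an assumption, not taken from the source of validator.py: plain word ids must run 1, 2, ..., n, and each multiword token range `a-b` must sit immediately before word `a` and end at an existing word. The function name `ids_form_sequence` is hypothetical, and ids of any other shape are simply ignored.

```python
from typing import List


def ids_form_sequence(ids: List[str]) -> bool:
    """Check the ID column of one sentence, in file order (sketch)."""
    # Plain word ids must be exactly 1..n in order.
    words = [int(t) for t in ids if t.isdigit()]
    if words != list(range(1, len(words) + 1)):
        return False
    for pos, tok in enumerate(ids):
        if "-" in tok:
            start, end = (int(x) for x in tok.split("-"))
            following = ids[pos + 1] if pos + 1 < len(ids) else None
            # The range must cover at least two existing words and be
            # placed directly before its first word.
            if not (start < end <= len(words)) or following != str(start):
                return False
    return True


print(ids_form_sequence(["1", "2", "3"]))
print(ids_form_sequence(["1-2", "1", "2", "3"]))
print(ids_form_sequence(["1", "3"]))
```

A Portuguese contraction such as "do" (de + o) would be encoded as a range line followed by its two word lines, which is the `["1-2", "1", "2", ...]` shape accepted above.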
@arademaker as you can see, the representation
subconcept-of (people2, GroupOfPeople) subconcept-of (walking1, Walking) subj(walking1,people2)
is very similar to TIL (over SUMO), instead of TIL over Unified Lexicon, as we have with XLE.
The only thing missing is the context "true", which is not necessary in this simple case.
But the representation is also very close to OWL, as we always have triples and we pushed the difficult bits about quantifiers somewhere else, just like Enhanced Dependencies do.
@fcbr have you tried to download and run UDPipe https://github.com/ufal/udpipe?
Did you see the numbers in http://ufal.mff.cuni.cz/udpipe/users-manual#universal_dependencies_12_models ? They say they can do better in PT than in EN, which is crazy.
Closing this issue, as the representation is described in detail in https://github.com/own-pt/cl-conllu
A sentence that works in SICK-SANE is
# text = People are walking
1 People people NOUN NNS _ 3 nsubj _ NNS|07942152-n|GroupOfPeople=
2 are be VERB VBP _ 3 aux _ VBP|02604760-v|Entity+
3 walking walk VERB VBG _ 0 ROOT _ VBG|01904930-v|Walking=
Using this as a model, we could say that a CoNLL representation is a pair of a sentence (in the example, People are walking) and a matrix M. M always has as many rows as there are tokens/words in the sentence (only 3 in the example), and it always has ten columns.
is this correct?
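The description above, a sentence paired with an n-by-10 matrix, can be written down directly. A minimal sketch, assuming whitespace-separated columns as in the examples in this thread and a single `# text = ...` comment per sentence; `parse_block` is a hypothetical helper name.

```python
from typing import List, Tuple


def parse_block(block: str) -> Tuple[str, List[List[str]]]:
    """Return (sentence text, matrix of word-line columns) for one sentence."""
    text = ""
    matrix: List[List[str]] = []
    for line in block.strip().splitlines():
        if line.startswith("# text = "):
            text = line[len("# text = "):]
        elif not line.startswith("#"):
            matrix.append(line.split())
    return text, matrix


block = """# text = People are walking
1 People people NOUN NNS _ 3 nsubj _ NNS|07942152-n|GroupOfPeople=
2 are be VERB VBP _ 3 aux _ VBP|02604760-v|Entity+
3 walking walk VERB VBG _ 0 ROOT _ VBG|01904930-v|Walking=
"""
text, M = parse_block(block)
print(text, len(M), [len(row) for row in M])
```

For this example the number of rows matches the number of whitespace-separated words. That need not hold in general: punctuation gets its own row, and a multiword token adds a range row on top of the rows for its parts, so "one row per word line" is the safer way to state the invariant.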