ixa-ehu / ixa-pipe-pos

IXA pipes Part of Speech tagger and Lemmatizer (http://ixa2.si.ehu.es/ixa-pipes)
Apache License 2.0

question: tagset #7

Closed (jgrivolla closed 7 years ago)

jgrivolla commented 7 years ago

Hi, for the non-UD Spanish models, are you using the standard Ancora (EAGLES) tagset, or are there any modifications?

ragerri commented 7 years ago

Hello, the Ancora corpus tagset as it is. No modifications.

R

jgrivolla commented 7 years ago

Thanks.

jgrivolla commented 7 years ago

At least with the older models from http://ixa2.si.ehu.es/ixa-pipes/models/pos-resources.tgz#es-pos-maxent-700-c0-b3.bin the tags are generated in upper case, whereas standard Ancora has lower case tags (see the discussion in https://github.com/dkpro/dkpro-core/pull/1071). Could you clarify?

ragerri commented 7 years ago

http://clic.ub.edu/corpus/en/ancora-descarregues

The Ancora es dep 2.0.0 tags are in upper case. I use the treebank (what you call "standard") only for parsing; for NER, POS tagging and lemmatization I train with the dep-2.0.0 corpus because it is easier to format. For those three tasks the annotations are equivalent, only the syntax is different.

R

jgrivolla commented 7 years ago

I don't have the treebank here right now so I can't check, but I see that in "AnCora: Multilevel Annotated Corpora for Catalan and Spanish" the examples contain both upper- and lower-case tags. Weird.

ragerri commented 7 years ago

Just uppercase everything or lowercase it. In the treebank the tags are lowercase; in the dep version they are uppercase. I do not think it matters much as long as the tagset is the same, which seems to be the case. Or do you have evidence that the tagsets in the dep and treebank versions are different? That could be interesting :)
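
A minimal sketch of the case normalization suggested above, in Python. It lowercases both tag inventories and reports any tag that appears in only one of them, which is one way to check whether the dep and treebank tagsets really match. The function names and the three sample tags are hypothetical and chosen only for illustration; they are not taken from ixa-pipe-pos or from the AnCora distribution.

```python
def normalize_tags(tags):
    """Lowercase every tag so that case differences are ignored."""
    return {tag.lower() for tag in tags}


def compare_tagsets(dep_tags, treebank_tags):
    """Return (tags only in dep, tags only in treebank) after normalization."""
    dep = normalize_tags(dep_tags)
    treebank = normalize_tags(treebank_tags)
    return sorted(dep - treebank), sorted(treebank - dep)


# Hypothetical sample tags, illustrative only (not read from either corpus).
dep_sample = ["NCMS000", "DA0MS0", "VMIP3S0"]
treebank_sample = ["ncms000", "da0ms0", "vmip3s0"]

only_in_dep, only_in_treebank = compare_tagsets(dep_sample, treebank_sample)
print("only in dep:", only_in_dep)
print("only in treebank:", only_in_treebank)
```

In practice the two input lists would be collected from the tags actually observed in each corpus release. Two empty result lists would indicate that the inventories differ only in case.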