kulukimak / dkpro-core-asl

Automatically exported from code.google.com/p/dkpro-core-asl

Brown Tag Set is not properly loaded by POS Tagger #415

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hi,

I am trying to use the Brown mapping, which I downloaded from here:

https://code.google.com/p/dkpro-core-asl/source/browse/de.tudarmstadt.ukp.dkpro.core-asl/trunk/de.tudarmstadt.ukp.dkpro.core.api.lexmorph-asl/src/main/resources/de/tudarmstadt/ukp/dkpro/core/api/lexmorph/tagset/en-brown-pos.map?spec=svn2017&r=2017

My pipeline looks like the following:

        // Brown Corpus
        CollectionReaderDescription brownCorpus = CollectionReaderFactory.createReaderDescription(
                TeiReader.class,
                TeiReader.PARAM_LANGUAGE, "en",
                TeiReader.PARAM_SOURCE_LOCATION, "src\\test\\resources\\test\\en\\brown_tei",
                TeiReader.PARAM_POS_MAPPING_LOCATION, "src\\test\\resources\\test\\en\\en-brown-pos.map",
                TeiReader.PARAM_PATTERNS, new String[] {INCLUDE_PREFIX + "*.xml", INCLUDE_PREFIX + "*.xml.gz"}
        );

        //just start the reader and print text + tags
        SimplePipeline.runPipeline(
                brownCorpus,
                AnalysisEngineFactory.createEngineDescription(OpenNlpSegmenter.class),
                AnalysisEngineFactory.createEngineDescription(OpenNlpPosTagger.class,
                        OpenNlpPosTagger.PARAM_POS_MAPPING_LOCATION, "src\\test\\resources\\test\\en\\en-brown-pos.map"));

After running the pipeline, I print the original and the detected POS tags. 
The problem is that the detected POS tags seem to be totally incorrect: the 
tagger writes POS tags such as "PRP$", which exist only in the PTB tag set. 
Although the console says that the Brown mapping is loaded, it seems to me 
that the PTB mapping is loaded instead. The original POS tags are read 
correctly by the reader and printed correctly.

I tried with both DKPro Core 1.5.0 and the newest 1.6.1 and get the same 
result. 

Original issue reported on code.google.com by onurs3...@googlemail.com on 2 Jul 2014 at 11:20

GoogleCodeExporter commented 9 years ago
The mapping that you configure on POS tagger components cannot be used to map 
between two fine-grained tagsets (e.g. PTB -> Brown or vice versa). I'll 
briefly explain what this mapping is for and then suggest several alternatives.

The DKPro Core type system contains UIMA annotation types representing 
coarse-grained tags (very similar to the Universal POS tags 
https://code.google.com/p/universal-pos-tags/). The mapping that you can 
configure specifies how to map a specific fine-grained tagset used in a corpus 
or produced by a tagger to these coarse-grained tags. In your example, you 
configure OpenNlpPosTagger to assume that the model produces tags from the 
Brown tagset and to use the Brown mapping for the coarse-grained tags. However, 
the default OpenNlpPosTagger model for English produces PTB tags, and by 
default the component already uses the correct PTB -> coarse-grained mapping. 
Overriding the mapping does not change the tags the model emits; it only 
changes how those tags are mapped to the coarse-grained types.

For example, to select all verbs based on the coarse-grained UIMA types, you could use:

  for (POS p : select(jcas, V.class)) {
    System.out.println(p.getCoveredText() + " " + p.getClass().getSimpleName());
  }

To operate on the fine-grained tags, you would use something like:

  for (POS p : select(jcas, POS.class)) {
    if (p.getPosValue().startsWith("V")) {
      System.out.println(p.getCoveredText() + " " + p.getPosValue());
    }
  }
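
To make the distinction concrete, here is a minimal, self-contained sketch of what a fine-grained -> coarse-grained mapping does. It is not DKPro Core code: the class name, the shortened Brown table, the coarse names and the "*" fallback entry are all assumptions made for illustration. It shows that feeding PTB tags through a Brown table leaves the fine-grained tags untouched and merely sends unknown tags to the fallback category.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not part of DKPro Core.
class CoarseMappingSketch {
    // Assumed, heavily shortened Brown -> coarse-grained table.
    static final Map<String, String> BROWN_TO_COARSE = new LinkedHashMap<>();
    static {
        BROWN_TO_COARSE.put("PP$", "PR"); // Brown possessive pronoun
        BROWN_TO_COARSE.put("AT", "ART"); // Brown article
        BROWN_TO_COARSE.put("NN", "NN");  // tag shared by Brown and PTB
        BROWN_TO_COARSE.put("*", "O");    // assumed fallback for unknown tags
    }

    // Look up the coarse category; the fine-grained tag itself is never rewritten.
    static String coarse(String fineTag) {
        String c = BROWN_TO_COARSE.get(fineTag);
        return c != null ? c : BROWN_TO_COARSE.get("*");
    }

    public static void main(String[] args) {
        // Tags as a PTB-trained model would emit them.
        String[] ptbTagsFromModel = {"PRP$", "DT", "NN"};
        for (String tag : ptbTagsFromModel) {
            System.out.println(tag + " -> " + coarse(tag));
        }
        // prints:
        // PRP$ -> O
        // DT -> O
        // NN -> NN
    }
}
```

The fine-grained tags on the left are still PTB tags; only their coarse category degrades, which matches the symptom in the report.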

POS tagging in your example doesn't seem to be necessary at all, because the 
POS tags are read from the Brown corpus by the TeiReader.

If you wanted to apply some higher-level analysis, e.g. run MaltParser, then 
you would need to run a POS tagger, because the MaltParser models for English 
are trained on the PTB tagset. In that case, you would configure the TeiReader 
not to load the POS tags from the corpus. 

Something that might meet your needs is the PosMapper [1] component. PosMapper 
lets you rewrite the fine-grained POS tags, e.g. to map the PTB variant 
produced by TreeTagger to the standard PTB tagset. If there is a proper 
conceptual mapping between the Brown and PTB tagsets, then you could also use 
PosMapper to convert from Brown -> PTB. The mapping file format should look 
like this:

oldtag1=newtag1
oldtag2=newtag2
... and so on
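
As a rough illustration of how such a file behaves, the sketch below parses `oldtag=newtag` lines with `java.util.Properties` and rewrites a list of tags, leaving unmapped tags unchanged. This is not PosMapper's implementation; the class and method names are hypothetical, and the two Brown -> PTB pairs are examples only, not a vetted mapping.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch, not the PosMapper implementation.
class TagRewriteSketch {
    // Parse "oldtag=newtag" lines. Properties also treats ':' and whitespace
    // as separators, so tags containing those characters would need escaping.
    static Properties loadMapping(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    // Replace each tag that has a mapping entry; keep all others as they are.
    static List<String> rewrite(List<String> tags, Properties mapping) {
        List<String> out = new ArrayList<>();
        for (String tag : tags) {
            out.add(mapping.getProperty(tag, tag));
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative Brown -> PTB pairs only.
        Properties mapping = loadMapping("PP$=PRP$\nAT=DT\n");
        System.out.println(rewrite(Arrays.asList("PP$", "AT", "NN"), mapping));
        // prints: [PRP$, DT, NN]
    }
}
```

Passing unmapped tags through unchanged is a design choice of this sketch; a real conversion would need an entry, or an explicit decision, for every Brown tag.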

[1] http://dkpro-core-asl.googlecode.com/svn/de.tudarmstadt.ukp.dkpro.core-asl/tags/latest-release/apidocs/de/tudarmstadt/ukp/dkpro/core/posfilter/PosMapper.html

Original comment by richard.eckart on 3 Jul 2014 at 6:11

GoogleCodeExporter commented 9 years ago

Original comment by richard.eckart on 6 Aug 2014 at 8:24