Automatically exported from code.google.com/p/dkpro-core-asl

self-built TreeTagger does not assign POS Subtypes #402

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hi,

I'm using a self-built TreeTagger model as described on your homepage. However,
it only assigns the generic 'POS' type as annotations, not the subtypes
(N, NN, ...).

Yours,
Laura

Original issue reported on code.google.com by Steinert...@googlemail.com on 4 Jun 2014 at 9:38

GoogleCodeExporter commented 9 years ago
If you explicitly specify a model using PARAM_MODEL_LOCATION or 
PARAM_MODEL_PATH (depending on the DKPro Core version), you should specify a 
mapping file using PARAM_TAGGER_MAPPING_LOCATION.

Maybe there is already a suitable mapping file included with DKPro Core that
you could use. Are you using a standard tagset? If so, which one?

Original comment by richard.eckart on 4 Jun 2014 at 9:47

GoogleCodeExporter commented 9 years ago
I was unable to specify the model explicitly via a parameter (it's a project
which is called from another project...). Therefore, I copied all the built
TreeTagger files into my resources folder.

However, I don't think that I have a mapping file (neither included in DKPro
Core nor in the built TreeTagger files). Where could I get one, and how would I
have to specify it in my source code/project properties? I want to get the POS
tags.

Original comment by Steinert...@googlemail.com on 5 Jun 2014 at 1:13

GoogleCodeExporter commented 9 years ago
I need to better understand what you did and what you are trying to do.

If you just want to access the POS tags that the TreeTagger produces, that is
easy. To get the POS tag of a token, you can do this:

token.getPos().getPosValue()

If this is all you want, you can stop here.

A mapping is only required if you want to use the coarse-grained POS types that 
you could use in a statement like

JCasUtil.select(jcas, N.class)

Apparently you already managed to train a model and to instruct the DKPro Core
TreeTagger component to use it. To help you further, I would need to know how
you configured the TreeTagger component to use your model. Specifically, what
exactly are you referring to when you say that you used a self-built model as
described on our homepage? I would also need to know how you configure and
invoke the TreeTaggerPosLemmaTT4J component.
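
For illustration, here is a minimal plain-Java sketch of the distinction made
above. It does not use the DKPro Core or UIMA APIs: the nested classes Pos and
N and the helper select are hypothetical stand-ins for the POS annotation type,
its N subtype, and a type-based selection like JCasUtil.select. It only shows
why the fine-grained tag stays readable without a mapping, while a selection by
coarse-grained type finds nothing.

```java
import java.util.ArrayList;
import java.util.List;

class CoarseVsFine {
    // Hypothetical stand-in for the generic POS annotation type.
    static class Pos {
        final String posValue; // fine-grained tag as produced by the tagger
        Pos(String posValue) { this.posValue = posValue; }
    }

    // Hypothetical stand-in for the coarse-grained noun subtype.
    static class N extends Pos {
        N(String posValue) { super(posValue); }
    }

    // Returns the elements that are instances of the requested type.
    static <T extends Pos> List<T> select(List<Pos> all, Class<T> type) {
        List<T> result = new ArrayList<>();
        for (Pos p : all) {
            if (type.isInstance(p)) {
                result.add(type.cast(p));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Without a mapping, every tag is stored as a plain Pos.
        List<Pos> unmapped = List.of(new Pos("NN"), new Pos("VBZ"));
        // With a mapping, "NN" is created as the subtype N.
        List<Pos> mapped = List.of(new N("NN"), new Pos("VBZ"));

        System.out.println(unmapped.get(0).posValue);          // fine-grained tag is still available
        System.out.println(select(unmapped, N.class).size());  // no N annotations to select
        System.out.println(select(mapped, N.class).size());    // one N annotation
    }
}
```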

Original comment by richard.eckart on 5 Jun 2014 at 8:23

GoogleCodeExporter commented 9 years ago
Okay, I want to use a keyphrase extractor from DKPro Keyphrases, e.g. the 
PositionBaseline Extractor. These, however, need the POS tags to filter the 
tokens.
Here's an example code snippet:

// Restrict keyphrase candidates to noun tokens.
Candidate nounTokens = new Candidate(CandidateType.Token, PosType.N);
KeyphraseExtractor_ImplBase positionBaselineExtractor = new PositionBaselineExtractor();
positionBaselineExtractor.setCandidate(nounTokens);
AnalysisEngine extractor = positionBaselineExtractor.getKeyphraseEngine();

// Run the extractor on the input text and select the resulting keyphrases.
JCas jcas = extractor.newJCas();
jcas.setDocumentText(text);
extractor.process(jcas);
JCasUtil.select(jcas, Keyphrase.class);

This code uses the TreeTagger. In theory, it should only return nouns as
keyphrases; however, I receive all words of the input text regardless of POS
tag. Therefore, I checked which POS tags the jcas holds with:

for (POS pos : JCasUtil.select(jcas, POS.class)) {
    System.out.println(pos);
}

And this gives me only 'POS' as tags. Hence, it does not know any subtypes,
such as 'NN' or 'N'. If I change the createTagger method of the class
'KeyphraseExtractor_ImplBase' to use an OpenNlpPosTagger instead, the
PositionBaseline extractor works the way it should.

I built the TreeTagger as described here:
http://code.google.com/p/dkpro-core-asl/wiki/PackagingResources

Original comment by Steinert...@googlemail.com on 6 Jun 2014 at 8:13

GoogleCodeExporter commented 9 years ago
Info about DKPro Keyphrases

In KeyphraseExtractor_ImplBase the TreeTagger is invoked like this:

return createEngineDescription(
        TreeTaggerChunkerTT4J.class,
        TreeTaggerChunkerTT4J.PARAM_LANGUAGE, getLanguage());

So the model is not explicitly added, but loaded via the CAS language.

Original comment by torsten....@gmail.com on 6 Jun 2014 at 8:30

GoogleCodeExporter commented 9 years ago
You said that you are using a self-trained treetagger model. Is it correct that 
you extended the build.xml file to package your own model as a jar? (cf. [1])

[1] https://code.google.com/p/dkpro-core-asl/wiki/ResourceProviderAPI#Packaging_resources_as_JARs

Original comment by richard.eckart on 6 Jun 2014 at 10:09

GoogleCodeExporter commented 9 years ago
No, that is a misunderstanding. I used the build.xml as given by the project. I
do not use any self-trained model. All I did was package the TreeTagger with
the build.xml and copy it into the resource folder of my own project.

Original comment by Steinert...@googlemail.com on 10 Jun 2014 at 8:24

GoogleCodeExporter commented 9 years ago
The build.xml file creates various JARs in the folder called "target". Instead
of copying these JARs to your resources folder, add them to your classpath. In
Eclipse you can do this e.g. by right-clicking on them and selecting "Build
path -> Add to build path". Alternatively, they can be added via Maven.

If you are using DKPro Core 1.5.x, there should be one JAR per language. If you
add that, it should also give you the POS tag mapping.

If you are using DKPro Core 1.6.x, there should be two JARs per model, one 
"model" JAR and one "upstream" JAR. Make sure to have added both to the 
classpath in order to get the mapping.
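
To double-check that a file from one of these JARs is actually visible on the
classpath, a small plain-Java test like the following could be used. This is
only a sketch: the class name ClasspathCheck and the default resource path are
placeholders, so pass the path of a file that really exists inside the JAR you
added.

```java
class ClasspathCheck {
    // Returns true if the class loader of this class can find the named resource.
    static boolean isOnClasspath(String resourceName) {
        return ClasspathCheck.class.getClassLoader().getResource(resourceName) != null;
    }

    public static void main(String[] args) {
        // Placeholder path: replace it with a file that exists inside the added JAR.
        String resource = args.length > 0 ? args[0] : "path/inside/the/jar/some-file";
        System.out.println(resource + " on classpath: " + isOnClasspath(resource));
    }
}
```

If this prints false for a file that is contained in the JAR, the JAR is not on
the classpath that your code is run with.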

If this does not help:
There might not be a mapping for all languages and all models. Which language
are you processing?

Original comment by richard.eckart on 10 Jun 2014 at 9:05

GoogleCodeExporter commented 9 years ago
I have now added the treetagger-bin jar as well as the treetagger-model-en jar
to my build path. However, the problem remains. I am processing English texts.

Original comment by Steinert...@googlemail.com on 10 Jun 2014 at 1:18

GoogleCodeExporter commented 9 years ago
Same problem here. When I try to get the coarse-grained POS tags with
TreeTagger, I only get "POS" instead of "N" or "V". I am trying to print the
tags as follows:

for (Token tokenAnno : JCasUtil.select(jcas, Token.class)) {
    System.out.println(tokenAnno.getPos().getClass().getSimpleName());
}

The whole thing works for German, but does not work for English. It does not
even load the tagsets. The only way to get coarse-grained tags for English is
to map TreeTagger to a tagset (e.g. "en-pos.map").

Maybe you, Steinert, can test it for German? If this works, then maybe there
are problems with English texts and TreeTagger.

Original comment by onurs3...@googlemail.com on 10 Jun 2014 at 6:42

GoogleCodeExporter commented 9 years ago
I'll look into it. Can you please tell me which version of DKPro Core you are
using? Mind that all DKPro Core JARs in your project should have the same
version; you should not mix versions. Please also give the full JAR names
(including the version) of the model files that you are using. If you know it,
the URL/SVN revision of the build.xml files that you used would also be
helpful.
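
A quick way to spot mixed versions is to list the DKPro Core versions found on
the classpath. The following is a rough sketch, not part of DKPro Core: the
class name DkproVersionCheck is made up, and it assumes that the JARs follow
the Maven naming pattern "<artifactId>-<version>.jar" with artifact IDs like
de.tudarmstadt.ukp.dkpro.core.treetagger-asl.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DkproVersionCheck {
    // Assumption: DKPro Core JARs are named
    // "de.tudarmstadt.ukp.dkpro.core.<name>-asl-<version>.jar" (or "-gpl").
    // Model JARs use their own version scheme and are deliberately not matched.
    private static final Pattern DKPRO_CORE_JAR = Pattern.compile(
            "de\\.tudarmstadt\\.ukp\\.dkpro\\.core\\.[\\w.]+-(?:asl|gpl)-(\\d[\\w.-]*)\\.jar$");

    // Collects the distinct DKPro Core versions found in the given classpath entries.
    static Set<String> dkproCoreVersions(List<String> classpathEntries) {
        Set<String> versions = new TreeSet<>();
        for (String entry : classpathEntries) {
            Matcher m = DKPRO_CORE_JAR.matcher(entry);
            if (m.find()) {
                versions.add(m.group(1));
            }
        }
        return versions;
    }

    public static void main(String[] args) {
        String classpath = System.getProperty("java.class.path", "");
        Set<String> versions = dkproCoreVersions(
                Arrays.asList(classpath.split(Pattern.quote(File.pathSeparator))));
        if (versions.size() > 1) {
            System.out.println("Mixed DKPro Core versions on classpath: " + versions);
        } else {
            System.out.println("DKPro Core versions on classpath: " + versions);
        }
    }
}
```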

Original comment by richard.eckart on 10 Jun 2014 at 8:54

GoogleCodeExporter commented 9 years ago
Okay, here come the SVN revisions I'm using:

build.xml: 25
de.tudarmstadt.ukp.dkpro.core.treetagger-asl: 2281

The versions I use in my POM: de.tudarmstadt.ukp.dkpro.core.treetagger-asl: 
1.5.0 (although I checked the project out locally to build it, I still have the 
POM set to use a version via a repository).

The names of the jars:
treetagger-bin-20131118.0.jar
treetagger-model-en-20111109.1.jar

Original comment by Steinert...@googlemail.com on 11 Jun 2014 at 7:56

GoogleCodeExporter commented 9 years ago
Update: It's also working for me when using German texts.

Original comment by Steinert...@googlemail.com on 11 Jun 2014 at 8:18

GoogleCodeExporter commented 9 years ago
treetagger-model-en-20111109.1.jar is a model for DKPro Core 1.6.0 [1]. It
declares the tagset "ptb-tt", which is not known to DKPro Core 1.5.0. Hence,
DKPro Core 1.5.0 falls back to mapping every tag to the POS annotation type
(and storing the actual POS tag only in the posValue feature of the POS
annotation).
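
The fallback can be pictured with the following simplified plain-Java sketch.
It is not the actual DKPro Core implementation: the class TagsetFallback, the
registry KNOWN_TAGSETS and its entries are invented for illustration. The only
point is that an unknown tagset name leaves no mapping to apply, so every tag
ends up as the generic POS type.

```java
import java.util.Map;

class TagsetFallback {
    // Hypothetical registry: tagset name -> (fine-grained tag -> coarse type name).
    // The entries are illustrative and not copied from DKPro Core.
    static final Map<String, Map<String, String>> KNOWN_TAGSETS = Map.of(
            "ptb", Map.of("NN", "N", "NNS", "N", "VB", "V", "VBZ", "V"));

    // Returns the coarse type for a tag, or the generic "POS" when the
    // tagset or the tag has no mapping.
    static String coarseType(String tagset, String tag) {
        Map<String, String> mapping = KNOWN_TAGSETS.get(tagset);
        if (mapping == null) {
            return "POS"; // unknown tagset: every tag becomes POS
        }
        return mapping.getOrDefault(tag, "POS");
    }

    public static void main(String[] args) {
        System.out.println(coarseType("ptb", "NN"));    // prints "N"
        System.out.println(coarseType("ptb-tt", "NN")); // prints "POS"
    }
}
```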

You should use the build.xml file for DKPro Core 1.5.0 [2] which declares the 
"ptb" tagset. 

No matter what build.xml file you use, you might find that some models have
meanwhile been updated on the TreeTagger homepage and some md5 hashes may no 
longer match that build.xml file. Since TreeTagger models and binaries cannot 
be redistributed due to license restrictions, this should not cause problems. 
If you care about versioning and want to stay clear of potential version 
conflicts with future build.xml files, I would recommend you add some suffix to 
the version, e.g. "20111109.0-steinert".

TreeTagger model packaging will change in DKPro Core 1.7.0 and then follow the 
packaging conventions also used for other models/resources.

I assume this should resolve your problem. I am marking this issue as "invalid" 
because it does not require changes or further actions on our part. If your 
problem is not resolved or if you feel that further action on our part is
necessary, feel free to comment and reopen the issue.

[1] https://dkpro-core-asl.googlecode.com/svn/de.tudarmstadt.ukp.dkpro.core-asl/tags/de.tudarmstadt.ukp.dkpro.core-asl-1.6.0/de.tudarmstadt.ukp.dkpro.core.treetagger-asl/src/scripts/build.xml
[2] https://dkpro-core-asl.googlecode.com/svn/de.tudarmstadt.ukp.dkpro.core-asl/tags/de.tudarmstadt.ukp.dkpro.core-asl-1.5.0/de.tudarmstadt.ukp.dkpro.core.treetagger-asl/src/scripts/build.xml

Original comment by richard.eckart on 11 Jun 2014 at 7:28