Closed GoogleCodeExporter closed 9 years ago
Do you have any idea what this format is officially called? I know "Penn
Treebank" format only as the bracketed structure.
Original comment by richard.eckart
on 1 Aug 2014 at 11:34
>> - corpora contain noun phrase annotations (in addition to the tags), is
there a type to annotate noun phrases in DKPro?
one possibility is to use the type Constituent and set constituentType to
nounPhrase
Judith
Original comment by eckle.kohler
on 1 Aug 2014 at 11:40
@noun phrases: I'd suggest using "Chunk" annotations
@multiple POS tags: I'd take only the first one
@no-tag: there is no 'no tag' value, but I think there is a . You could simply
not have a POS annotation for those tags. It might cause problems with
downstream components that expect all tokens to have a POS. You could
consider running a POS tagger that accepts partially pre-tagged input to fill in
the missing tags. In principle, TreeTagger could do that, but I believe the DKPro Core
TreeTagger component does handle partially pre-tagged documents.
Original comment by richard.eckart
on 1 Aug 2014 at 11:45
@noun phrases: I'd suggest using "Chunk" annotations
why?
chunks and noun phrases are not the same;
for some users (e.g. me) this might be confusing
Original comment by eckle.kohler
on 1 Aug 2014 at 11:56
I don't know the name of the format, sorry. I don't think it has a name; it's
a plain-text annotation with forward-slash-separated token/tag pairs. The NP
marking in brackets was, I assume, the reason why they added so many line
breaks: to make the text file easier to process.
Another new case: they also annotated whether a word was misspelled, and if so,
they added the tag it should have had if it were written correctly, as in this
example:
the/DT students/^NNS^POS parents/NNS
The missing ' caused the NNS, but it should have been students', and thus POS
as the tag.
Original comment by Tobias.H...@gmail.com
on 1 Aug 2014 at 12:12
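A minimal sketch of how the `^NNS^POS` notation from the example above could be split. It assumes, based only on that one example, that the first tag is the one the misspelled form received and the second is the tag the correctly spelled form would have had. The class and method names are hypothetical and not part of DKPro Core.

```java
/** Sketch only: hypothetical helper, not DKPro Core API. */
final class CorrectedTagSketch {

    /**
     * Splits a raw tag into {tagAsWritten, tagIfSpelledCorrectly}.
     * Ordinary tags such as "NNS" yield the same value twice.
     */
    static String[] splitTag(String rawTag) {
        if (rawTag.startsWith("^")) {
            // "^NNS^POS" -> "NNS^POS" -> ["NNS", "POS"]
            String[] parts = rawTag.substring(1).split("\\^");
            if (parts.length == 2) {
                return parts;
            }
        }
        // Assumption: anything that does not match the pattern is a plain tag.
        return new String[] { rawTag, rawTag };
    }

    public static void main(String[] args) {
        String[] tags = splitTag("^NNS^POS");
        System.out.println("as written: " + tags[0] + ", intended: " + tags[1]);
    }
}
```

Which of the two tags a reader should store on the token is a design decision that the thread leaves open.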
>> @noun phrases: I'd suggest using "Chunk" annotations
> why? chunks and noun phrases are not the same; for some users (e.g. me) this
might be confusing
Constituents are modeled as a hierarchy in DKPro Core (they have
parent/children references).
Chunks are modeled flat in DKPro Core (they have no such references).
Original comment by richard.eckart
on 1 Aug 2014 at 12:18
>>Constituents are modeled in a hierarchy in DKPro Core (they have a
parent/children references).
>>Chunks are modelled flat in DKPro Core (they have no such references).
ok - but what is actually annotated in the PTB: chunks or noun phrases?
if the "noun phrase" annotation is mapped to DKPro chunks, then information
about the hierarchical structure of noun phrases is lost
Original comment by eckle.kohler
on 1 Aug 2014 at 12:31
"Originally, each of the texts was run through PARTS (Ken Church's
stochastic part-of-speech tagger) or Eric Brill's tagger and then corrected
by a human annotator. The square brackets surrounding phrases in the texts
are the output of a stochastic NP parser that is part of PARTS and are best
ignored."
This is what it looks like in the files:
==================================
[ Local/JJ industry/NN 's/POS investment/NN ]
in/IN
[ Rhode/NNP Island/NNP ]
was/VBD
[ the/DT big/JJ story/NN ]
in/IN
[ 1960/CD 's/POS industrial/JJ development/NN effort/NN ]
./.
==================================
Original comment by Tobias.H...@gmail.com
on 1 Aug 2014 at 12:34
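A minimal sketch of how one line of the format shown above could be tokenized, assuming that items are whitespace-separated, that `[` and `]` always stand alone, and that the tag follows the last `/` of an item. All names are hypothetical. This is not the PennTreebankChunkedReader code, and it uses plain Java rather than UIMA types.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: hypothetical names, not DKPro Core API. */
final class ChunkedLineSketch {

    /** A token, its tag, and whether it appeared inside [ ... ]. */
    static final class TaggedToken {
        final String text;
        final String tag;
        final boolean inChunk;

        TaggedToken(String text, String tag, boolean inChunk) {
            this.text = text;
            this.tag = tag;
            this.inChunk = inChunk;
        }
    }

    static List<TaggedToken> parseLine(String line) {
        List<TaggedToken> result = new ArrayList<>();
        boolean inChunk = false;
        for (String item : line.trim().split("\\s+")) {
            if (item.isEmpty()) {
                continue; // blank line
            }
            if (item.equals("[")) {
                inChunk = true;
                continue;
            }
            if (item.equals("]")) {
                inChunk = false;
                continue;
            }
            // Use the last slash so that "./." becomes token "." with tag ".".
            int slash = item.lastIndexOf('/');
            if (slash <= 0) {
                continue; // not a token/tag pair, e.g. a ===== separator line
            }
            result.add(new TaggedToken(item.substring(0, slash),
                    item.substring(slash + 1), inChunk));
        }
        return result;
    }

    public static void main(String[] args) {
        for (TaggedToken t : parseLine("[ the/DT big/JJ story/NN ]")) {
            System.out.println(t.text + "\t" + t.tag + "\t" + t.inChunk);
        }
    }
}
```

A real reader would additionally have to track character offsets and carry the chunk state across lines if a bracket can span several of them, which this sketch does not attempt.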
in the example you give, Tobias, the things in square brackets are only chunks
- so Richard's suggestion (using Chunk) will be fine
an example of a noun phrase would be
[ Local/JJ industry/NN 's/POS investment/NN in/IN Rhode/NNP Island/NNP ]
Original comment by eckle.kohler
on 1 Aug 2014 at 12:39
Ok, thx for the feedback.
Original comment by Tobias.H...@gmail.com
on 1 Aug 2014 at 12:49
Where should I place the new file? Project:
de.tudarmstadt.ukp.dkpro.core.io.penntree-asl
In the same package as the parsing-related classes or open a new package?
Original comment by Tobias.H...@gmail.com
on 1 Aug 2014 at 2:13
Same package. Just don't call it PennTreebankReader ;) That would be the one
for the bracketed structure. Yours should have a different name.
Original comment by richard.eckart
on 1 Aug 2014 at 2:16
Hm.... feel free to make suggestions, seems like my most favored name is not
available :)
How about:
PTB[Chunked]TaggedCorpusReader
Original comment by Tobias.H...@gmail.com
on 1 Aug 2014 at 2:21
PennTreebankChunkedReader?
Original comment by richard.eckart
on 1 Aug 2014 at 2:22
I think I like your name better :)
Original comment by Tobias.H...@gmail.com
on 1 Aug 2014 at 2:25
ok, I committed.
Original comment by Tobias.H...@gmail.com
on 1 Aug 2014 at 2:40
Please merge all the test cases into one class called
PennTreebankChunkedReaderTest.
For the parameters, please use the standard parameters from
ComponentParameters, e.g.:
/**
* Location of the mapping file for part-of-speech tags to UIMA types.
*/
public static final String PARAM_POS_MAPPING_LOCATION = ComponentParameters.PARAM_POS_MAPPING_LOCATION;
@ConfigurationParameter(name = PARAM_POS_MAPPING_LOCATION, mandatory = false)
protected String posMappingLocation;
For the tests, please use the DKPro Core AssertAnnotations methods, cf.
OpenNlpParserTest and Conll2000ReaderTest.
No need to use PARAM_PATTERNS, you can merge that information into the
PARAM_SOURCE_LOCATION unless you have multiple include/exclude patterns.
For unit tests just write "throws Exception" instead of listing each exception
separately.
Original comment by richard.eckart
on 1 Aug 2014 at 6:53
This issue was updated by revision r2673.
- Formatting / cleaning up
Original comment by richard.eckart
on 2 Aug 2014 at 7:30
This issue was updated by revision r2674.
- Some formatting / cleaning up
Original comment by richard.eckart
on 2 Aug 2014 at 7:36
I updated the recent commit messages. Please check them out in the history to
see how they should be written such that they also update the issue with the
changes (see the two auto-generated comments above).
There are still various things to be fixed in the PennTreebankChunkedReader:
https://code.google.com/p/dkpro-core-asl/source/detail?r=2666
Original comment by richard.eckart
on 2 Aug 2014 at 7:37
Ehm, where/how do I see what has to be fixed?
Btw., Eclipse uses auto-format .xml files that define how code is formatted when
the Eclipse keyboard shortcut is used. You don't use the Eclipse default, do
you? Where do I get the DKPro version of these files?
Original comment by Tobias.H...@gmail.com
on 2 Aug 2014 at 7:47
Follow the link to revision 2666 in the previous comment and check out all the
Line-by-line comments. One of them includes a link to the Eclipse code style
file as well.
I'm using Eclipse. I format using the keyboard-shortcut, but often I format
only select parts of a file, not the whole file, because some lines I actually
don't like to be auto-formatted, e.g. when I align parameter/value pairs in
createEngineDescription(...) such that there is one pair per line.
Original comment by richard.eckart
on 2 Aug 2014 at 9:02
Hi Tobias,
direct link to the style xml here (from "Downloads"):
https://code.google.com/p/dkpro-core-asl/downloads/detail?name=DKProCoreStyle_20120326.xml&can=2&q=
Original comment by eriklan.dodinh@gmail.com
on 3 Aug 2014 at 8:23
This issue was updated by revision r2675.
- Fixed value of PARAM_TAGSET in test case.
Original comment by richard.eckart
on 3 Aug 2014 at 9:26
ok, I saw you updated files. Is there anything left to do?
Original comment by Tobias.H...@gmail.com
on 4 Aug 2014 at 6:58
Yes - I didn't address many of the comments that I made.
Original comment by richard.eckart
on 4 Aug 2014 at 7:37
Maybe I'm looking in the wrong place, but I see nothing. When I look at the code
in the browser, I notice that I can add comments, but I don't see any already
attached comments. Where do I have to look?
Original comment by Tobias.H...@gmail.com
on 4 Aug 2014 at 7:44
If you follow this link:
https://code.google.com/p/dkpro-core-asl/source/detail?r=2666
and you scroll down, you should see a section *Line-by-line comments*.
Original comment by richard.eckart
on 4 Aug 2014 at 7:46
hm, no. I see the section Line-by-line comments, but it says that no comments
have been added yet. Maybe it's a permission problem?
Original comment by Tobias.H...@gmail.com
on 4 Aug 2014 at 7:55
Stupid me... I haven't used the review tool often yet and forgot to actually
publish the review ;) Now you should be able to see them.
Original comment by richard.eckart
on 4 Aug 2014 at 8:04
Ok, I can see them now.
Is there no pre-implemented file-loading code in the other superclass? Do I
have to reimplement the entire file-loading code? What is the benefit of this
class, btw.? It only seems less convenient....
Original comment by Tobias.H...@gmail.com
on 4 Aug 2014 at 9:46
> is there no pre-implemented file-loading code in the other super-class? I do
have to reimplement the entire file loading code? What is the benefit of this
class btw. It seems only less convenient....
You mean in JCasResourceCollectionReader_ImplBase? It extends
ResourceCollectionReaderBase (which has the loading code) but it makes sure
that you get a JCas instead of a CAS in the getNext() method.
Original comment by richard.eckart
on 4 Aug 2014 at 9:48
Of course your class needs to override getNext(JCas aJCas) now instead of
getNext(CAS cas).
Original comment by richard.eckart
on 4 Aug 2014 at 9:49
Ah ok.
Does select(jcas, Token.class) also work if I inherit from
JCasResourceCollectionReader_ImplBase ? Seemingly not, the method call is
unknown. What's wrong with JCasUtil?
Original comment by Tobias.H...@gmail.com
on 4 Aug 2014 at 10:15
Sure, why shouldn't it work?
Original comment by richard.eckart
on 4 Aug 2014 at 10:22
Do you mean JCasUtil.select or a call to a method select which should have been
inherited? The latter doesn't work.
Original comment by Tobias.H...@gmail.com
on 4 Aug 2014 at 10:27
I mean calling JCasUtil.select. If you turn that into a static import, you can
just call it by "select", e.g.
import static org.apache.uima.fit.util.JCasUtil.select;
for (Sentence sentence : select(aJCas, Sentence.class)) {...
Original comment by richard.eckart
on 4 Aug 2014 at 10:28
Oh ok.
How do I set the mapped UIMA-class if I use JCas instead of CAS?
Original comment by Tobias.H...@gmail.com
on 4 Aug 2014 at 11:36
For this aspect only you get the CAS from the JCas and do it traditionally.
Check out e.g. BncReader.
Original comment by richard.eckart
on 4 Aug 2014 at 12:08
Hm, it's not working. It does not set the mapped UIMA value. Is there no method
you can call that configures this automatically? JCas is a bit easier to use,
but these exceptions nullify that benefit in an instant.
What is wrong with this code? It worked with the ResourceCollectionReader
superclass, but under JCasResourceCollectionReader_ImplBase it does not set the
mapped value either.
CAS aCAS = aJCas.getCas();
posMappingProvider.configure(aCAS);
// Token
Type tokenType = aCAS.getTypeSystem().getType(Token.class.getName());
AnnotationFS tokenAnno = aCAS.createAnnotation(tokenType, aCurrPosInText, aTokenText.length()
+ aCurrPosInText);
aCAS.addFsToIndexes(tokenAnno);
Feature feature = tokenType.getFeatureByBaseName("pos");
// Tag
Type posType = posMappingProvider.getTagType(aTag);
// aCAS.getTypeSystem().getT.getFeatureByBaseName("pos");
AnnotationFS posAnno = aCAS.createAnnotation(posType, aCurrPosInText, aTokenText.length());
posAnno.setStringValue(posType.getFeatureByBaseName("PosValue"), aTag);
aCAS.addFsToIndexes(posAnno);
// Set the POS for the Token
tokenAnno.setFeatureValue(feature, posAnno);
Original comment by Tobias.H...@gmail.com
on 4 Aug 2014 at 12:28
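A side note on the snippet above, offered as a sketch rather than a diagnosis: `CAS.createAnnotation(Type, int, int)` takes a begin offset and an end offset. The Token is created with the end `aTokenText.length() + aCurrPosInText`, while the POS annotation is created with the end `aTokenText.length()`, so the two spans only coincide for a token that starts at offset 0. The thread does not establish whether this is related to the problem described. The helper below is hypothetical and uses plain Java only to show the intended arithmetic.

```java
/** Sketch only: hypothetical helper illustrating begin/end offsets. */
final class OffsetSketch {

    /** Returns {begin, end} for a token that starts at the given offset. */
    static int[] span(int begin, String tokenText) {
        // The end offset is absolute in the document text, not the token length.
        return new int[] { begin, begin + tokenText.length() };
    }

    public static void main(String[] args) {
        int[] s = span(10, "story");
        System.out.println("begin=" + s[0] + ", end=" + s[1]);
    }
}
```

Computing the span once and reusing it for both annotations would keep the Token and its POS aligned.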
This issue was updated by revision r2679.
- Basic conversion to JCasResourceCollectionReader_ImplBase
Original comment by richard.eckart
on 4 Aug 2014 at 12:37
I have performed the basic conversion to JCasResourceCollectionReader_ImplBase.
Please check out the diffs.
Original comment by richard.eckart
on 4 Aug 2014 at 12:38
This issue was updated by revision r2680.
- Updated formatting
Original comment by Tobias.H...@gmail.com
on 4 Aug 2014 at 12:43
I still don't get what is wrong with my earlier posted snippet though.... It
seems to be pretty much the same?
Original comment by Tobias.H...@gmail.com
on 4 Aug 2014 at 12:47
Well, I'm not sure what exactly you say is not working and how you determine
that it is not working.
Original comment by richard.eckart
on 4 Aug 2014 at 1:09
Never mind.
Original comment by Tobias.H...@gmail.com
on 4 Aug 2014 at 2:41
Why did you undo the changes that I did to the file?
Original comment by richard.eckart
on 4 Aug 2014 at 2:53
I copied your 'Set the pos correctly' code snippet into my local working copy
and then copied my version over the DKPro one.
What was lost?
Original comment by Tobias.H...@gmail.com
on 4 Aug 2014 at 2:57
This issue was updated by revision r2681.
- Restoring my modifications
Original comment by richard.eckart
on 4 Aug 2014 at 3:01
Original issue reported on code.google.com by
Tobias.H...@gmail.com
on 1 Aug 2014 at 11:12