microth / heideltime

Automatically exported from code.google.com/p/heideltime

Sentence splitting bug in de.unihd.dbs.uima.annotator.stanfordtagger.StanfordPOSTaggerWrapper #16

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Hi heideltime team,

I'm a Master's student at the University of Mannheim and am currently building a 
Temporal Information Extraction system using heideltime as a temporal tagger.
I encountered a bug in the StanfordPOSTaggerWrapper UIMA component.

What steps will reproduce the problem?
1. Check the attached file "Breaking_Sample.txt"; it's a plain text version of 
Apple's Wikipedia article.
2. Apply de.unihd.dbs.uima.annotator.stanfordtagger.StanfordPOSTaggerWrapper to 
it.
3. Check the JCas sentence annotations, i.e. the sentence texts you get 
when taking substrings of the document text at the annotations' "begin" and "end" indexes.

What is the expected output? What do you see instead?
Expected: Sentences as shown in "Output_MyStanfordPOSTaggerWrapper.txt"
Actual: Sentences as shown in "Output_StanfordPOSTaggerWrapper.txt"
The issue starts at sentence 117.

What version of the product are you using? On what operating system?
1.7
OS X

Please provide any additional information below.
The results of my analysis are as follows:
The weakness of the current implementation is that it calculates its own offset 
value and, in conjunction with that, relies on searching the document text with 
".indexOf(thisWord, offset)".

To fit my needs, I copied and reimplemented your component; the code can be found 
in "MyStanfordPOSTaggerWrapper.java".
From my perspective, this implementation is more robust because it reuses the offsets 
calculated by the Stanford Tokenizer.
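
To illustrate the difference between the two strategies, here is a minimal, self-contained sketch. It is not the code from "MyStanfordPOSTaggerWrapper.java" and it does not call the Stanford or UIMA APIs. The Token class, the sample text, and the assumption that the tokenizer emitted a normalized word form ("US" for "U.S.") are all hypothetical and only serve to show how a search with ".indexOf(thisWord, offset)" can drift once a token is not found verbatim at its true position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OffsetDriftDemo {

    /** Hypothetical token: the word form a tokenizer emitted plus the offsets it reported. */
    static final class Token {
        final String word;
        final int begin;
        final int end;

        Token(String word, int begin, int end) {
            this.word = word;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Hypothetical document text. */
    static final String TEXT = "U.S. is big. US is short.";

    /** Assumed tokenizer output: the first token was normalized from "U.S." to "US". */
    static List<Token> sampleTokens() {
        return Arrays.asList(
            new Token("US", 0, 4), new Token("is", 5, 7), new Token("big", 8, 11),
            new Token(".", 11, 12), new Token("US", 13, 15), new Token("is", 16, 18),
            new Token("short", 19, 24), new Token(".", 24, 25));
    }

    /** Strategy A: recompute each begin offset by searching the text from a running offset. */
    static List<Integer> beginsByIndexOf(String text, List<Token> tokens) {
        List<Integer> begins = new ArrayList<>();
        int offset = 0;
        for (Token t : tokens) {
            int begin = text.indexOf(t.word, offset);
            begins.add(begin);
            if (begin >= 0) {
                // Advance past the match; a wrong match moves the offset too far ahead.
                offset = begin + t.word.length();
            }
        }
        return begins;
    }

    /** Strategy B: reuse the offsets reported by the tokenizer. */
    static List<Integer> beginsFromTokenizer(List<Token> tokens) {
        List<Integer> begins = new ArrayList<>();
        for (Token t : tokens) {
            begins.add(t.begin);
        }
        return begins;
    }

    public static void main(String[] args) {
        List<Token> tokens = sampleTokens();
        System.out.println("indexOf search:    " + beginsByIndexOf(TEXT, tokens));
        System.out.println("tokenizer offsets: " + beginsFromTokenizer(tokens));
    }
}
```

With this sample, the search-based variant places the first token at 13 instead of 0, and the following tokens either land in the second sentence or are not found at all (-1). The variant that reuses the tokenizer offsets keeps every token at its reported position, so sentence boundaries built from the tokens stay aligned with the text.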

If you have further questions, please do not hesitate to contact me.

Best regards,
Norman

Original issue reported on code.google.com by norman.w...@gmail.com on 23 Jul 2014 at 2:31

Attachments: