Hi Lin,
HeidelTime needs Sentence and Token annotations to produce Timex3 annotations.
These are not (and probably should not be) generated by any reader we currently
distribute, so in addition to a UIMA reader, HeidelTime and a consumer, you
will also need to add a preprocessing annotator to your workflow that does this
for you.
Since you're on Linux, I recommend that you take a look at the
TreeTaggerWrapper, which performs this task given a properly set up TreeTagger
installation. We've outlined the steps needed to set one up in our readme file:
https://code.google.com/p/heideltime/source/browse/doc/readme.txt#170
Alternatively, you can set up the StanfordPOSTaggerWrapper; this too is covered
by the readme, in the context of Arabic processing. All you need to do
differently is substitute your preferred Chinese model file.
If you're not sure yet about which UIMA consumer to use, I think the
TempEval3Writer is a good choice as it outputs TimeML-compatible XML code.
When everything is set up, processing your text should yield results identical
to the ones you obtain from our online demo (TreeTagger), or at least similar
ones.
If neither of these approaches works out, please let me know about the specific
errors that occur.
Kind regards,
Julian
Original comment by z...@informatik.uni-heidelberg.de
on 24 Apr 2015 at 4:19
Brief addendum for completeness. Your workflow should look like this:
Reader: Filesystem reader (Apache UIMA -- will read any file in the specified
directory)
Annotator 1: TreeTaggerWrapper or StanfordPOSTaggerWrapper (HeidelTime)
Annotator 2: HeidelTime
Annotator 3: ...
Consumer: TempEval3Writer (or similar; anything that writes our Timex3
annotations)
Original comment by z...@informatik.uni-heidelberg.de
on 24 Apr 2015 at 4:21
Hi Julian,
Thank you very much!
I got the TreeTagger annotator working. The TreeTagger version was the
problem: I had used the latest one, and after replacing it with the suggested
version, it generates the same output as the demo.
But I am not sure the *StanfordPOSTagger* works correctly. With the same
Chinese input, the StanfordPOSTagger configuration annotates only one
temporal expression. After a few tests, I found that *it can only annotate
temporal expressions in a particular position*: those that appear at the
beginning of a paragraph and are followed by a comma. Maybe the
StanfordPOSTagger doesn't work in this workflow. However, if I change the
input language from Chinese to English, it works well.
My workflow with StanfordPOSTagger is as follows:
Reader: FileSystemCollectionReader.xml
Encoding: GBK
Language: chinese
Annotator: StanfordPOSTaggerWrapper.xml (version 3.3.1) and HeidelTime.xml
Model: pathto/stanford-postagger-full-2014-01-04/models/chinese-distsim.tagger
Annotate_tokens: checked
Annotate_sentences: checked
Annotate_partofspeech: checked
Writer: Tempeval3Writer.xml
Best,
Lin
Original comment by eriney...@gmail.com
on 28 Apr 2015 at 7:22
Hi Lin,
Thank you for your feedback. You're right, our wrapper doesn't output good
tagging because, from what I can tell, StanfordPOSTagger always uses (or at
least only supplies) the PTB tokenizer, which is pretty good for Latin
alphabets but pretty terrible for Chinese. This results in tokens that are
basically as wide as entire sentences, mostly because there is almost no
whitespace in Chinese texts. This in turn causes the boundary checks inside
HeidelTime to discard any match that doesn't span at least one complete token.
This is unfortunate. I guess I should stop recommending the
StanfordPOSTaggerWrapper as an alternative for Chinese processing.
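To illustrate the boundary problem described above, here is a minimal Java
sketch. It is not the actual code of HeidelTime or of the Stanford tagger: the
class TokenBoundaryDemo, its methods, and the whitespace-only tokenizer are
hypothetical stand-ins, assuming that tokens are split only at whitespace and
that a match is kept only if it starts and ends exactly on token boundaries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in HeidelTime.
final class TokenBoundaryDemo {

    /** Half-open character span [begin, end) of one token. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Naive stand-in for a whitespace-driven tokenizer (an assumption). */
    static List<Span> tokenizeOnWhitespace(String text) {
        List<Span> tokens = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i < text.length(); i++) {
            boolean ws = Character.isWhitespace(text.charAt(i));
            if (!ws && start < 0) {
                start = i; // a new token begins here
            } else if (ws && start >= 0) {
                tokens.add(new Span(start, i)); // token ends before whitespace
                start = -1;
            }
        }
        if (start >= 0) {
            tokens.add(new Span(start, text.length())); // trailing token
        }
        return tokens;
    }

    /** True if [begin, end) starts on a token start and ends on a token end. */
    static boolean coversWholeTokens(List<Span> tokens, int begin, int end) {
        boolean startsOnToken = false;
        boolean endsOnToken = false;
        for (Span t : tokens) {
            if (t.begin == begin) startsOnToken = true;
            if (t.end == end) endsOnToken = true;
        }
        return startsOnToken && endsOnToken;
    }

    /** Would the first occurrence of the expression pass the boundary check? */
    static boolean survives(String text, String expression) {
        int begin = text.indexOf(expression);
        if (begin < 0) return false;
        int end = begin + expression.length();
        return coversWholeTokens(tokenizeOnWhitespace(text), begin, end);
    }

    public static void main(String[] args) {
        // English: the date is its own whitespace-delimited token.
        System.out.println(survives("The meeting is on 2015-04-24 .", "2015-04-24"));
        // Chinese: the whole sentence is one token, so the date is inside it.
        System.out.println(survives("会议将于2015年4月24日举行。", "2015年4月24日"));
    }
}
```

Running main prints true for the English sentence and false for the Chinese
one: without whitespace the whole Chinese sentence becomes a single token, so
the date expression never lines up with a token boundary.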
Glad you got the TreeTaggerWrapper working!
Regards,
Julian
Original comment by z...@informatik.uni-heidelberg.de
on 28 Apr 2015 at 8:16
Original issue reported on code.google.com by
eriney...@gmail.com
on 24 Apr 2015 at 4:02