Descriptors of Chinese text

GoogleCodeExporter commented 9 years ago

Hi,

I would like to process Chinese documents. They are plain .txt files without 
any tokenization or segmentation. Which reader and annotation descriptors 
should I choose?

I have tested the FileSystemCollectionReader from the UIMA example project, as 
well as ACETernReader, Eventi2014Reader, Tempeval2Reader and Tempeval3Reader 
from HeidelTime, together with the StanfordPosTagger and HeidelTime annotation 
descriptors, but none of these combinations gives the right result. In fact, 
the output contains no annotations at all.

However, when I paste the input text from the .txt file into the input box of 
the HeidelTime online demo, it works. How does the demo generate the right 
result? Do I need to write a new reader for my input myself?

My HeidelTime version is heideltime-kit 1.8,
and my system is Ubuntu 14.04.1.

By the way, my input is a piece of Chinese news whose creation time is January 
29, 2014. The input is as follows:
        中新网1月29日电 综合马来西亚、新加坡等媒体消息，马来西亚民航局总监阿兹哈鲁丁于29日下午6时，通过国营电视台TV1针对MH370事件最新进展作出汇报。
　　阿扎鲁丁说，在经过长时间“分析和推测”后，大马民航局今天正式确认马航MH370飞机“失事”，并推定机上所有239名乘客和机组人员已遇难。飞机目前位于印度洋南部偏远海底。
　　此前，大马方面原定于1月29日下午3时30分就马航MH370失联一事做出说明，但发布会先是因“技术问题”被延迟，随后又因“意外情况”取消。

The expected output is the same as that of the HeidelTime online demo, which 
annotates "1月29日", "今天", "目前" and "下午3时30分".

But the actual output with heideltime-kit 1.8 is identical to the input, 
without any annotation.

Thank you very much!

Best, 

Lin

Original issue reported on code.google.com by eriney...@gmail.com on 24 Apr 2015 at 4:02

GoogleCodeExporter commented 9 years ago

Hi Lin,

HeidelTime needs Sentence and Token annotations to produce Timex3 annotations. 
These are not (and probably should not be) generated by any reader we currently 
distribute, so in addition to a UIMA reader, HeidelTime and a consumer, you 
will also need to add a preprocessing annotator to your workflow that does this 
for you.

Since you're under Linux, I recommend that you take a look at the 
TreeTaggerWrapper which performs this task given a properly set up TreeTagger 
installation. We've outlined the steps you need to perform in order to get one 
in our readme file:
https://code.google.com/p/heideltime/source/browse/doc/readme.txt#170

Alternatively, you can take a look at setting up the StanfordPOSTaggerWrapper; 
this too is covered by the readme with respect to Arabic processing. All you'll 
need to do differently is substitute your preferred Chinese model file.

If you're not sure yet about which UIMA consumer to use, I think the 
TempEval3Writer is a good choice as it outputs TimeML-compatible XML code.

When everything is set up, processing your text should yield results identical 
to the ones you obtain from our online demo (TreeTagger), or at least similar 
ones.

If neither of these approaches works out, please let me know about the specific 
errors that occur.

Kind regards,
Julian

Original comment by z...@informatik.uni-heidelberg.de on 24 Apr 2015 at 4:19

GoogleCodeExporter commented 9 years ago

Brief addendum for completeness. Your workflow should look like this:

Reader: Filesystem reader (Apache UIMA -- will read any file in the specified 
directory)
Annotator 1: TreeTaggerWrapper or StanfordPOSTaggerWrapper (HeidelTime)
Annotator 2: HeidelTime
Annotator 3: ...
Consumer: TempEval3Writer (or similar; anything that writes our Timex3 
annotations)
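
To illustrate why the preprocessing annotator has to run before HeidelTime, 
here is a toy Python sketch of the stages listed above. It is not UIMA or 
HeidelTime code: the function names (`read_document`, `add_sentences`, 
`tag_timexes`) are hypothetical, the regex is only a stand-in for HeidelTime's 
rules, and the sketch assumes the timex step walks over sentence annotations.

```python
import re

# Toy stand-in for HeidelTime's rules: matches dates such as "1月29日".
DATE_PATTERN = re.compile(r"\d{1,2}月\d{1,2}日")

def read_document(text):
    """Reader stage: wrap the raw text in a dict that collects annotations."""
    return {"text": text, "sentences": [], "timexes": []}

def add_sentences(doc):
    """Preprocessing stage (hypothetical): split on the Chinese full stop.

    A real workflow would use TreeTaggerWrapper here instead.
    """
    for m in re.finditer(r"[^。]+。?", doc["text"]):
        doc["sentences"].append((m.start(), m.end()))
    return doc

def tag_timexes(doc):
    """Timex stage: works sentence by sentence, so no sentences, no output."""
    for start, end in doc["sentences"]:
        for m in DATE_PATTERN.finditer(doc["text"][start:end]):
            doc["timexes"].append(m.group())
    return doc

text = "中新网1月29日电。"
without_preprocessing = tag_timexes(read_document(text))
with_preprocessing = tag_timexes(add_sentences(read_document(text)))

print(without_preprocessing["timexes"])
print(with_preprocessing["timexes"])
```

In this toy model, skipping the preprocessing stage leaves the list of timexes 
empty, which mirrors the "no annotation at all" symptom of a workflow that 
contains only a reader and HeidelTime.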

Original comment by z...@informatik.uni-heidelberg.de on 24 Apr 2015 at 4:21

GoogleCodeExporter commented 9 years ago

Hi Julian,

Thank you very much!

I have got it working with the TreeTagger annotator. The TreeTagger version
caused the problem: I was using the latest one, but after replacing it with
the suggested version, it generates the same output as the demo.

But I am not sure the *StanfordPOSTagger* works correctly. With the same
Chinese input, the configuration with StanfordPOSTagger annotates only one
temporal expression, and after a few tests I found that *it only annotates
temporal expressions in a particular position*: those that appear at the
beginning of a paragraph and are followed by a comma. Maybe the
StanfordPOSTagger doesn't work in the workflow. In addition, if I change
the input language from Chinese to English, it works well.

My workflow with StanfordPOSTagger is as follows:

Reader: FileSystemCollectionReader.xml
    Encoding: GBK
    Language: chinese
Annotator: StanfordPOSTaggerWrapper.xml (version 3.3.1) and HeidelTime.xml
    Model: pathto/stanford-postagger-full-2014-01-04/models/chinese-distsim.tagger
    Annotate_tokens: checked
    Annotate_sentences: checked
    Annotate_partofspeech: checked
Writer: Tempeval3Writer.xml

Best,

Lin

Original comment by eriney...@gmail.com on 28 Apr 2015 at 7:22

GoogleCodeExporter commented 9 years ago

Hi Lin,

Thank you for your feedback. You're right, our wrapper doesn't output good 
tagging because - from what I can tell - StanfordPOSTagger always uses (or at 
least only supplies) the PTB tokenizer, which is pretty good for Latin 
alphabets but pretty terrible for Chinese. This results in tokens that are 
basically as wide as entire sentences, mostly because there is almost no 
whitespace in Chinese text. This in turn leads boundary checks inside 
HeidelTime to discard any match that doesn't cover at least one complete token.
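
The effect can be reproduced with a small Python sketch. It assumes a plain 
whitespace tokenizer as a rough stand-in for the PTB tokenizer (which is more 
elaborate) and a hypothetical `aligned` helper in place of HeidelTime's actual 
boundary checks; the English sentence is made up for comparison.

```python
import re

def whitespace_tokens(text):
    """Return (start, end) offsets of whitespace-separated tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def aligned(span, tokens):
    """True if the span starts at a token start and ends at a token end."""
    starts = {s for s, _ in tokens}
    ends = {e for _, e in tokens}
    return span[0] in starts and span[1] in ends

english = "The briefing was held on January 29 in Kuala Lumpur."
chinese = "大马方面原定于1月29日下午3时30分就马航MH370失联一事做出说明"

# Character offsets of one temporal expression in each text.
en_span = re.search(r"January 29", english).span()
zh_span = re.search(r"1月29日", chinese).span()

en_tokens = whitespace_tokens(english)
zh_tokens = whitespace_tokens(chinese)

print(len(en_tokens), aligned(en_span, en_tokens))
print(len(zh_tokens), aligned(zh_span, zh_tokens))
```

The English date lines up with token boundaries, whereas the Chinese text 
forms a single token, so the match in the middle of it is rejected. Under 
these assumptions, only an expression that happens to coincide with a whole 
token would survive, which would be consistent with the behaviour Lin 
observed.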

This is unfortunate. I guess I can stop recommending StanfordPOSTaggerWrapper 
as an alternative to people for Chinese processing.

Glad you got the TreeTaggerWrapper working!

Regards,
Julian

Original comment by z...@informatik.uni-heidelberg.de on 28 Apr 2015 at 8:16

Changed state: Done

microth / heideltime

Descriptors of Chinese text #28