microth / heideltime

Automatically exported from code.google.com/p/heideltime

"Charset mismatch" when running the standalone version under Ubuntu #17

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
I have previously used HeidelTime's standalone version on Mac OS X and on a 
Lubuntu machine without any problems. 
Today I tried it on Ubuntu (running in VirtualBox) and can't get rid of this 
error:

usr@usr-VirtualBox:~/Downloads/Temporal_Annotation_Initial/Standalone$ java -jar de.unihd.dbs.heideltime.standalone.initial.jar /home/usr/Temporal_Annotation/lill_sample.txt -l german
Error: Unable to access jarfile de.unihd.dbs.heideltime.standalone.initial.jar
usr@usr-VirtualBox:~/Downloads/Temporal_Annotation_Initial/Standalone$ java -jar de.unihd.dbs.heideltime.standalone.jar /home/usr/Temporal_Annotation/lill_sample.txt -l german
java.lang.RuntimeException: Opps! Could not find token f�rbringen in JCas 
after tokenizing with TreeTagger. Hmm, there may exist a charset missmatch! 
Default encoding is UTF-8 and should always be UTF-8 (use 
-Dfile.encoding=UTF-8). If input document is not UTF-8 use -e option to set it 
according to the input, additionally.
    at de.unihd.dbs.uima.annotator.treetagger.TreeTaggerWrapper.tokenize(TreeTaggerWrapper.java:262)
    at de.unihd.dbs.uima.annotator.treetagger.TreeTaggerWrapper.process(TreeTaggerWrapper.java:221)
    at de.unihd.dbs.heideltime.standalone.components.impl.TreeTaggerWrapper.process(TreeTaggerWrapper.java:43)
    at de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone.establishPartOfSpeechInformation(HeidelTimeStandalone.java:388)
    at de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone.establishHeidelTimePreconditions(HeidelTimeStandalone.java:336)
    at de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone.process(HeidelTimeStandalone.java:481)
    at de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone.process(HeidelTimeStandalone.java:430)
    at de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone.main(HeidelTimeStandalone.java:728)
<?xml version="1.0"?>
<!DOCTYPE TimeML SYSTEM "TimeML.dtd">
<TimeML>

178. Frondienst der Wurmsbacher Lehenleute 
1605 <TIMEX3 tid="t4" type="DATE" value="XXXX-06-06">Juni 6</TIMEX3>. 
Rapperswil 
Uf fürbringen unnd clagen der frawen äbbtißin unnd convent des würdigen 
gozhuß Wurmbspach gegen und wider jre lehenlüthen zu Wagen, jm Buech und in 
der Auw: Das dieselbigen vermeinen, die wyl die fraw andere jres gozhuß 
güetter verlichen, also dz sy keiner acherlüthen nit mer mangelbar und aber 
so sy die behallten, jnnen die ächer zebuwen 12 tag schuldig und sonnsten nit.
</TimeML>

It annotates the first occurrence of a temporal expression and then stops. 
My input files are in UTF-8:

usr@usr-VirtualBox:~/Temporal_Annotation$ file lill_sample.txt 
lill_sample.txt: UTF-8 Unicode text, with very long lines

I also tried passing -Dfile.encoding=UTF-8, as the error message suggests, but 
it doesn't help at all.
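
A note on the encoding check: `file` reports a best guess based on the content it inspects, so a stricter test can be useful when chasing a charset problem. The sketch below is one possible way to do that, assuming `iconv` is installed; the function name `is_strict_utf8` is hypothetical and not part of HeidelTime or TreeTagger.

```shell
# Strictly validate that a file is well-formed UTF-8.
# iconv decodes the whole input and exits non-zero at the first
# byte sequence that is not valid UTF-8.
is_strict_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Hypothetical usage with the sample file from this report:
# is_strict_utf8 lill_sample.txt && echo "valid UTF-8" || echo "invalid bytes found"
```

If this check passes, the input file itself is unlikely to be the culprit, which points towards the encoding used when data is handed between Java and TreeTagger.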

Original issue reported on code.google.com by natak...@gmail.com on 28 Jul 2014 at 2:27

GoogleCodeExporter commented 9 years ago
Hey, and thanks for the issue report.

I've run your command with your sample on my end (Ubuntu 14.04) and it seems to 
work fine; "12 tag" later in that line is recognized correctly and no error is 
spat out.
As the problem occurs in the TreeTaggerWrapper component: Could you check 
whether your TreeTagger and parameter files are intact, e.g. by re-downloading 
them? The URLs are here: 
http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/

If the problem still occurs, please run something akin to:

cat lill_sample.txt | ./tree-tagger-german > out.txt

from the treetagger/cmd folder and attach the output file here. Thanks!

Original comment by z...@informatik.uni-heidelberg.de on 28 Jul 2014 at 5:36

GoogleCodeExporter commented 9 years ago
Please find the out.txt attached.
Thank you for trying to help.

Original comment by natak...@gmail.com on 28 Jul 2014 at 6:02

GoogleCodeExporter commented 9 years ago
Okay, that looks good. I'm not sure right now what could be causing this.
Could you please give me the outputs of:
1. locale
2. locale -a
Thanks.
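
The locale matters here because, on Linux and on Java versions before 18, the JVM takes its default `file.encoding` from the character set of the active locale. As a rough sketch of a quicker check than reading the full `locale` output, `locale charmap` prints just that character set; the helper name `is_utf8_charmap` is made up for illustration.

```shell
# Return success when a charmap name denotes UTF-8 (accepts "UTF-8", "utf8", ...).
is_utf8_charmap() {
    case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
        utf-8|utf8) return 0 ;;
        *) return 1 ;;
    esac
}

# `locale charmap` prints the character set of the active LC_CTYPE.
if is_utf8_charmap "$(locale charmap 2>/dev/null)"; then
    echo "active locale uses UTF-8"
else
    echo "active locale does NOT use UTF-8"
fi
```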

Original comment by z...@informatik.uni-heidelberg.de on 29 Jul 2014 at 12:37

GoogleCodeExporter commented 9 years ago
Here they are:

/cmd$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=de_CH.UTF-8
LC_TIME=de_CH.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=de_CH.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=de_CH.UTF-8
LC_NAME=de_CH.UTF-8
LC_ADDRESS=de_CH.UTF-8
LC_TELEPHONE=de_CH.UTF-8
LC_MEASUREMENT=de_CH.UTF-8
LC_IDENTIFICATION=de_CH.UTF-8
LC_ALL=

/cmd$ locale -a
C
C.UTF-8
de_CH.utf8
en_AG
en_AG.utf8
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IN
en_IN.utf8
en_NG
en_NG.utf8
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZM
en_ZM.utf8
en_ZW.utf8
POSIX

Original comment by natak...@gmail.com on 29 Jul 2014 at 5:29

GoogleCodeExporter commented 9 years ago
Okay, those look good, too.
Honestly I'm a bit stumped, because if the input file is UTF-8, the 
system and the Java VM support UTF-8, and the TreeTagger output is good (all of 
which appears to be the case), then this issue shouldn't occur.
Can you attach your actual sample file "lill_sample.txt" that fails to process?
And what are your Ubuntu and Java versions? (lsb_release -a && java -version)
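
To see which encoding the Java VM actually picks up, the launcher's `-XshowSettings:properties` option (available since Java 7) can help. The following is only a sketch: the filter function `filter_encoding_props` is a hypothetical helper, and it assumes the usual `key = value` layout of that option's output.

```shell
# Keep only the encoding-related system properties from the output of
# `java -XshowSettings:properties -version` (read from stdin).
filter_encoding_props() {
    grep -E '^[[:space:]]*(file\.encoding|sun\.jnu\.encoding) = '
}

# Hypothetical usage (the settings are printed on stderr, hence 2>&1):
# java -XshowSettings:properties -version 2>&1 | filter_encoding_props
# java -Dfile.encoding=UTF-8 -XshowSettings:properties -version 2>&1 | filter_encoding_props
```

Comparing the two runs shows whether `-Dfile.encoding=UTF-8` changes anything on the affected machine.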

Original comment by z...@informatik.uni-heidelberg.de on 29 Jul 2014 at 9:41

GoogleCodeExporter commented 9 years ago
cmd$ lsb_release -a && java -version
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.1 LTS
Release:    14.04
Codename:   trusty
java version "1.7.0_55"
OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)

Please find the input file attached.
Thank you.

Original comment by natak...@gmail.com on 29 Jul 2014 at 9:54

GoogleCodeExporter commented 9 years ago
Just to keep you updated, I've managed to reproduce this issue and will try to 
fix it. Unfortunately I can't give you a workaround as of now.
I'll update this issue as soon as I have something.

Original comment by z...@informatik.uni-heidelberg.de on 31 Jul 2014 at 10:01

GoogleCodeExporter commented 9 years ago
This seems to be good news! Thanks! Waiting for your next update.

Original comment by natak...@gmail.com on 31 Jul 2014 at 10:04