microth / heideltime

Automatically exported from code.google.com/p/heideltime

erroneous week calculation based on dctWeek/getxnextweek #1

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
Something is going wrong with the week values generated in 
HeidelTime.specifyAmbiguousValues(), likely originating from system- or 
locale-specific behavior within DateCalculator.getXNextWeek().

This results in week numbers that are decremented by one compared to their gold 
standard values, likely in relation to the relative value calculation based on 
dctWeek.

Original issue reported on code.google.com by jul...@gmail.com on 23 May 2012 at 3:57

GoogleCodeExporter commented 9 years ago
The problem appears to be the use of Java's SimpleDateFormat for the 
calculation of the DCT's week number. SimpleDateFormat uses the system's locale 
for that, and locales differ in how the week number is calculated.
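
To illustrate the locale dependence, here is a minimal sketch (not HeidelTime code; the class name LocaleWeekDemo and the helper weekOfYear are made up for this example). It formats the same date with the "w" (week in year) pattern under two locales. The sample date 2011-01-05 is an assumption chosen because 1 Jan 2011 fell on a Saturday, so the US and ISO-style week rules disagree for the whole following week.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

class LocaleWeekDemo {
    // Hypothetical helper: week-of-year of a date under the given locale's week rules.
    static int weekOfYear(Date date, Locale locale) {
        return Integer.parseInt(new SimpleDateFormat("w", locale).format(date));
    }

    public static void main(String[] args) {
        // Wednesday, 5 Jan 2011, at noon in the default time zone.
        Date date = new GregorianCalendar(2011, Calendar.JANUARY, 5, 12, 0).getTime();
        // en_US: weeks start on Sunday and week 1 only needs one day of the new year.
        System.out.println("en_US: " + weekOfYear(date, Locale.US));
        // en_GB: weeks start on Monday and week 1 needs at least four days.
        System.out.println("en_GB: " + weekOfYear(date, Locale.UK));
    }
}
```

With the locale data shipped in a standard JDK this should report week 2 for en_US and week 1 for en_GB, which is the kind of off-by-one described in this issue.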

According to the TimeML specification at 
http://www.timeml.org/site/publications/timeMLdocs/timeml_1.2.1.html, TimeML 
uses the ISO 8601 standard for the notation of all time values.

http://en.wikipedia.org/wiki/Seven-day_week#Week_numbering lists separate 
numbering systems and mentions that the US does not adhere to ISO 8601.

The solution should be to use a DateFormat that employs the ISO 8601 criterion 
for calculating the week number.
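
One way to get ISO 8601 week numbers regardless of the system locale is to configure the calendar's week rules explicitly (weeks start on Monday, and week 1 is the first week with at least four days in the new year). The following is a sketch under that assumption, not the fix that was committed; IsoWeekDemo and isoWeek are hypothetical names.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

class IsoWeekDemo {
    // Hypothetical helper: ISO 8601 week number for a calendar date (month is 1-based).
    static int isoWeek(int year, int month, int day) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"), Locale.ROOT);
        cal.clear();
        cal.setFirstDayOfWeek(Calendar.MONDAY);   // ISO weeks start on Monday
        cal.setMinimalDaysInFirstWeek(4);         // week 1 contains the year's first Thursday
        cal.set(year, month - 1, day);            // Calendar months are 0-based
        return cal.get(Calendar.WEEK_OF_YEAR);
    }

    public static void main(String[] args) {
        System.out.println(isoWeek(2012, 5, 23)); // date this issue was reported
        System.out.println(isoWeek(2011, 1, 5));
        System.out.println(isoWeek(2012, 1, 1));  // a Sunday, still in the last week of 2011
    }
}
```

Note that a date in early January can belong to the last week of the previous ISO year, so a week number alone is not always enough to identify the week.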

Original comment by jul...@gmail.com on 25 May 2012 at 11:49

GoogleCodeExporter commented 9 years ago
The solution will work like this:

We'll have HeidelTime default to a locale that adheres to the ISO 8601 
standard, but implement:
1. a command line switch for the standalone version and
2. an annotator setting for the UIMA version,
each facilitating the use of a different locale by supplying a string of the 
form "en_US", "en_GB" or "de_DE".
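
A locale string of this form could be turned into a java.util.Locale roughly as follows. This is only a sketch: LocaleStringDemo and parseLocale are hypothetical names, and the strict two-letter check is an assumption based on the format described above.

```java
import java.util.Locale;

class LocaleStringDemo {
    // Hypothetical helper: parses "ll_CC" (ISO 639 language, ISO 3166 country) into a Locale.
    static Locale parseLocale(String s) {
        String[] parts = s.split("_");
        if (parts.length != 2 || parts[0].length() != 2 || parts[1].length() != 2) {
            throw new IllegalArgumentException(
                    "expected a string like en_US, en_GB or de_DE, got: " + s);
        }
        // The two-argument constructor exists since Java 1.1 (deprecated in recent JDKs).
        return new Locale(parts[0], parts[1]);
    }

    public static void main(String[] args) {
        System.out.println(parseLocale("en_GB").getDisplayName(Locale.ENGLISH));
        System.out.println(parseLocale("de_DE").equals(Locale.GERMANY));
    }
}
```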

The first two characters' possible values are listed in ISO 639: 
http://www.ics.uci.edu/pub/ietf/http/related/iso639.txt

The latter two characters' possible values are listed in ISO 3166: 
http://userpage.chemie.fu-berlin.de/diverse/doc/ISO_3166.html

More information to follow at a later point.

Original comment by jul...@gmail.com on 25 May 2012 at 4:03

GoogleCodeExporter commented 9 years ago
rdaf916550c50 implements support for supplying a locale string of the above 
format using the heideltime annotator descriptor file.

The default locale has been set to Locale.UK, which is a hardcoded constant in 
the JavaSE-1.6 environment and should thus be accessible regardless of the 
underlying system.

Any locale string you supply needs to be an existing locale in the host system. 
If you supply one that is inaccessible to HeidelTime, it will quit and 
display those that are usable by Java.
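
The availability check described here might look roughly like the sketch below. It is not the code from the commit; AvailableLocaleDemo and isAvailable are hypothetical names, and the bogus locale xx_YY is only there to trigger the failure path.

```java
import java.util.Arrays;
import java.util.Locale;

class AvailableLocaleDemo {
    // Hypothetical helper: true if the Java runtime lists the locale as installed.
    static boolean isAvailable(Locale wanted) {
        return Arrays.asList(Locale.getAvailableLocales()).contains(wanted);
    }

    public static void main(String[] args) {
        Locale wanted = new Locale("xx", "YY"); // deliberately nonexistent
        if (!isAvailable(wanted)) {
            System.err.println("Locale " + wanted + " is not available. Usable locales:");
            for (Locale l : Locale.getAvailableLocales()) {
                System.err.println("  " + l);
            }
            return; // a real tool would stop processing at this point
        }
        System.out.println("Using locale " + wanted);
    }
}
```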

Original comment by jul...@gmail.com on 27 Jun 2012 at 12:21

GoogleCodeExporter commented 9 years ago
r31cd9e142f90 adds choosable locale support to the standalone version.

this should complete this issue.

Original comment by jul...@gmail.com on 27 Jun 2012 at 3:06

GoogleCodeExporter commented 9 years ago

Original comment by j.z...@stud.uni-heidelberg.de on 18 Apr 2013 at 11:30