qwaider / heideltime

Automatically exported from code.google.com/p/heideltime

Erroneous Date Recognized #8

GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. HeidelTimeStandalone hts_sci = new HeidelTimeStandalone(Language.ENGLISH, DocumentType.NARRATIVES, OutputType.TIMEML);
   String f = hts_sci.process("19-Nov-12", new Date(2012,01,05), new TimeMLResultFormatter());

It should pick out 19 November 2012. Instead, this is the result produced:

<?xml version="1.0"?>
<!DOCTYPE TimeML SYSTEM "TimeML.dtd">
<TimeML>
19-<TIMEX3 tid="t0" type="DATE" value="3912-11">Nov</TIMEX3>-12
</TimeML>
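
A possible explanation for the year 3912 in the value above (an observation 
about java.util.Date, not something confirmed in this report): the deprecated 
constructor Date(int year, int month, int date) takes the year minus 1900 and a 
zero-based month, so new Date(2012,01,05) denotes 5 February 3912. The sketch 
below shows this and, assuming the intended document date was 5 January 2012, 
one way to build that date instead. The class and method names are made up for 
illustration.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;

public class DateConstructorSketch {

    // The call from the report: the year is an offset from 1900 and the
    // month is zero-based (01 and 05 are octal literals for 1 and 5).
    @SuppressWarnings("deprecation")
    static Date asInReport() {
        return new Date(2012, 01, 05);
    }

    // Assumption: the intended document date was 5 January 2012.
    static Date presumablyIntended() {
        return new GregorianCalendar(2012, Calendar.JANUARY, 5).getTime();
    }

    // Reads one calendar field (e.g. Calendar.YEAR) in the default time zone.
    static int field(Date date, int calendarField) {
        Calendar calendar = new GregorianCalendar();
        calendar.setTime(date);
        return calendar.get(calendarField);
    }

    public static void main(String[] args) {
        System.out.println(field(asInReport(), Calendar.YEAR));          // 3912
        System.out.println(field(asInReport(), Calendar.MONTH));         // 1 (February)
        System.out.println(field(presumablyIntended(), Calendar.YEAR));  // 2012
        System.out.println(field(presumablyIntended(), Calendar.MONTH)); // 0 (January)
    }
}
```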

What is the work-around for this?

Original issue reported on code.google.com by shripha...@gmail.com on 4 Mar 2013 at 8:06

GoogleCodeExporter commented 9 years ago
Hi,
"19-Nov-12" is not a format that HeidelTime knows yet. However, it is quite 
easy to add a rule that identifies such expressions.
Go to: resources/english/rules/resources_rules_daterules.txt

And add the following rule:
RULENAME="date_r0g",EXTRACTION="%reDayNumber-%reMonthShort-%reYear2Digit",NORM_VALUE="UNDEF-centurygroup(3)-%normMonth(group(2))-%normDay(group(1))"
--> whether the century is normalized correctly depends on the context of the 
expression in the document

You may also want to include a rule for expressions such as "19-Nov-2012":
RULENAME="date_r0h",EXTRACTION="%reDayNumber-%reMonthShort-%reYear4Digit",NORM_VALUE="group(3)-%normMonth(group(2))-%normDay(group(1))"

Then, go to resources/ and run "sh printResourceInformation.sh"
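
To illustrate what the two rules are meant to match and how the matched groups 
map to a normalized value, here is a rough stand-alone sketch using plain 
java.util.regex. It is not HeidelTime's rule engine: the class name, the 
pattern, and the centuryPrefix parameter are assumptions for illustration, and 
the century of a two-digit year is simply passed in rather than derived from 
the document context.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DayMonthYearSketch {

    private static final List<String> MONTHS = Arrays.asList(
            "Jan", "Feb", "Mar", "Apr", "May", "Jun",
            "Jul", "Aug", "Sep", "Oct", "Nov", "Dec");

    // group(1) = day number, group(2) = short month name,
    // group(3) = four-digit or two-digit year
    private static final Pattern DATE = Pattern.compile(
            "\\b(\\d{1,2})-(" + String.join("|", MONTHS) + ")-(\\d{4}|\\d{2})\\b");

    // Returns YYYY-MM-DD values for every match. centuryPrefix (e.g. 20) is
    // an assumption supplied by the caller for two-digit years.
    static List<String> normalizeAll(String text, int centuryPrefix) {
        List<String> values = new ArrayList<>();
        Matcher m = DATE.matcher(text);
        while (m.find()) {
            int day = Integer.parseInt(m.group(1));
            int month = MONTHS.indexOf(m.group(2)) + 1;
            String yearText = m.group(3);
            int year = Integer.parseInt(yearText);
            if (yearText.length() == 2) {
                year += centuryPrefix * 100;
            }
            values.add(String.format(Locale.ROOT, "%04d-%02d-%02d", year, month, day));
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(normalizeAll("19-Nov-12", 20));   // [2012-11-19]
        System.out.println(normalizeAll("19-Nov-2012", 20)); // [2012-11-19]
    }
}
```

The sketch does not check that the day number is valid for the month; it only 
shows the day-month-year group order that both rules rely on.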

If you don't want to modify the rules, you can also wait until we include these 
rules in the resources. I actually don't see a reason not to include them.

Thanks for your feedback. If you have any further questions, please let me know.

Best regards,
Jannik

Original comment by jannik.s...@gmail.com on 4 Mar 2013 at 10:01

GoogleCodeExporter commented 9 years ago

Original comment by jannik.s...@gmail.com on 4 Mar 2013 at 10:06

GoogleCodeExporter commented 9 years ago
Hi. Adding rules by hand is not scalable for me. I am working with a corpus 
that is a few gigabytes in size, and dates are expressed in a ton of different 
formats. What would be helpful is to get the substring where HeidelTime thinks 
the date is. For example:

Last post on 26-Nov-12. Next post on 27-Nov-12.

Currently the TIMEX3 tags are placed around "Nov" only, which is not too 
helpful. If I could get "26-Nov-12" as the potential temporal expression, then 
I could do something about the rules (Mechanical Turk, for example).

Would this be possible?
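
One rough way to approximate this outside HeidelTime is to post-process the 
TimeML output and widen each TIMEX3 element to the whitespace-delimited token 
around it. The sketch below does that with plain java.util.regex; it is not a 
HeidelTime feature, and the class name, method name, and pattern are 
assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CandidateSpanSketch {

    // group(1) and group(3): characters that touch the TIMEX3 element and are
    // neither whitespace nor part of another tag; group(2): the tagged text.
    private static final Pattern TAGGED_TOKEN = Pattern.compile(
            "([^\\s<>]*)<TIMEX3\\b[^>]*>([^<]*)</TIMEX3>([^\\s<>]*)");

    // Returns, for each TIMEX3 element, the surrounding token without the
    // tags and without trailing sentence punctuation.
    static List<String> candidateSpans(String timeMl) {
        List<String> spans = new ArrayList<>();
        Matcher m = TAGGED_TOKEN.matcher(timeMl);
        while (m.find()) {
            String token = m.group(1) + m.group(2) + m.group(3);
            spans.add(token.replaceAll("[.,;:!?]+$", ""));
        }
        return spans;
    }

    public static void main(String[] args) {
        // The tagged line from the original report.
        String line = "19-<TIMEX3 tid=\"t0\" type=\"DATE\" value=\"3912-11\">Nov</TIMEX3>-12";
        System.out.println(candidateSpans(line)); // [19-Nov-12]
    }
}
```

Leading punctuation such as an opening parenthesis is kept as part of the 
token, so the result is only a candidate span for later review.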

Original comment by shripha...@gmail.com on 4 Mar 2013 at 10:01

GoogleCodeExporter commented 9 years ago
Hi,

As mentioned above, it is actually not a big deal to add a couple of rules. 
They make use of regular expressions and are thus quite general. 
The rules I mentioned are written out explicitly and are easy to extend. 
If you need more specific help, e.g., if you have identified a couple of other 
patterns that occur frequently, we can take this off the thread. 
Just send me an email with some more information.

Nevertheless, it would be possible to write rules that give you an identified 
temporal expression together with its surrounding context tokens, but this is 
not the way HeidelTime is supposed to work.

I can give you more details if you want. Just send me an email.

Best regards,
Jannik

Original comment by jannik.s...@gmail.com on 6 Mar 2013 at 8:05

GoogleCodeExporter commented 9 years ago
We've added the rule in question. Expressions such as the one from your 
original post will now be extracted and normalized correctly.

The commit containing this rule addition is r54fcd94430ad. The next release 
(HeidelTime 1.5) will contain this rule.

Thanks a lot again for bringing this to our attention!

Original comment by z...@informatik.uni-heidelberg.de on 3 Jul 2013 at 3:44

GoogleCodeExporter commented 9 years ago

Original comment by z...@informatik.uni-heidelberg.de on 18 Sep 2013 at 8:48