dateparser is a smart, high-performance date parser library. It supports hundreds of different formats, nearly every format we are likely to use. It is also a showcase for the "retree" algorithm.
MIT License
96 stars, 24 forks
Unneeded patterns/rules influence the result of the parsing #29
This might be an issue with retree rather than with dateparser itself, though.
The following test (which you cannot execute via the public API) fails:
    @Test
    public void parserWithLimitedPatterns() {
        List<String> rules = Arrays.asList(
            "(?<year>\\d{4})\\W{1}(?<month>\\d{1,2})\\W{1}(?<day>\\d{1,2})[^\\d]?",
            "\\W*(?:at )?(?<hour>\\d{1,2}):(?<minute>\\d{1,2})(?::(?<second>\\d{1,2}))?(?:[.,](?<ns>\\d{1,9}))?(?<zero>z)?",
            " ?(?<zoneOffset>[-+]\\d{1,2}:?(?:\\d{2})?)"
        );
        DateParser dateParser = new DateParser(rules, new HashSet<>(rules), Collections.emptyMap(), true, false);
        String input = "2022-08-09 19:04:31.600000+00:00";
        Date date = dateParser.parseDate(input);
        // `parser` is not declared in this snippet; presumably the test class's default parser.
        assertEquals(parser.parseDate(input), date);
    }
Note how those 3 rules should be sufficient to parse the date:
- There is a rule for the year-month-day part
- There is a rule for the hours:minutes:seconds.ns part
- There is a rule for the zone offset part
However, during parsing the zoneOffset rule is never used. Instead, the parser applies the hour rule twice.
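A possible explanation can be sketched with plain `java.util.regex`, without involving retree at all. The class `RuleOverlapCheck` and the helper `matchEnd` below are hypothetical names used only for illustration, and the sketch says nothing about how retree actually chooses between rules. It only checks two things: that the three rules can consume the input one after another, and that the hour rule also matches the trailing `+00:00`, because its leading `\W*` can swallow the `+`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RuleOverlapCheck {
    // The three rules from the failing test, copied verbatim.
    static final String DATE_RULE =
        "(?<year>\\d{4})\\W{1}(?<month>\\d{1,2})\\W{1}(?<day>\\d{1,2})[^\\d]?";
    static final String TIME_RULE =
        "\\W*(?:at )?(?<hour>\\d{1,2}):(?<minute>\\d{1,2})(?::(?<second>\\d{1,2}))?(?:[.,](?<ns>\\d{1,9}))?(?<zero>z)?";
    static final String ZONE_RULE =
        " ?(?<zoneOffset>[-+]\\d{1,2}:?(?:\\d{2})?)";

    // Hypothetical helper: end offset of a match of `rule` that starts
    // exactly at `from`, or -1 if the rule does not match there.
    static int matchEnd(String rule, String input, int from) {
        Matcher m = Pattern.compile(rule).matcher(input);
        m.region(from, input.length());
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String input = "2022-08-09 19:04:31.600000+00:00";

        // Apply the rules one after another, each starting where the previous one ended.
        int afterDate = matchEnd(DATE_RULE, input, 0);
        int afterTime = matchEnd(TIME_RULE, input, afterDate);
        int afterZone = matchEnd(ZONE_RULE, input, afterTime);
        System.out.println("date rule ends at " + afterDate);
        System.out.println("time rule ends at " + afterTime);
        System.out.println("zone rule ends at " + afterZone + " of " + input.length());

        // The hour rule applied to the remaining "+00:00" matches as well.
        int timeOnOffset = matchEnd(TIME_RULE, input, afterTime);
        System.out.println("time rule on the offset ends at " + timeOnOffset);
    }
}
```

At offset 26 both the zoneOffset rule and the hour rule match the same six characters, so a matcher that has to pick one of them has no length-based reason to prefer the zoneOffset rule. Whether this tie is what retree actually runs into is an assumption, but it would be consistent with the hour rule being used twice.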
The weird thing is that when I add a rule that should not be used (`" ?(?<year>\d{4})$"`), the test suddenly succeeds:
    @Test
    public void parserWithLimitedPatterns() {
        List<String> rules = Arrays.asList(
            "(?<year>\\d{4})\\W{1}(?<month>\\d{1,2})\\W{1}(?<day>\\d{1,2})[^\\d]?",
            " ?(?<year>\\\\d{4})$",
            "\\W*(?:at )?(?<hour>\\d{1,2}):(?<minute>\\d{1,2})(?::(?<second>\\d{1,2}))?(?:[.,](?<ns>\\d{1,9}))?(?<zero>z)?",
            " ?(?<zoneOffset>[-+]\\d{1,2}:?(?:\\d{2})?)"
        );
        DateParser dateParser = new DateParser(rules, new HashSet<>(rules), Collections.emptyMap(), true, false);
        String input = "2022-08-09 19:04:31.600000+00:00";
        Date date = dateParser.parseDate(input);
        assertEquals(parser.parseDate(input), date);
    }
The position where I add that additional rule matters. For example, adding it at the end of the list instead of at index 1 makes the test fail again.
I ran into this issue while working on PR https://github.com/sisyphsu/dateparser/pull/28, where I try to reduce the number of rules used for parsing in order to improve performance.