sisyphsu / dateparser

dateparser is a smart, high-performance date parser library. It supports hundreds of different formats, covering nearly every format in common use. It is also a showcase for the "retree" algorithm.
MIT License

Wrong selection of a matching rule #11

Closed rssdev10 closed 3 years ago

rssdev10 commented 3 years ago

Hi, I'm trying to parse dates in a month-year format. This format, with no day, is very common in documents such as CVs. But I found that I cannot add a custom rule for dates like the following:

September 2010
September/2003
DateParser parser = DateParser.newBuilder()
                    .addRule("(?<month>september)\\s{1,4}(?<year>\\d{4})")
                    .addRule("(?<month>\\w+)\\s{1,4}(?<year>\\d{4})")
                    .addRule("(?<month>\\w+)/(?<year>\\d{4})")
                    .build();
Calendar calendar = parser.parseCalendar(date.toLowerCase());

I added the custom rules and verified that they work as ordinary regular expressions, but I still get the error "Text september 2010 cannot parse at 12". The cause appears to be in this code:

    // DateParser.parse (abridged)
    private void parse(final CharArray input) {
        matcher.reset(input);
        int offset = 0;
        int oldEnd = -1;
        while (matcher.find(offset)) {
            // ....
        }
        if (offset != input.length()) {
            throw error(offset);
        }
    }

Every time, matcher.re() is equal to (?<month>september)\W+(?<day>\d{1,2})(?:th)?\W* and the offset ends at 12 instead of 14. That built-in rule matches only "september 20", so it does not cover the whole input.
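
The offset of 12 can be reproduced outside the library. The following is a minimal sketch that uses java.util.regex as a stand-in for the library's own matching engine (an assumption; the two engines may differ in details). The class and method names are hypothetical and exist only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of dateparser.
class PrefixMatchDemo {
    // The built-in month/day rule quoted in this issue, compiled with java.util.regex.
    static final Pattern MONTH_DAY =
            Pattern.compile("(?<month>september)\\W+(?<day>\\d{1,2})(?:th)?\\W*");

    // Number of leading characters the rule consumes, or -1 if it does not match at offset 0.
    static int consumed(String input) {
        Matcher m = MONTH_DAY.matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String input = "september 2010";
        int end = consumed(input);
        // prints: consumed 12 of 14 characters
        System.out.println("consumed " + end + " of " + input.length() + " characters");
        // prints: unparsed remainder: "10"
        System.out.println("unparsed remainder: \"" + input.substring(end) + "\"");
    }
}
```

The day group takes "20" from the year, and the two trailing digits are left over, which matches the position reported in the error message.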

Is there any way to force the longest match instead of taking the first one? Or to return all candidate matches instead of failing outright?
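
To make the request concrete, here is a rough sketch of longest-match selection written against java.util.regex. It is not the library's API or its algorithm; LongestMatchDemo, RULES, and longestRule are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of longest-match rule selection, not dateparser code.
class LongestMatchDemo {
    // The built-in rule from this issue followed by two of the custom rules.
    static final List<Pattern> RULES = Arrays.asList(
            Pattern.compile("(?<month>september)\\W+(?<day>\\d{1,2})(?:th)?\\W*"),
            Pattern.compile("(?<month>\\w+)\\s{1,4}(?<year>\\d{4})"),
            Pattern.compile("(?<month>\\w+)/(?<year>\\d{4})"));

    // Index of the rule that consumes the most characters starting at offset,
    // or -1 if no rule consumes anything. Ties go to the earlier rule.
    static int longestRule(List<Pattern> rules, String input, int offset) {
        int best = -1;
        int bestEnd = offset;
        for (int i = 0; i < rules.size(); i++) {
            Matcher m = rules.get(i).matcher(input);
            m.region(offset, input.length());
            if (m.lookingAt() && m.end() > bestEnd) {
                best = i;
                bestEnd = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // The month/year rules win because they consume all 14 characters.
        System.out.println(longestRule(RULES, "september 2010", 0)); // prints 1
        System.out.println(longestRule(RULES, "september/2003", 0)); // prints 2
    }
}
```

Trying every rule at every offset costs more than a single combined match, so this only illustrates the selection policy being asked for.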

sisyphsu commented 3 years ago

I have upgraded the version to 1.0.7, which should fix this error.

(?<month>september)\W+(?<day>\d{1,2})(?:th)?\W* has been changed to (?<month>september)\W+(?<day>\d{1,2})(?:th)?\b, which should avoid this conflict.
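
The effect of the trailing \b can be checked with a small comparison. This is a sketch under the assumption that java.util.regex treats \b the same way as the library's engine. WordBoundaryDemo and matchesPrefix are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical comparison of the old and new built-in rule, not dateparser code.
class WordBoundaryDemo {
    static final Pattern OLD_RULE =
            Pattern.compile("(?<month>september)\\W+(?<day>\\d{1,2})(?:th)?\\W*");
    static final Pattern NEW_RULE =
            Pattern.compile("(?<month>september)\\W+(?<day>\\d{1,2})(?:th)?\\b");

    // True if the rule matches some prefix of the input.
    static boolean matchesPrefix(Pattern rule, String input) {
        return rule.matcher(input).lookingAt();
    }

    public static void main(String[] args) {
        String[] inputs = {"september 2010", "september 20th", "september 5"};
        for (String input : inputs) {
            System.out.println(input
                    + " -> old: " + matchesPrefix(OLD_RULE, input)
                    + ", new: " + matchesPrefix(NEW_RULE, input));
        }
    }
}
```

With \b, the day group can no longer stop in the middle of a run of digits, so the new rule rejects "september 2010" while still accepting genuine month-day inputs such as "september 20th".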