sisyphsu / dateparser

dateparser is a smart, high-performance date parser library. It supports hundreds of different formats, covering nearly every format you are likely to use. It is also a showcase for the "retree" algorithm.
MIT License
95 stars, 24 forks

Improve performance when parsing many strings in the same format #28

Open robin-xyzt-ai opened 1 year ago

robin-xyzt-ai commented 1 year ago

Proposal for https://github.com/sisyphsu/dateparser/issues/17 .

By keeping track of which rules were used to parse the first string, the parser can first try a matcher that contains only that subset of rules when parsing subsequent strings.
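A minimal sketch of the idea, not the code in this PR: the class `ReuseAwareParser`, its `fullScans` counter, and the three `DateTimeFormatter` patterns are hypothetical stand-ins for the library's retree rules. It remembers the rule that parsed the previous input, tries it first, and falls back to the full rule set when the format changes.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;

/** Hypothetical illustration: remember the rule that parsed the last input and try it first. */
final class ReuseAwareParser {
    // Stand-ins for the full rule set; the real library matches retree rules, not formatters.
    private final List<DateTimeFormatter> allRules = List.of(
            DateTimeFormatter.ofPattern("uuuu-MM-dd"),
            DateTimeFormatter.ofPattern("dd/MM/uuuu"),
            DateTimeFormatter.ofPattern("uuuu.MM.dd"));

    private DateTimeFormatter lastUsed; // rule that succeeded on the previous input, if any
    private int fullScans = 0;          // number of slow-path lookups, for illustration only

    public LocalDate parse(String input) {
        if (lastUsed != null) {
            try {
                // Fast path: assume the input has the same format as the previous one.
                return LocalDate.parse(input, lastUsed);
            } catch (DateTimeParseException ignored) {
                // The format changed, so fall through to the full rule set.
            }
        }
        fullScans++;
        for (DateTimeFormatter rule : allRules) {
            try {
                LocalDate result = LocalDate.parse(input, rule);
                lastUsed = rule; // remember the winning rule for the next call
                return result;
            } catch (DateTimeParseException ignored) {
                // This rule does not match; try the next one.
            }
        }
        throw new IllegalArgumentException("Unparseable date: " + input);
    }

    public int fullScans() {
        return fullScans;
    }
}
```

The fallback keeps the optimization safe when the input format changes mid-stream, at the cost of one failed attempt. The sketch is not thread-safe because `lastUsed` is mutable state shared between calls.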

The case in the benchmark is between 2 and 3 times faster on my machine:

Benchmark                                                          Mode  Cnt     Score     Error  Units
OptimizeForReuseSimilarFormattedBenchmark.optimizedForReuseParser  avgt    6   462.362 ±  54.300  ms/op
OptimizeForReuseSimilarFormattedBenchmark.regularParser            avgt    6  1130.171 ± 162.117  ms/op

robin-xyzt-ai commented 1 year ago

I tried to make the code clearer by adding some comments and doing a bit more cleanup.

Let me know if there are specific parts that are still unclear.

robin-xyzt-ai commented 1 year ago

Looks like this PR isn't ready to be merged. The following test fails:

    @Test
    void foo() {
        DateParser parser = DateParser.newBuilder().optimizeForReuseSimilarFormatted(true).build();
        String inputString = "2022-08-09 19:04:31.600000+00:00";
        assertEquals(parser.parseDate(inputString), parser.parseDate(inputString));
    }

I'm afraid it will require a fix for https://github.com/sisyphsu/dateparser/issues/29 first.