pocoproject / poco

The POCO C++ Libraries are powerful cross-platform C++ libraries for building network- and internet-based applications that run on desktop, server, mobile, IoT, and embedded systems.
https://pocoproject.org
Other
8.04k stars 2.11k forks source link

Significant performance degradation of Poco::DateTimeParser #4592

Open tyler92 opened 1 week ago

tyler92 commented 1 week ago

Consider the following code:

#include <Poco/DateTimeFormat.h>
#include <Poco/DateTimeFormatter.h>
#include <Poco/DateTimeParser.h>
#include <stdexcept>

std::string toString(const Poco::DateTime& dt)
{
    return Poco::DateTimeFormatter::format(dt, Poco::DateTimeFormat::ISO8601_FRAC_FORMAT);
}

Poco::DateTime parsePoco(const std::string& text)
{
    int tzd = 0;
    const auto format = Poco::DateTimeFormat::ISO8601_FRAC_FORMAT;
    auto parsed = Poco::DateTimeParser::parse(format, text, tzd);
    parsed.makeUTC(tzd);
    return parsed;
}

void ASSERT_EQ(const std::string& a, const std::string& b)
{
    if (a != b)
    {
        throw std::runtime_error("Test FAILED");
    }
}

int main()
{
    for (int i = 0; i < 100'000; ++i)
    {
        auto parsed = parsePoco("2024-06-21T09:43:11.340159Z");
        ASSERT_EQ(toString(parsed), "2024-06-21T09:43:11.340159Z");

        parsed = parsePoco("2024-06-21T09:43:11.340159+07:00");
        ASSERT_EQ(toString(parsed), "2024-06-21T02:43:11.340159Z");
    }

    return 0;
}

Execution time (Release configuration):

Poco 1.12.5:    96 ms
Poco 1.13.3: 17179 ms

With the format Poco::DateTimeFormat::ISO8601_FRAC_FORMAT + " " (with an additional space) execution time becomes better for 1.13.3.

I guess the reason is the following commit: https://github.com/pocoproject/poco/commit/4f1cf683079cc3b00be96259fc42f5c291ab4c77 (https://github.com/pocoproject/poco/pull/4330).

Up to 8 regex expressions are compiled on each call

https://github.com/pocoproject/poco/blob/4f1cf683079cc3b00be96259fc42f5c291ab4c77/Foundation/src/DateTimeFormat.cpp#L153-L156

if the provided format is known:

https://github.com/pocoproject/poco/blob/4f1cf683079cc3b00be96259fc42f5c291ab4c77/Foundation/src/DateTimeParser.cpp#L47

aleks-f commented 1 week ago

Yeah, well - it's either that or we silently accept garbage, as we used to. If you have an improvement proposal, send a PR

andrewauclair commented 1 week ago
bool DateTimeFormat::isValid(const std::string& dateTime)
{

        static const RegularExpression regs[] = {
                RegularExpression(DateTimeFormat::ISO8601_REGEX),
                RegularExpression(DateTimeFormat::RFC822_REGEX),
                RegularExpression(DateTimeFormat::RFC1123_REGEX),
                RegularExpression(DateTimeFormat::HTTP_REGEX),
                RegularExpression(DateTimeFormat::RFC850_REGEX),
                RegularExpression(DateTimeFormat::RFC1036_REGEX),
                RegularExpression(DateTimeFormat::ASCTIME_REGEX),
                RegularExpression(DateTimeFormat::SORTABLE_REGEX)
        };

        for (const auto& f : regs)
        {
                if (f.match(dateTime)) return true;
        }
        return false;
}

Creating the RegularExpressions once in isValid takes it from over 12s to .16s for me.

tyler92 commented 1 week ago

Why are all regexes checked, and not just one? E.g. for the following code

const auto format = Poco::DateTimeFormat::ISO8601_FRAC_FORMAT;
auto parsed = Poco::DateTimeParser::parse(format, text, tzd);

I would expect that only ISO8601_REGEX will be checked. And why is user input not checked against these regexes if DateTimeFormat::hasFormat(fmt) is false?

tyler92 commented 1 week ago

About handling garbage input. Let's consider the following code:

https://github.com/pocoproject/poco/blob/4f1cf683079cc3b00be96259fc42f5c291ab4c77/Foundation/src/DateTimeParser.cpp#L108-L110

https://github.com/pocoproject/poco/blob/4f1cf683079cc3b00be96259fc42f5c291ab4c77/Foundation/src/DateTimeParser.cpp#L25-L26

https://github.com/pocoproject/poco/blob/4f1cf683079cc3b00be96259fc42f5c291ab4c77/Foundation/src/DateTimeParser.cpp#L37-L38

Points for discussion:

  1. SKIP_JUNK accepts garbage input
  2. PARSE_NUMBER_N doesn't report an error if the input doesn't contain a number at all

I hope the parser body can be improved so that it is more strict and regular expressions can be avoided.

aleks-f commented 1 week ago

@andrewauclair can you look at this issue and propose improvements?