pistacheio / pistache

A high-performance REST toolkit written in C++
https://pistacheio.github.io/pistache/
Apache License 2.0
3.12k stars 690 forks source link

Accept header parsing is locale-dependent, fails for valid values #1076

Closed rems4e closed 2 years ago

rems4e commented 2 years ago

My default Firefox Accept header looks like this: Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8.

I suspect it's similar for a lot of people.

Pistache fails to parse the header, because of the q=0.9 (and q=0.8 for the matter) quality factor.

Stepping through the debugger, I traced the problem here: https://github.com/pistacheio/pistache/blob/master/src/common/mime.cc#L243

match_double internally calls strtod, which is locale dependent. My locale has , as a decimal separator, which means strtod only consumes the "0" of "0.9", leaving the "." as the next token, and thus breaking the parser, resulting in the "Unfinished Media Type parameter".

The solution to this would be parsing the quality factor in a locale-independent way, as seems correct (see https://httpwg.org/specs/rfc7231.html#rfc.section.5.3.1).

To do so, I propose getting inspiration from g++ stdlib here: https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/src/c%2B%2B17/floating_from_chars.cc#L331, the underlying implementation of C++17's from_chars (i.e. temporarily changing the locale before parsing, and restoring it after).

I'd be happy to provide the fix if the solution seems ok.

dennisjenkins75 commented 2 years ago

This is an excellent bug report. Please feel free to submit a PR. Be sure to include unit tests that exercise your logic changes. Try setting a few different locales in the test, as I suspect that our CI unit tests run with locale=en.US (or whatever).

kiplingw commented 2 years ago

Very good find @rems4e. Yes, I concur with my colleague. It would be a good idea to submit a unit test or tests to verify it works with different locales. One thing that might get tricky is LANG_C will always be present, but another locale might not be on the machine running the test. So you might have to get creative.

dennisjenkins75 commented 2 years ago

We should update our CI unit test env to install some additional locales then. Maybe one European locale, one from Asia, one RTL one (Hebrew or Arabic) and something from Africa.

kiplingw commented 2 years ago

Yeah I thought about that already. It's very easy to do in a DEP-8 CI environment, you just add some other locales to the list of build dependencies in debian/control. But for people building and testing a local tree, it would be weird if they were expected to manually install Estonian, Tagalog, or something like that.

dennisjenkins75 commented 2 years ago

Usually, unit tests are supposed to be rather static and not dynamic. However, in this case we have a few options:

  1. Enumerate the local locales, run the core test with a few selected at random, plus whatever locale matches the HTTP spec for parsing this double field.
  2. Rewrite the code to use a custom double parser that adheres to the spec, and ignores locale and do NOT use a locale-aware parser.
  3. Proceed with the unit tests and code patch as planned, and hard-code a small list of locales to test from. If any of those locales are not present, then either ASSERT (fail) the test, or log a non-fatal warning that locale testing was skipped. Either way, the official test env will have them, as you cited above.

I'm somewhat in favor of option #2. If the HTTP spec has a specific format for floating-point numbers that ignores the calling thread's current locale, then we should just code out own. Its ~15 lines of code or less, and some unit tests to verify correctness.

Tachi107 commented 2 years ago

To do so, I propose getting inspiration from g++ stdlib here: https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/src/c%2B%2B17/floating_from_chars.cc#L331, the underlying implementation of C++17's from_chars (i.e. temporarily changing the locale before parsing, and restoring it after).

Why not using std::from_chars directly? We target C++17

dennisjenkins75 commented 2 years ago

To do so, I propose getting inspiration from g++ stdlib here: https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/src/c%2B%2B17/floating_from_chars.cc#L331, the underlying implementation of C++17's from_chars (i.e. temporarily changing the locale before parsing, and restoring it after).

Why not using std::from_chars directly? We target C++17

  1. To avoid all of the complexity of saving a locale, switching it, then switching back.
  2. (assumed) Performance: Pistachio is supposed to be focused on performance.

Parsing a double of the form "nnnn[.nnnnn]" is trivially easy to code in a non-locale dependent manner.

Tachi107 commented 2 years ago
  • To avoid all of the complexity of saving a locale, switching it, then switching back.

from_chars is supposed to ignore locales, not to switch between them

  • (assumed) Performance: Pistachio is supposed to be focused on performance.

Quoting cppreference, _"std::fromchars is locale-independent, non-allocating, and non-throwing. Only a small subset of parsing policies used by other libraries (such as std::sscanf) is provided. This is intended to allow the fastest possible implementation that is useful in common high-throughput contexts such as text-based interchange"

Unfortunately it seems that libstdc++ is completely ignoring this fact and simply using std::strtod plus some locale trickery... Pretty disappointing :/

kiplingw commented 2 years ago

On Fri, 2022-05-27 at 13:28 -0700, Andrea Pappacoda wrote:

Unfortunately it seems that libstdc++ is completely ignoring this fact and simply using std::strtod plus some locale trickery... Pretty disappointing :/

That might be worth filing an upstream bug report with gcc. They're usually pretty good about dealing with things like this.

-- Kip Warner OpenPGP signed/encrypted mail preferred https://www.thevertigo.com

Tachi107 commented 2 years ago

That might be worth filing an upstream bug report with gcc. They're usually pretty good about dealing with things like this.

@kiplingw they seem to already know how bad their implementation is; there's a FIXME at the beginning of the file and the original patch message clearly says that the implementation is an ugly hack: https://gcc.gnu.org/pipermail/gcc-patches/2020-July/550331.html

  • (assumed) Performance: Pistachio is supposed to be focused on performance.

@dennisjenkins75 if we're concerned about performance we could (as you say) implement a small "parser", especially because the HTTP standard specifies that the quality value is a really, really simple number (0 ≤ q ≤ 1, q has no more than three digits after the dot). Worth mentioning that there are good implementation of from_chars, like fast_float

rems4e commented 2 years ago

btw, https://en.cppreference.com/w/cpp/compiler_support#cpp17 says that from_chars with FP support is essentially only supported by g++ 11, and not at all with clang, which is most certainly a deal breaker.

Tachi107 commented 2 years ago

btw, https://en.cppreference.com/w/cpp/compiler_support#cpp17 says that from_chars with FP support is essentially only supported by g++ 11, and not at all with clang, which is most certainly a deal breaker.

It is not supported with libc++, i.e. on MacOS. Clang uses GCC's libstdc++ on Linux by default.

edit: I didn't know the from_chars situation sucked this much though

-- OpenPGP key: 66DE F152 8299 0C21 99EF A801 A8A1 28A8 AB1C EE49

rems4e commented 2 years ago

That doesn't change much, does it? It's unavailable for probably something like 90%+ Linux users, in addition to being a hack for the remaining 10%.

Tachi107 commented 2 years ago

What are you referring to?

Edit: oh, I got it, missed the libstdc++11 part

Tachi107 commented 2 years ago

@dennisjenkins75 if we're concerned about performance we could (as you say) implement a small "parser", especially because the HTTP standard specifies that the quality value is a really, really simple number (0 ≤ q ≤ 1, q has no more than three digits after the dot).

Naive implementation that surely has a lot of bugs:

bool str_to_qvalue(const char* str, float* qvalue) {
    constexpr unsigned int offset = 48;

    size_t str_len = 0;

    for (; str_len < 5; str_len++) {
        if (str[str_len] != '.' && !std::isdigit(str[str_len])) {
            break;
        }
    }

    if (str_len < 1 || str_len == 2 || str_len > 5) {
        return false;
    }

    if (str[0] != '0' && str[0] != '1') {
        return false;
    }

    *qvalue = 0;

    switch (str_len) {
    case 5:
        *qvalue += (str[4] - offset) / 1000.0F;
    case 4:
        *qvalue += (str[3] - offset) / 100.0F;
    case 3:
        if (str[1] != '.') {
            return false;
        }
        *qvalue += (str[2] - offset) / 10.0F;
    case 1:
        *qvalue += str[0] - offset;
    };

    if (*qvalue > 1) {
        return false;
    }

    return true;
}