Closed rems4e closed 2 years ago
This is an excellent bug report. Please feel free to submit a PR. Be sure to include unit tests that exercise your logic changes. Try setting a few different locales in the test, as I suspect that our CI unit tests run with locale=en.US (or whatever).
Very good find @rems4e. Yes, I concur with my colleague. It would be a good idea to submit a unit test or tests to verify it works with different locales. One thing that might get tricky is LANG_C will always be present, but another locale might not be on the machine running the test. So you might have to get creative.
We should update our CI unit test env to install some additional locales then. Maybe one European locale, one from Asia, one RTL one (Hebrew or Arabic) and something from Africa.
Yeah I thought about that already. It's very easy to do in a DEP-8 CI environment, you just add some other locales to the list of build dependencies in debian/control
. But for people building and testing a local tree, it would be weird if they were expected to manually install Estonian, Tagalog, or something like that.
Usually, unit tests are supposed to be rather static and not dynamic. However, in this case we have a few options:
I'm somewhat in favor of option #2. If the HTTP spec has a specific format for floating-point numbers that ignores the calling thread's current locale, then we should just code out own. Its ~15 lines of code or less, and some unit tests to verify correctness.
To do so, I propose getting inspiration from g++ stdlib here: https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/src/c%2B%2B17/floating_from_chars.cc#L331, the underlying implementation of C++17's
from_chars
(i.e. temporarily changing the locale before parsing, and restoring it after).
Why not using std::from_chars
directly? We target C++17
To do so, I propose getting inspiration from g++ stdlib here: https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/src/c%2B%2B17/floating_from_chars.cc#L331, the underlying implementation of C++17's
from_chars
(i.e. temporarily changing the locale before parsing, and restoring it after).Why not using
std::from_chars
directly? We target C++17
Parsing a double of the form "nnnn[.nnnnn]" is trivially easy to code in a non-locale dependent manner.
- To avoid all of the complexity of saving a locale, switching it, then switching back.
from_chars
is supposed to ignore locales, not to switch between them
- (assumed) Performance: Pistachio is supposed to be focused on performance.
Quoting cppreference, _"std::fromchars is locale-independent, non-allocating, and non-throwing. Only a small subset of parsing policies used by other libraries (such as std::sscanf) is provided. This is intended to allow the fastest possible implementation that is useful in common high-throughput contexts such as text-based interchange"
Unfortunately it seems that libstdc++ is completely ignoring this fact and simply using std::strtod
plus some locale trickery... Pretty disappointing :/
On Fri, 2022-05-27 at 13:28 -0700, Andrea Pappacoda wrote:
Unfortunately it seems that libstdc++ is completely ignoring this fact and simply using std::strtod plus some locale trickery... Pretty disappointing :/
That might be worth filing an upstream bug report with gcc. They're usually pretty good about dealing with things like this.
-- Kip Warner OpenPGP signed/encrypted mail preferred https://www.thevertigo.com
That might be worth filing an upstream bug report with gcc. They're usually pretty good about dealing with things like this.
@kiplingw they seem to already know how bad their implementation is; there's a FIXME at the beginning of the file and the original patch message clearly says that the implementation is an ugly hack: https://gcc.gnu.org/pipermail/gcc-patches/2020-July/550331.html
- (assumed) Performance: Pistachio is supposed to be focused on performance.
@dennisjenkins75 if we're concerned about performance we could (as you say) implement a small "parser", especially because the HTTP standard specifies that the quality value is a really, really simple number (0 ≤ q ≤ 1, q has no more than three digits after the dot). Worth mentioning that there are good implementation of from_chars, like fast_float
btw, https://en.cppreference.com/w/cpp/compiler_support#cpp17 says that from_chars with FP support is essentially only supported by g++ 11, and not at all with clang, which is most certainly a deal breaker.
btw, https://en.cppreference.com/w/cpp/compiler_support#cpp17 says that from_chars with FP support is essentially only supported by g++ 11, and not at all with clang, which is most certainly a deal breaker.
It is not supported with libc++, i.e. on MacOS. Clang uses GCC's libstdc++ on Linux by default.
edit: I didn't know the from_chars situation sucked this much though
-- OpenPGP key: 66DE F152 8299 0C21 99EF A801 A8A1 28A8 AB1C EE49
That doesn't change much, does it? It's unavailable for probably something like 90%+ Linux users, in addition to being a hack for the remaining 10%.
What are you referring to?
Edit: oh, I got it, missed the libstdc++11 part
@dennisjenkins75 if we're concerned about performance we could (as you say) implement a small "parser", especially because the HTTP standard specifies that the quality value is a really, really simple number (0 ≤ q ≤ 1, q has no more than three digits after the dot).
Naive implementation that surely has a lot of bugs:
bool str_to_qvalue(const char* str, float* qvalue) {
constexpr unsigned int offset = 48;
size_t str_len = 0;
for (; str_len < 5; str_len++) {
if (str[str_len] != '.' && !std::isdigit(str[str_len])) {
break;
}
}
if (str_len < 1 || str_len == 2 || str_len > 5) {
return false;
}
if (str[0] != '0' && str[0] != '1') {
return false;
}
*qvalue = 0;
switch (str_len) {
case 5:
*qvalue += (str[4] - offset) / 1000.0F;
case 4:
*qvalue += (str[3] - offset) / 100.0F;
case 3:
if (str[1] != '.') {
return false;
}
*qvalue += (str[2] - offset) / 10.0F;
case 1:
*qvalue += str[0] - offset;
};
if (*qvalue > 1) {
return false;
}
return true;
}
My default Firefox
Accept
header looks like this:Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
.I suspect it's similar for a lot of people.
Pistache fails to parse the header, because of the
q=0.9
(andq=0.8
for the matter) quality factor.Stepping through the debugger, I traced the problem here: https://github.com/pistacheio/pistache/blob/master/src/common/mime.cc#L243
match_double
internally callsstrtod
, which is locale dependent. My locale has,
as a decimal separator, which meansstrtod
only consumes the "0" of "0.9", leaving the "." as the next token, and thus breaking the parser, resulting in the "Unfinished Media Type parameter".The solution to this would be parsing the quality factor in a locale-independent way, as seems correct (see https://httpwg.org/specs/rfc7231.html#rfc.section.5.3.1).
To do so, I propose getting inspiration from g++ stdlib here: https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/src/c%2B%2B17/floating_from_chars.cc#L331, the underlying implementation of C++17's
from_chars
(i.e. temporarily changing the locale before parsing, and restoring it after).I'd be happy to provide the fix if the solution seems ok.