Float parsing uses `strtod`, but rline calls `setlocale`

kseistrup commented 8 months ago

Re.: commit f1d7bdaa5a85cbba913495deb8f22c29ef0bca3e

It seems math.ceil(x) and math.floor(x) returns the same value for a given x

>>> import math
>>> math.ceil(0.1) == math.floor(0.1)  # 1 == 0?
 => True
>>> math.ceil(0.9) == math.floor(0.9)  # 1 == 0?
 => True

klange commented 8 months ago

Can not reproduce?

>>> math.floor(0.1) == math.ceil(0.1)
 => False
>>> math.floor(0.9) == math.ceil(0.9)
 => False
>>> math.floor(0.1)
 => 0
>>> math.ceil(0.1)
 => 1
>>> math.ceil(0.9)
 => 1
>>> math.floor(0.9)
 => 0

These functions are simple wrappers around libm math functions and appear to be working as intended on the platforms I have tested on.

kseistrup commented 8 months ago

I believe I have found the general area where the culprit is lurking, although I don't know what is going on, or how to fix it.

I have focused solely on ceil(0.5), which should always return 1.

If I launch kuroko with the environment variable $LC_ALL set to one of C, POSIX, en_US.UTF-8, en_GB.UTF-8, or en_CA.UTF-8 (i.e., env LC_ALL=x_Y.UTF-8 kuroko), ceil(0.5) will give the right result: 1.

If I launch kuroko with any of en_DK.UTF-8 (the default here), da_DK.UTF-8, nb_NO.UTF-8, nn_NO.UTF-8, or sv_SE.UTF-8, ceil(0.5) will give the wrong result: 0.

(These were the available locales on my boxen.)

My Python interpreter, as well as the S-Lang shell slsh, both of which are linked to libm, give the right result under all locales.

I'm gonna go out on a limb here, although it could be totally wrong:

At least one of the characteristics of the locales that give the right result is that they use . (FULL STOP) as decimal separator, whereas all the locales that give with the wrong result are using , (COMMA) as decimal separator.

Now, irregardless of the locale, kuroko does not accept ceil(0,5) as valid input (ArgumentError: ceil() expects one argument), but could some other part of kuroko become confused if the decimal separator is something else than a FULL STOP?

klange commented 8 months ago

Using Kuroko with locales other than C is currently unsupported - strtod is used to parse floats.

Please confirm that you get the misparse of 0.5 as 0.0 in the repl.

klange commented 8 months ago

This should not normally be an issue when running scripts, but rline calls setlocale: https://github.com/kuroko-lang/kuroko/blob/master/src/vendor/rline.c#L2433

If you wish to use the repl with a locale other than C, please disable rline with the the -r option.

kseistrup commented 8 months ago

Using Kuroko with locales other than C is currently unsupported

Well, that was an unfortunate turn of events…

If you wish to use the repl with a locale other than C, please disable rline with the the -r option.

Perhaps rline should not call setlocale() when it is not supported?

There's something I don't understand: When rline.c explicitly sets $LC_ALL to C:

    setlocale(LC_ALL, "C.UTF-8");

why is the locale of the environment a problem at all?

kseistrup commented 8 months ago

PS: Oh, and yes: the REPL does interpret 0.5 as 0.0.

klange commented 8 months ago

Perhaps rline should not call setlocale() when it is not supported?

rline needs to run setlocale or wcwidth will return nonsensical values for the widths of characters, and it becomes difficult to type, eg., Japanese. I may modify rline to go through the steps of saving the current locale before it calls setlocale, and restoring it afterwards, which may resolve this.

There's something I don't understand: When rline.c explicitly sets $LC_ALL to C:

rline only does that on Windows. On other platforms, it calls setlocale with "", as specified in the manpage:

If locale is an empty string, "", each part of the locale that should be modified is set according to the environment variables.

rline, of course, does not care about things like numeric formatting, and was originally written for a platform that had no locale support...

kseistrup commented 8 months ago

rline only does that on Windows

Ah yes, I misread the #ifndef as #ifdef. My fault.

I'm not sure I'm prepared to give up rline, so I have mitigated the problem by aliasing kuroko to env LC_ALL=C.utf8 /usr/bin/kuroko $argv in fish (my interactive shell).

klange commented 8 months ago

Changing the title of this issue to reflect the underlying problem, so it can be more directly addressed.

Given that I want Kuroko to be usable in an embedded environment where someone else may have already been using setlocale, fixing rline to deal with locales in a re-entrant manner is only half of a fix... ideally, the float parsing needs to be in-housed. There's also the issue of printf formatting. In both cases, there is a brute-force option available: determine the radix symbol and replace as needed. This still leaves many edge cases in both directions, though.

I have a locale-oblivious strtod here: https://github.com/klange/toaruos/blob/master/libc/stdlib/strtod.c - unfortunately, this implementation is inaccurate, and crafting a good one is tricky due to floating point precision loss.

I don't have a complete in-house float printer - the one in ToaruOS is very incomplete.

kseistrup commented 8 months ago

I wonder if the issue could be mitigated for “everyone” in this way [until a more solid fix is in place]:

Do the Windows/non-Windows setlocale()-thing as it is now, then (in some kind of pseudo-C):

if (get_decimal_separator() != '.') {
  setlocale(LC_ALL, "C.UTF-8");
  warn_user_that_locale_has_been_changed();
}

klange commented 8 months ago

Possibly.

Checking the radix symbol in a portable manner is slightly tricky, but a quick hack might be to do if (strtod("0.5",NULL) != 0.5) setlocale(LC_ALL, "C.UTF-8");, just in rline, and see where that gets us.

klange commented 8 months ago

Not going to print a warning for now, especially since rline does its locale setting every time a new prompt is shown, but see if this helps for you: https://github.com/kuroko-lang/kuroko/commit/0c8cb94b6acb52f4b9f00cda4f4b65bd3b02bf94

kseistrup commented 8 months ago

Thanks, that quick hack works great for me. :+1:

Is it really necessary to set the locale for each and every time rline() has been called? Could “we” just set it upon first invocation and flag that the locale has been set? That way it's just a matter of checking a boolean variable.

Checking the radix symbol in a portable manner is slightly tricky

Could something like this be called portable (disclaimer, I don't speak C):

#include <locale.h>
#include <stdio.h>

int main() {
  (void) setlocale(LC_ALL, "");
  struct lconv *lc = localeconv();
  (void) printf("Decimal separator: %s\n", lc->decimal_point);
  return 0;
}

:paperclip: radix.c

kseistrup commented 8 months ago

E.g.:

#ifndef _WIN32
   static int locale_has_been_initialized = 0;
   if (locale_has_been_initialized == 0) {
      (void) setlocale(LC_ALL, "");
      struct lconv *lc = localeconv();
      if (strcmp(lc->decimal_point, ".") != 0) {
         (void) setlocale(LC_ALL, "C.UTF-8");
      }
      locale_has_been_initialized = 1;
   }
#else

kseistrup commented 8 months ago

Come to think of it: kuroko should not behave differently in a script and in the REPL. The path of least surprise is to setlocale() at the top the main() function, before arguments are parsed and other routines are called.

Let me see if I can provide a sensible pull-request…

klange commented 8 months ago

This misunderstands the relationship between rline and Kuroko. rline is not part of Kuroko, and it must independently ensure that it is operating in a viable locale to perform its function as an interactive line editor. What rline should do is restore the original locale from before it was called when resuming to the caller.

Meanwhile, as long as Kuroko relies on the functionality of strtod and other locale-dependent standard library functions, it should enforce the C locale. Ideally, Kuroko should not rely on these things and use its own implementations. However, as the main roadblock here is strtod and that is not a simple thing to implement well, this is not something that can be done in the short term.

kseistrup commented 8 months ago

Thanks, I get your point. Makes sense.

Meanwhile you could consider if the call to localeconv() wouldn't be a more portable, and certainly less kludgy, way of make sure that the decimal point is a full stop.

klange commented 8 months ago

My current thought for an improved kludge is to have rline only set the LC_CTYPE locale setting, and only if it determines the current locale does not think (wchar_t)0x3024 (あ) is 2 cells wide. Kuroko does not make use of functions that rely on the LC_CTYPE locale, only LC_NUMERIC.

Once Kuroko is free of locale-dependent functions on its own, the C locale functionality can be exposed which will allow the locale to be set when calling, eg., time.strftime (which currently suffers from a similar issue where rline's use of setlocale(LC_ALL, ...) means the repl operates differently).

Looking through the codebase, functions which may be locale-dependent include:

strtod, used in the compiler and string.__float__, as discussed here.
Calls to printf-family functions with %.16g to format floats (generally only in float.__repr__, though also in the debug output printing to "safely" print float objects, but that does not matter).
Possibly calls to atoi (used in the interactive debugger, and once to parse a command-line argument) and strtoul (used once in the compiler to parse hex escapes in strings, and once in str.format to parse argument position numbers in format strings), which may erroneously accept digit separators in some locales, though I don't think any of these cases is problematic and all locales are required to handle ASCII western Arabic numerals in the expected manner.

kseistrup commented 8 months ago

You know your code infinitely better than I do, so let me ask you this naive question:

Would such a locale exist that considers HIRAGANA LETTER A (あ) “2 cells wide”, that would miscalculate e.g. LOTUS (🪷)?

I recently had a problem with the editor JED not being able to correctly edit a line containing the LOTUS character:

https://github.com/jedsoft/jed/issues/9

Perhaps HIRAGANA LETTER A will be enough to asses a locale, but I thought I'd mention it anyway.

klange commented 8 months ago

This is a complicated question. In order for any program to correctly handle this, its own wcwidth results must match the terminal's wcwidth results (or whatever the terminal uses to determine the widths of displayed text - which might not even be wcwidth!). If the terminal and the program disagree on locales, these results might differ. Worse yet, if the terminal and the program are running on completely different systems (eg., the program is running over ssh), then they may have completely different implementations of wcwidth even with the same locale, using different databases. Emoji are a common stumbling point here: earlier databases will claim their width is 1, while newer ones may say 2, and if the two sides disagree then things can get weird.

klange commented 8 months ago

I have made some progress towards implementing a proper in-house strtod by making use of Kuroko's existing bigints: taking the decimal representation of the significant digits and dividing them by the appropriate power of 10, extracting enough bits to fill in a double. This work will likely make its way into the implementation of long.__truediv__, which currently cheats and may produce incorrect results as it converts both operands to doubles before dividing. It will take some more work to turn this into a strtod implementation suitable for Kuroko's use cases, but the end result should be a robust and accurate replacement that does not depend on locale settings.

kseistrup commented 8 months ago

Thank you for your patience, it's highly appreciated. :pray:

klange commented 8 months ago

The new long.__truediv__ implementation has been pushed, so at least now you can accurately write floats as the division of two large integers and that should always work regardless of any locale interventions. I'm not 100% certain the rounding behavior is correct, but it worked for some edge case examples, and supports subnormals. With this, a pure-Kuroko implementation of float parsing should be possible as well (though you'd need to pull in math to return appropriate nan and inf values), and the final implementations of the compiler's parser and str.__float__ will be in C anyway).

I also deployed the described change to rline, so the time module should no longer behave differently in the repl and will always produce C/English output (until we have a locale module to manipulate the locale settings).

kseistrup commented 8 months ago

Great news, thanks! :pray:

klange commented 8 months ago

I am pleased to note that in the in-progress in-house-float-conversions branch (https://github.com/kuroko-lang/kuroko/tree/in-house-float-conversions), we now have:

Accurately rounded long.__truediv__, which means double-precision floats can be accurately represented as fractions or other integer-based expressions.
A new float printer, with support for various formatting options, with no locale dependencies.
A new float parser, which makes use of the long.__truediv__ support, and has no locale dependencies.
Support for e exponents in float literals in the compiler.
Much faster implementations of long.__rshift__ and long.__lshift__.

There is still some work to do to clean up and verify all of this before merging it.

klange commented 8 months ago

That branch has been merged, and Kuroko no longer uses strtod to parse floats, or printf to convert them to strings.

kuroko-lang / kuroko

Float parsing uses `strtod`, but rline calls `setlocale` #47