Closed kseistrup closed 8 months ago
Can not reproduce?
>>> math.floor(0.1) == math.ceil(0.1)
=> False
>>> math.floor(0.9) == math.ceil(0.9)
=> False
>>> math.floor(0.1)
=> 0
>>> math.ceil(0.1)
=> 1
>>> math.ceil(0.9)
=> 1
>>> math.floor(0.9)
=> 0
These functions are simple wrappers around libm math functions and appear to be working as intended on the platforms I have tested on.
I believe I have found the general area where the culprit is lurking, although I don't know what is going on, or how to fix it.
I have focused solely on ceil(0.5)
, which should always return 1
.
If I launch kuroko with the environment variable $LC_ALL
set to one of C
, POSIX
, en_US.UTF-8
, en_GB.UTF-8
, or en_CA.UTF-8
(i.e., env LC_ALL=x_Y.UTF-8 kuroko
), ceil(0.5)
will give the right result: 1
.
If I launch kuroko with any of en_DK.UTF-8
(the default here), da_DK.UTF-8
, nb_NO.UTF-8
, nn_NO.UTF-8
, or sv_SE.UTF-8
, ceil(0.5)
will give the wrong result: 0
.
(These were the available locales on my boxen.)
My Python interpreter, as well as the S-Lang shell slsh
, both of which are linked to libm, give the right result under all locales.
I'm gonna go out on a limb here, although it could be totally wrong:
At least one of the characteristics of the locales that give the right result is that they use .
(FULL STOP
) as decimal separator, whereas all the locales that give with the wrong result are using ,
(COMMA
) as decimal separator.
Now, irregardless of the locale, kuroko does not accept ceil(0,5)
as valid input (ArgumentError: ceil() expects one argument
), but could some other part of kuroko become confused if the decimal separator is something else than a FULL STOP
?
Using Kuroko with locales other than C
is currently unsupported - strtod
is used to parse floats.
Please confirm that you get the misparse of 0.5
as 0.0
in the repl.
This should not normally be an issue when running scripts, but rline calls setlocale
: https://github.com/kuroko-lang/kuroko/blob/master/src/vendor/rline.c#L2433
If you wish to use the repl with a locale other than C
, please disable rline with the the -r
option.
Using Kuroko with locales other than
C
is currently unsupported
Well, that was an unfortunate turn of events…
If you wish to use the repl with a locale other than
C
, please disable rline with the the -r option.
Perhaps rline should not call setlocale()
when it is not supported?
There's something I don't understand: When rline.c
explicitly sets $LC_ALL
to C:
setlocale(LC_ALL, "C.UTF-8");
why is the locale of the environment a problem at all?
PS: Oh, and yes: the REPL does interpret 0.5
as 0.0
.
Perhaps rline should not call
setlocale()
when it is not supported?
rline needs to run setlocale
or wcwidth
will return nonsensical values for the widths of characters, and it becomes difficult to type, eg., Japanese. I may modify rline to go through the steps of saving the current locale before it calls setlocale
, and restoring it afterwards, which may resolve this.
There's something I don't understand: When rline.c explicitly sets $LC_ALL to C:
rline only does that on Windows. On other platforms, it calls setlocale
with ""
, as specified in the manpage:
If locale is an empty string, "", each part of the locale that should be modified is set according to the environment variables.
rline, of course, does not care about things like numeric formatting, and was originally written for a platform that had no locale support...
rline only does that on Windows
Ah yes, I misread the #ifndef
as #ifdef
. My fault.
I'm not sure I'm prepared to give up rline, so I have mitigated the problem by aliasing kuroko
to env LC_ALL=C.utf8 /usr/bin/kuroko $argv
in fish (my interactive shell).
Changing the title of this issue to reflect the underlying problem, so it can be more directly addressed.
Given that I want Kuroko to be usable in an embedded environment where someone else may have already been using setlocale
, fixing rline
to deal with locales in a re-entrant manner is only half of a fix... ideally, the float parsing needs to be in-housed. There's also the issue of printf
formatting. In both cases, there is a brute-force option available: determine the radix symbol and replace as needed. This still leaves many edge cases in both directions, though.
I have a locale-oblivious strtod
here: https://github.com/klange/toaruos/blob/master/libc/stdlib/strtod.c - unfortunately, this implementation is inaccurate, and crafting a good one is tricky due to floating point precision loss.
I don't have a complete in-house float printer - the one in ToaruOS is very incomplete.
I wonder if the issue could be mitigated for “everyone” in this way [until a more solid fix is in place]:
Do the Windows/non-Windows setlocale()
-thing as it is now, then (in some kind of pseudo-C):
if (get_decimal_separator() != '.') {
setlocale(LC_ALL, "C.UTF-8");
warn_user_that_locale_has_been_changed();
}
Possibly.
Checking the radix symbol in a portable manner is slightly tricky, but a quick hack might be to do if (strtod("0.5",NULL) != 0.5) setlocale(LC_ALL, "C.UTF-8");
, just in rline, and see where that gets us.
Not going to print a warning for now, especially since rline
does its locale setting every time a new prompt is shown, but see if this helps for you: https://github.com/kuroko-lang/kuroko/commit/0c8cb94b6acb52f4b9f00cda4f4b65bd3b02bf94
Thanks, that quick hack works great for me. :+1:
Is it really necessary to set the locale for each and every time rline()
has been called? Could “we” just set it upon first invocation and flag that the locale has been set? That way it's just a matter of checking a boolean variable.
Checking the radix symbol in a portable manner is slightly tricky
Could something like this be called portable (disclaimer, I don't speak C):
#include <locale.h>
#include <stdio.h>
int main() {
(void) setlocale(LC_ALL, "");
struct lconv *lc = localeconv();
(void) printf("Decimal separator: %s\n", lc->decimal_point);
return 0;
}
:paperclip: radix.c
E.g.:
#ifndef _WIN32
static int locale_has_been_initialized = 0;
if (locale_has_been_initialized == 0) {
(void) setlocale(LC_ALL, "");
struct lconv *lc = localeconv();
if (strcmp(lc->decimal_point, ".") != 0) {
(void) setlocale(LC_ALL, "C.UTF-8");
}
locale_has_been_initialized = 1;
}
#else
Come to think of it: kuroko
should not behave differently in a script and in the REPL. The path of least surprise is to setlocale()
at the top
the main()
function, before arguments are parsed and other routines are called.
Let me see if I can provide a sensible pull-request…
This misunderstands the relationship between rline
and Kuroko. rline
is not part of Kuroko, and it must independently ensure that it is operating in a viable locale to perform its function as an interactive line editor. What rline
should do is restore the original locale from before it was called when resuming to the caller.
Meanwhile, as long as Kuroko relies on the functionality of strtod
and other locale-dependent standard library functions, it should enforce the C
locale. Ideally, Kuroko should not rely on these things and use its own implementations. However, as the main roadblock here is strtod
and that is not a simple thing to implement well, this is not something that can be done in the short term.
Thanks, I get your point. Makes sense.
Meanwhile you could consider if the call to localeconv()
wouldn't be a more portable, and certainly less kludgy, way of make sure that the decimal point is a full stop.
My current thought for an improved kludge is to have rline
only set the LC_CTYPE
locale setting, and only if it determines the current locale does not think (wchar_t)0x3024
(あ) is 2 cells wide. Kuroko does not make use of functions that rely on the LC_CTYPE
locale, only LC_NUMERIC
.
Once Kuroko is free of locale-dependent functions on its own, the C locale functionality can be exposed which will allow the locale to be set when calling, eg., time.strftime
(which currently suffers from a similar issue where rline
's use of setlocale(LC_ALL, ...)
means the repl operates differently).
Looking through the codebase, functions which may be locale-dependent include:
strtod
, used in the compiler and string.__float__
, as discussed here.printf
-family functions with %.16g
to format floats (generally only in float.__repr__
, though also in the debug output printing to "safely" print float objects, but that does not matter).atoi
(used in the interactive debugger, and once to parse a command-line argument) and strtoul
(used once in the compiler to parse hex escapes in strings, and once in str.format
to parse argument position numbers in format strings), which may erroneously accept digit separators in some locales, though I don't think any of these cases is problematic and all locales are required to handle ASCII western Arabic numerals in the expected manner.You know your code infinitely better than I do, so let me ask you this naive question:
Would such a locale exist that considers HIRAGANA LETTER A (あ) “2 cells wide”, that would miscalculate e.g. LOTUS (🪷)?
I recently had a problem with the editor JED not being able to correctly edit a line containing the LOTUS character:
Perhaps HIRAGANA LETTER A will be enough to asses a locale, but I thought I'd mention it anyway.
This is a complicated question. In order for any program to correctly handle this, its own wcwidth
results must match the terminal's wcwidth
results (or whatever the terminal uses to determine the widths of displayed text - which might not even be wcwidth
!). If the terminal and the program disagree on locales, these results might differ. Worse yet, if the terminal and the program are running on completely different systems (eg., the program is running over ssh), then they may have completely different implementations of wcwidth
even with the same locale, using different databases. Emoji are a common stumbling point here: earlier databases will claim their width is 1, while newer ones may say 2, and if the two sides disagree then things can get weird.
I have made some progress towards implementing a proper in-house strtod
by making use of Kuroko's existing bigints: taking the decimal representation of the significant digits and dividing them by the appropriate power of 10, extracting enough bits to fill in a double
. This work will likely make its way into the implementation of long.__truediv__
, which currently cheats and may produce incorrect results as it converts both operands to doubles before dividing. It will take some more work to turn this into a strtod
implementation suitable for Kuroko's use cases, but the end result should be a robust and accurate replacement that does not depend on locale settings.
Thank you for your patience, it's highly appreciated. :pray:
The new long.__truediv__
implementation has been pushed, so at least now you can accurately write floats as the division of two large integers and that should always work regardless of any locale interventions. I'm not 100% certain the rounding behavior is correct, but it worked for some edge case examples, and supports subnormals. With this, a pure-Kuroko implementation of float parsing should be possible as well (though you'd need to pull in math
to return appropriate nan
and inf
values), and the final implementations of the compiler's parser and str.__float__
will be in C anyway).
I also deployed the described change to rline
, so the time
module should no longer behave differently in the repl and will always produce C
/English output (until we have a locale
module to manipulate the locale settings).
Great news, thanks! :pray:
I am pleased to note that in the in-progress in-house-float-conversions
branch (https://github.com/kuroko-lang/kuroko/tree/in-house-float-conversions), we now have:
long.__truediv__
, which means double-precision floats can be accurately represented as fractions or other integer-based expressions.long.__truediv__
support, and has no locale dependencies.e
exponents in float literals in the compiler.long.__rshift__
and long.__lshift__
.There is still some work to do to clean up and verify all of this before merging it.
That branch has been merged, and Kuroko no longer uses strtod
to parse floats, or printf
to convert them to strings.
Re.: commit f1d7bdaa5a85cbba913495deb8f22c29ef0bca3e
It seems
math.ceil(x)
andmath.floor(x)
returns the same value for a givenx