Are you sure you're using the same version of g++ on WSL versus your Linux machine?
Confirmed it happens with a consistent g++ version (6.1.1 20160510). The same ELF64 object is generated byte-for-byte on native and WSL. It doesn't even call any math library functions. Looking at the assembly, everything that calculates t0 is inline, and it only escapes to call denorm_min() and printf()/puts(), let alone leaving userspace for math. strace is the same on both sides. No idea.... 🤷
It's been a while since I looked at this sort of thing (pre-x86_64), but IIRC Windows and Linux have slightly different calling conventions regarding saving things like the FPU control registers during/after syscalls? Maybe relevant?
Thanks for reporting the issue. Just looked at this, and it is a difference in the default floating point control register (i.e. FPCSR / NPXCW) between Linux and NT. glibc assumes 0x37f but NT uses 0x27f. To confirm, I added the following lines to the sample above to set the control register and it passed:
#include <fpu_control.h>
...
unsigned short Cw = 0x37f;
_FPU_SETCW(Cw);
MxCsr is the same between Linux and NT, but I'll poke around to see if there are other control registers that are mismatched. I'll file a bug on our side to match the default configuration on Linux.
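For anyone who wants to check their own environment, here is a minimal self-contained sketch of that workaround (my own, untested on WSL; it assumes glibc's <fpu_control.h> on x86/x86-64) that prints the control word before and after forcing the Linux default:
#include <fpu_control.h>
#include <stdio.h>

int main(void)
{
    fpu_control_t cw;

    _FPU_GETCW(cw);
    printf("startup control word: %#x\n", cw);  /* 0x37f on Linux, 0x27f on WSL */

    /* Force the Linux/glibc default: all exceptions masked,
       round-to-nearest, 64-bit (extended) precision. */
    cw = 0x37f;
    _FPU_SETCW(cw);

    _FPU_GETCW(cw);
    printf("after _FPU_SETCW:     %#x\n", cw);
    return 0;
}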
FTHOI, I compiled the sample program in Linux, then copied the resulting executable to WSL. It produced different results: 1.00000000000000000e-05 under Linux but 1.00000000000000008e-05 under WSL. It's not g++ that's the problem.
Does WSL not support 80-bit floating point? Does WSL not use the FPU but only provide 64-bit software floating point math? FWLIW, awk under Linux produces the same output as Windows.
See the comment above yours. There is a difference between the floating point control register on Windows vs Linux. Steve is working on a fix.
@stehufntdev Thank you for the temporary solution (especially the magic number!!! :-)) and the ongoing work. With that fix, my (some other) computation executable now gives the same results under BashOnWindows as under Linux.
@therealkenc Thanks for the clarification, that's exactly what happened.
Just for completeness, Windows version: 10.0.14393. The test programs were produced by either GCC 4.9, libc 2.19 (Debian 8.5 (jessie)) or GCC 5.4, libc 2.23 (Debian testing (stretch)), with all libs statically linked.
Bumping this because it's a serious ABI issue and completely breaks any musl libc-linked binaries that use floating point scanf/strtod, among other things. It probably also breaks printf rounding of floating point values. We use long double internally for various reasons, including issues related to excess precision (on archs that have FLT_EVAL_METHOD>0), and the cost is irrelevant since the bulk of the code actually works with integers, but of course it badly breaks when run with an ABI-non-conforming control word setting or an emulator that fails to emulate x87 correctly.
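(For context on excess precision: FLT_EVAL_METHOD is the standard C99 knob here. On x87-only i386 targets it is 2, meaning plain float and double expressions are evaluated at long double range and precision, which is exactly how the control word leaks into ordinary double arithmetic. A quick check of what a given target does, nothing WSL-specific assumed:
#include <float.h>
#include <stdio.h>

int main(void)
{
    /* 0: evaluate in-type; 2: evaluate float/double at long double
       precision (typical for x87-only i386 targets) */
    printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
    return 0;
}
)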
@richfelker - Thanks for pinging this thread, this issue is still on our backlog. If possible could you describe a specific scenario that is not working because of this to help us prioritize this fix?
OK, I'll work out a specific strtod call that returns a wrong result with the x87 control word set incorrectly.
@benhillis, I think several tests in the glibc test suite (e.g., math/test-ldouble, stdlib/tst-strtod-round) fail due to this issue.
With the control word set wrong, strtod("4398046511105",0) returns 4398046511104. This represents the exact boundary I expected to break: values with more than 53-(64-53) = 42 bits in the significand.
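A sketch of how to reproduce that on a regular Linux box by deliberately setting the control word to the WSL value (assumes glibc's <fpu_control.h>; note the wrong result only shows up with a libc such as musl whose strtod does its arithmetic on the FPU -- glibc's strtod rounds with integer arithmetic, so it should be immune):
#include <fpu_control.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    fpu_control_t cw;

    /* 4398046511105 = 2^42 + 1 needs 43 significand bits, right past
       the 42-bit boundary described above. */
    printf("cw=0x37f: %.0f\n", strtod("4398046511105", 0));

    _FPU_GETCW(cw);
    cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;  /* mimic WSL's 0x27f */
    _FPU_SETCW(cw);

    printf("cw=0x27f: %.0f\n", strtod("4398046511105", 0));
    return 0;
}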
I still think this is going to have a NodeJS repro if someone in the small intersection of coders that do numerical computation stuff and ECMAScript stuff (not me) wants to take a crack at it. The interpreter (JIT compiled or otherwise) is going to make hard assumptions about the floating point mode. Might be motivational.
+1
We just tried to transfer one of our projects onto WSL (https://github.com/Illumina/strelka), and traced failing unit tests down to differences like this within boost::math::hypergeometric_distribution<>:
#include <iostream>
#include <iomanip>

int main()
{
    long double n(7.59070505394721872907e+286);
    long double d(2.00440157654530257755e+291);
    // linux n = 3.78701810194652734355e-05
    // WSL   n = 3.78701810194652746001e-05
    n /= d;
    std::cout.precision(21);
    std::cout << "n = " << n << "\n";
}
@benhillis, this issue is about to be two years old. Can we have some updates? Is there something we can do to help you guys fix it?
I just ran into this (or a similar) issue while performance testing two different ways to do a calculation in Node.js.
On WSL the calculation started producing an incorrect result after a few thousand iterations, whereas on a RHEL virtual machine it was always correct (at least through 1,000,000 iterations). The specific iteration where the first error occurred was consistent as long as the script was unchanged, but it would vary somewhat if small changes were introduced into the script.
Here is one such script:
var sqrtfive = Math.sqrt(5);
var phi = (1 + sqrtfive) / 2;
var fibc = function (n) { return Math.round((Math.pow(phi, n)) / sqrtfive); };
var output = 0;
var correct_output = fibc(75);
var num_errors = 0;
for (var x = 0; x < 100000; x++) {
    output = fibc(75);
    if (output != correct_output) {
        num_errors++;
        console.log('error at x = ' + x);
        if (num_errors > 25) {
            break;
        }
    }
}
console.log(correct_output);
console.log(output);
Running the above through node v8.11.3 prints, on my machine:
error at x = 11644
error at x = 15526
error at x = 15527
error at x = 15528
error at x = 15529
error at x = 15530
error at x = 15531
error at x = 15532
error at x = 15533
error at x = 15534
error at x = 15535
error at x = 15536
error at x = 15537
error at x = 15538
error at x = 15539
error at x = 15540
error at x = 15541
error at x = 15542
error at x = 15543
error at x = 15544
error at x = 15545
error at x = 15546
error at x = 15547
error at x = 15548
error at x = 15549
error at x = 15550
2111485077978050
2111485077978055
(For the record, 2111485077978050 is the correct answer for Fib(75).)
Ping? Back in October I had indication out-of-band that a fix for this was coming, but I haven't seen anything about it since.
It looks like a pretty old ticket, is there any plan to resolve it? I imagine the fix is very straightforward: just initialize the right word in the process control block/context of the first Linux process.
Ok, this problem is more complicated (and more serious) than I thought -- the x87 control word is NOT inherited by a child process across fork(), which means there is no easy user space workaround of writing a wrapper program that properly initializes the control word and then forks and execs. I believe this inheritance is required for POSIX compliance.
% cat t.c
#include <fenv.h>
#include <math.h>
#include <pthread.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <ucontext.h>
#include <unistd.h>
#include <xmmintrin.h>

static unsigned short _mm_getfcw()
{
    unsigned short word;
    asm volatile("fstcw %0" : "=m" (*(&word)));
    return word;
}

static void _mm_setfcw(unsigned short word)
{
    asm volatile("fldcw %0" : : "m" (*(&word)));
}

__attribute__((constructor))
static void _mm_fpuinit()
{
    asm volatile("finit");
}

static void report(const char *name)
{
    printf("%s\tfcw %x csr %x\n", name, _mm_getfcw(), _mm_getcsr());
}

static jmp_buf state;

static void sigfpe(int sig, siginfo_t *info, void *uap)
{
    ucontext_t *uc = uap;
    fpregset_t fp = uc->uc_mcontext.fpregs;
    if (!fp) {
        printf("sigfpe\tno fpu context\n");
    } else {
        printf("sigfpe\tfcw %x csr %x\n", fp->cwd, fp->mxcsr);
    }
    longjmp(state, 1);
}

int main(int argc, char **argv)
{
    report("initial");
    fesetround(FE_TOWARDZERO);
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    report("parent");
    {
        pthread_t thr;
        pthread_create(&thr, 0, (void *(*)(void*))report, "thread");
        pthread_join(thr, 0);
    }
    if (fork() == 0) {
        report("child");
        if (setjmp(state) == 0) {
            long double x = 0;
            struct sigaction sa;
            memset(&sa, 0, sizeof sa);
            sa.sa_sigaction = sigfpe;
            sa.sa_flags = SA_SIGINFO;
            sigaction(SIGFPE, &sa, 0);
            feenableexcept(FE_DIVBYZERO);
            x = 1 / x;
            return (int)x;
        }
    } else {
        int status;
        wait(&status);
        if (WIFSIGNALED(status)) {
            printf("child killed by signal %d\n",
                WTERMSIG(status));
        } else if (WIFEXITED(status)) {
            printf("child exited with status %d\n",
                WEXITSTATUS(status));
        }
    }
    return 0;
}
% cc -o t t.c -lm -lpthread
native% ./t
initial fcw 37f csr 1f80
parent fcw f7f csr ff80
thread fcw f7f csr ff80
child fcw f7f csr ff80
sigfpe fcw f7b csr fd80
child exited with status 0
wsl% ./t
initial fcw 37f csr 1f80
parent fcw f7f csr ff80
thread fcw 27f csr 1f80
child fcw 27f csr 1f80
child exited with status 142
It is also confirmed that the control word is NOT inherited by a child (POSIX) thread. And the problem is not limited to the x87 control word; it equally affects the SSE control and status register (MXCSR).
Many manifestations of the same problem were reported earlier in this ticket, let me try to summarize here the full impact, based on what I have gathered so far:

- long double, which gcc maps into 80-bit extended precision floating point on x86, won't work as expected
- fenv(3) settings won't be passed down to child processes or threads; changes to rounding mode or exception masks will have to be set in each individual thread an application creates
- the same goes for the SSE flush to zero mode used to speed up the denormal handling in SIMD code

Update:
There also seems to be a problem with SIGFPE generation: division by zero is expected to raise a SIGFPE signal, but instead the process is terminated with some arbitrary exit status. My original intention was to see what x87 control word value would be saved into a sigcontext, as a way to probe if/how the kernel tracks the register. This behavior is the same if SSE is used instead of x87 (just replace the long double with double in the test code).
which means there is no easy user space workaround by writing a wrapper program to properly initialize the control word then fork and exec
You could LD_PRELOAD an interposer for __libc_start_main(). Something like:
#include <dlfcn.h>
#include <fpu_control.h>

template<typename Tfn>
auto dlfn(const char *name)
{
    return reinterpret_cast<Tfn>(dlsym(RTLD_NEXT, name));
}

extern "C"
int __libc_start_main(int (*main)(int, char **, char **), int argc, char **ubp_av,
                      void (*init)(void), void (*fini)(void),
                      void (*rtld_fini)(void), void *stack_end)
{
    auto start = dlfn<decltype(&::__libc_start_main)>("__libc_start_main");
    unsigned short cw = 0x37f;
    _FPU_SETCW(cw);
    return start(main, argc, ubp_av, init, fini, rtld_fini, stack_end);
}
Utterly untested, that is off the cuff.
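For reference, building and using it would go something like this (the file name shim.cpp is hypothetical; the deduced return type on dlfn() needs C++14):
% g++ -std=c++14 -shared -fPIC -o shim.so shim.cpp -ldl
% LD_PRELOAD=./shim.so ./a.out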
I believe this inheritance is required for POSIX compliance.
~~I believe you made that up~~ I believe I've been corrected. But it's still pretty disappointing the incompatibility hasn't been addressed yet.
@therealkenc Yes, it is possible to initialize the control word in the main process through LD_PRELOAD, but it will not be passed down to any child process or, more seriously, any child thread, which renders it useless in most applications.
I didn't make it up about POSIX compliance: while there could be some latitude regarding execve(2), there is no ambiguity about fork(2). See the spec about fork, and I quote,
The fork() function shall create a new process. The new process (child process) shall be an exact copy of the calling process (parent process) except as detailed below:
Here "exact copy" is to be interpreted as all machine specific/dependent states -- I know this because I am an OS developer.
This means I also understand the fix of this problem is non-trivial. The symptom indicates the x87 control word is not maintained as part of the Linux process context; it's only in the NT process context. What could further complicate this effort is that the x87 coprocessor context is usually handled lazily during a context switch: the FPU is initially marked as not present, and any FPU instruction then triggers an FPU-unavailable trap, at which time the FPU context (including the control word) is restored. It is very likely the current Linux subsystem module won't even see this trap, and some changes to the NT microkernel proper may have to be made.
I didn't make it up about POSIX compliance, whereas there could be some latitude regarding execve(2), there is no ambiguity about fork(2)
Okay, conceded. I yield.
This means I also understand the fix of this problem is non-trivial.
Probably, using the "or it would have been" rule of thumb.
I would volunteer to fix it were the code open source...
You got me thinking for a few minutes whether it could be done if we ever get PTRACE_SYSEMU. I think it might... or at least, could -- as long as one is willing to disallow perfect from being the enemy of good. I've got some unrelated code in the can that is shimming fork(), for what it is worth. Looks like (pasting):
// this gets called eventually via an LD_PRELOAD of libc fork()
int WslShim::fork()
{
    //LOG(3) << "entering WslShim::fork";
    int pid;
    if (state_) {
        state_->rpc_->notify_fork(rpc::fork_prepare);
        pid = libc_.fork_();
        if (pid == 0) {
            //LOG(3) << "WslShim::fork() child";
            state_->rpc_->notify_fork(rpc::fork_child);
        } else {
            //LOG(3) << "WslShim::fork() parent";
            state_->rpc_->notify_fork(rpc::fork_parent);
        }
    } else {
        //LOG(3) << "WslShim::fork() called with no state";
        pid = libc_.fork_();
    }
    return pid;
}
One could stuff a _FPU_SETCW() in there. The same can be done for (say) pthread_create(), ad nauseam. That doesn't help you with a raw clone() syscall, mind. But if we were allowed to trap it. Maybe....
Or maybe not. It's hard to differentiate open issues that fall into the category of "haven't got to it" versus "can't be done (perfectly) without disturbing the core NT team people" versus "could do it but effort". I could cite examples of all three categories on the open issue pile.
If I had to guess, your "some changes to the NT microkernel proper may have to be made" assessment is probably correct. It's actually pretty rare for issues to get marked with a big yellow 'backlog' tag.
@therealkenc I just raised the bar for you :) with the mxcsr register, and you will also have to hook pthread_create(), which I believe is more important given the prevalence of multithreaded applications.
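A sketch of what such a pthread_create() hook could look like (untested and off the cuff, in the spirit of the shim above; the trampoline names are made up, and it fixes only the x87 word -- the same spot could also reset mxcsr via _mm_setcsr()):
// build: gcc -shared -fPIC -o pthread_shim.so pthread_shim.c -ldl -lpthread
#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <fpu_control.h>
#include <pthread.h>
#include <stdlib.h>

typedef int (*pthread_create_fn)(pthread_t *, const pthread_attr_t *,
                                 void *(*)(void *), void *);

struct trampoline_arg {
    void *(*start)(void *);
    void *arg;
};

/* Runs in the new thread: restore the Linux default control word
   (0x37f), then chain to the application's real start routine. */
static void *trampoline(void *p)
{
    struct trampoline_arg a = *(struct trampoline_arg *)p;
    free(p);
    fpu_control_t cw = _FPU_DEFAULT;
    _FPU_SETCW(cw);
    return a.start(a.arg);
}

int pthread_create(pthread_t *thr, const pthread_attr_t *attr,
                   void *(*start)(void *), void *arg)
{
    static pthread_create_fn real;
    if (!real)
        real = (pthread_create_fn)dlsym(RTLD_NEXT, "pthread_create");

    struct trampoline_arg *a = malloc(sizeof *a);
    if (!a)
        return EAGAIN;
    a->start = start;
    a->arg = arg;
    return real(thr, attr, trampoline, a);
}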
It also got me wondering what other (less often used) pieces of process context are not properly maintained. For example, the debug registers, which are required for the proper functioning of gdb, or the even more obscure I/O permission map -- does NT even allow the Linux module this level of access?
"User-space workarounds" are a non-starter because WSL is supposed to be able to run arbitrary Linux binaries, including static-linked ones. You can't hook functions by name because the name might not even exist; pthread_create or fork could have been LTO'd into main for all you know. This really needs fixing on the WSL side: SYS_clone needs to correctly clone the full register file including floating point control and status registers, and SYS_execve needs to setup the initial register file in a manner than conforms with the ABI (x87 set to full precision).
The latter (right state at initial entry/execve) is an utterly trivial one-line fix somewhere; it's ridiculous that this has been unfixed for nearly three years, and makes WSL unsafe for lots of serious production use. In particular any binaries/distros built on musl libc will have dangerously wrong behavior from strtod, etc. - things like converting an exactly representable integer to the wrong value.
I'm not sure if the SYS_clone (fork/pthread_create) side is quite as easy - it might be 5 or 10 lines instead of 1. But I think that's well within the range of what could be done in 3 years...
@richfelker - We are a small team trying to catch up with hundreds of Linux kernel developers that have a 20 year head start. @therealkenc is correct that this change (I hesitate to call it a fix, because the issue is usermode libraries using assumed values) has greater implications than you assume for the underlying Windows OS around saving / restoring the control register so Windows apps are not broken.
I know vinegar is more available on the internet, but flies prefer honey.
It absolutely is a fix, not "a change", because the ABI is that the x87 control word is extended precision at entry. The usermode apps/libraries in question are Linux apps/libraries, not Windows ones, which correctly assume the specified ABI, not the mismatched ABI which WSL has been providing. I don't see any plausible argument that code written for and tested on systems with the right ABI somehow expects the wrong ABI that WSL has been providing.
I understand that you're a small team and that shouting at FOSS maintainers about their bugs is not nice (being one myself and all), but this is not a community FOSS project, it's a major Windows feature Microsoft is promoting as giving you the ability to (safely, one would hope!) run Linux apps/distros/containers on Windows, and it doesn't do that, because of a bug that could be fixed with a trivial patch. I've been told multiple times that people were working on this and that we could expect a fix soon, and it's been neglected again and again.
Your assumption the fix is trivial is incorrect.
For the initial-entry case (ignoring clone for now), can you clarify why just setting the x87 control word right before jumping to the ELF entry point address does not suffice?
@benhillis As a kernel developer myself, I'd put the responsibility squarely on the kernel side of the court. This is much more serious than "usermode libraries using assumed values"; the most damning part is the non-compliance of fork() and pthread_create() with respect to inheritance of certain execution context. Maybe underestimating the seriousness of this bug is why this issue got less attention than it deserves?
@richfelker The fix however could be more involved than you might have imagined. It depends on the internal interface between the NT microkernel and the POSIX subsystem, and also the interaction with the Windows subsystem. It could be as simple as POSIX reaching down and modifying the NT thread context, or it may have to make a kernel IPC call instead (if NT is anything like Mach) -- for all I know, the POSIX subsystem may not even be running in ring 0.
@luoqi-git - Correct, this requires some fairly substantial changes to the Windows kernel itself.
Admittedly there's code I can't see, but conceptually I don't see how this could require any invasive changes. Assuming Windows already lets processes modify the x87 control word and preserves it across task switch (which afaik it does; I've seen this before in awful graphics lib stuff that broke things by setting it to single precision!), just setting it right before jumping to the entry point should solve the problem at entry. This is purely a userspace operation. The fork/clone part is harder for me to make predictions about without being able to see source.
I say it is time for Microsoft to open up the NT source code, at least the microkernel part, just like what Apple (and Sun Microsystems) has done. There will be more benefits to Microsoft (a shareholder I am :-)) than any potential downside; for one, I'd be very willing to help out on a fix to this problem.
Now it even owns the proper forum for the code release...
@richfelker Let me give you an example of why this is more complicated than you think. Saving and restoring FPU/SSE context is quite expensive, much more than for the integer register file, which is why all OSes that I know of perform the save/restore at context switch only if a process has actually used the FPU. Always modifying the FPU control register from userspace would completely nullify the optimization; that's something no decent OS developer would ever do (but a desperate application developer might).
I'm quite aware of that, but the "no decent OS developer" comment is rather exaggerated. These days many operating systems do not make this optimization for various reasons; it's nowhere near as valuable as it was a couple decades ago. In any case you could make the fix that way for now, and optimize later if it turns out the performance hit is a problem, rather than leaving it broken for the sake of performance.
FWIW the issue you cited is exactly why we're not going to workaround the WSL bug in musl; doing so would make all binaries generated with the workaround unconditionally poke the fpu when they shouldn't have to, and the cost would not go away when WSL gets fixed because it would be linked into the binaries rather than in OS-level code that will go away once it's fixed right.
@richfelker Brother, I feel your pain. I am here not because I'm a kernel developer, but because an application my team developed was hit by this bug. I would very much like to see this problem fixed and am offering any help I can render. At this point I may be desperate enough to insert an finit instruction at the beginning of any thread our app creates.
FWIW the issue you cited is exactly why we're not going to workaround the WSL bug in musl; doing so would make all binaries generated with the workaround unconditionally poke the fpu when they shouldn't have to
I suspect, but can't prove, a variation of this argument is being applied by the WSL devs too. It could be ~~fixed~~ ~~hacked~~ worked-around in the WSL driver, but a "proper" fix is an upstream problem. So here we are.
Which is good. Everyone is in agreement that without a perfect fix all hope is lost, and hopefully this thread can go quiet again.
@therealkenc, I don't see them as equivalent arguments at all. One is about a suboptimal fix at the level of a component that ships with OS updates and never gets linked into applications, so that, if it's costly, it can go away in a future OS update when a better fix is made. The other is about hard-coding a workaround into application binaries in a way that affects running them on systems not affected by the bug, including real Linux systems and the expected post-fix WSL systems at some point in the future.
Although you want the same output between native Linux and WSL, I think double / long double do not have an explicit size in the C++ standard, so this is just like a Linux distro running on different hardware and therefore getting different results. Maybe you need a quick workaround rather than a substantial change in the kernel. It would be better if WSL were compatible, though.
@po-wei It's true that the C/C++ language standards don't specify the exact precision of long double. They leave that up to the platform ABI. And the standard ABI for Linux x86 and x86-64 says that long double has 80-bit precision, and that the x87 FPU control word has to be initialized the right way.
Other platforms are free to implement C/C++ with different ABIs, but WSL is trying to implement the Linux x86/x86-64 ABI, so that's the relevant standard. This is definitely a bug in WSL.
Note that several of the people posting here are well-known experts who are involved in making C/C++ work on Linux in the first place. They know what they're talking about :-)
@njsmith Thanks for your explanation!
@Po-wei In some ways, it does break the C++ standard library (besides the C library breakage demonstrated previously). Specifically, it breaks std::numeric_limits<long double> in the header <limits>.
For example, in WSL, std::numeric_limits<long double>::epsilon() tells me (approximately) 1.0842e-19. However, if you write a loop to actually measure the machine epsilon of long double, the result is 2.22045e-16:
// g++ -std=c++11 test_ld.cpp && ./a.out
#include <iostream>
#include <limits>

int main()
{
    typedef std::numeric_limits<long double> TT;
    std::cout << "   mantissa bits : " << TT::digits << "\n";    // =64
    std::cout << "         epsilon : " << TT::epsilon() << "\n"; // =1.0842e-19
    long double t = 1.0;
    while (1 + t/2 != 1) {
        t = t/2;
    }
    std::cout << "measured epsilon : " << t << "\n"; // WSL: =2.22045e-16
}
So even standard-conforming code (that doesn't assume the accuracy of long double) could be fooled by std::numeric_limits<long double>, which is supposed to be the standard way to extract numerical type information. The C++ standard doesn't mention the floating point control word.
Why has this problem never been fixed? Why is it not mentioned in the WSL FAQ? It has caused me a lot of trouble.
I assume this is one of the advantages of running a real Linux kernel under virtualization (which is what WSL2 does) rather than having this aspect of syscalls be handled by the NT kernel directly.
Unless WSL1 is entirely abandoned in favor of WSL2, I think this still needs to be fixed. Maybe now that WSL2 is out there and performance-sensitive users are going to be switching to it, the simple fix for WSL1 will be less controversial...?
Ping. People are still hitting this. See the new reference above and the post on the musl list: https://www.openwall.com/lists/musl/2019/09/25/16
Ping. Is WSL1 still a thing, and if so, can this please be fixed?
Ping. A user just hit this again today. So apparently people still are using WSL1. Being that this would literally be a one-line fix, is there any reason it can't be done??
This program
outputs (Linux, expected result):
1.00000000000000000e-05
But in BashOnWindows, it outputs:
1.00000000000000008e-05
The long double data type in C/C++ under Linux, using GCC for the x86/x86-64 platform, has an 80-bit format with a 64-bit mantissa. When a program using long double runs in BashOnWindows, it behaves as if it had a 53-bit mantissa, just like the usual "double" (but with almost the same exponent range as long double).
This harms functions like std::pow(long double, int) and hence boost::lexical_cast<double>(char*) (boost version 1.55), since internally it uses long double as an intermediate result (boost version 1.60 fixes this problem); therefore boost::program_options can also read double-typed options inaccurately (even for exact input, e.g. 0.03125 = 1/32).
Likewise, this program:
Should output:
Good.
While in BashOnWindows, it outputs:
What? t0 = 7.46536864129530799e-4948 != 3.64519953188247460e-4951