EButlerIV opened this issue 6 years ago
@hvr this seems like a pretty gross bug, also might lurk in base too?
@hvr
seems like the max file chunk is the max of Int32, because `bigBS <- pure $ BS.replicate (2^31 - 1) 1` is the largest size where the write doesn't fail
I'm currently investigating the status of largefile support on OSX...
My conclusion so far is that it's definitely OSX being at fault here, as it doesn't appear to conform to expectations for a POSIX system:
The `write(2)` system call is defined at http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html
Even though `fpathconf(fd, _PC_FILESIZEBITS)` returns 64 (and `ssize_t`/`size_t` is 64 bits wide), `write(fd, buf, 2147483648)` returns -1 with `errno = EINVAL`. Issuing `write(fd, buf, 2147483647)` succeeds. Issuing `write(fd, buf, 1073741824)` twice succeeds as well (resulting in the full 2 GiB being written to the file).
If `write(2)` weren't able to write more than 2 GiB, it would be more appropriate to report a short write of fewer than 2 GiB rather than to report `EINVAL` (which is reserved for real file offset over/underflows).
Consequently, `write(2)` on OSX seems to behave as if `size_t` were 32 bits wide even though it's 64 bits wide. I consider this a bug in OSX.
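For reference, the same probe can be made from Haskell with the `unix` package (a minimal sketch; writing to stdout and the exact byte counts are just for illustration):

```haskell
import Foreign.Marshal.Alloc (mallocBytes)
import System.Posix.IO (stdOutput, fdWriteBuf)

main :: IO ()
main = do
  -- 2 GiB: on an affected OSX this fdWriteBuf call throws "invalid argument" (EINVAL);
  -- with 2^31 - 1 bytes it succeeds and returns 2147483647.
  let n = 2 ^ (31 :: Int)
  buf <- mallocBytes n                  -- buffer contents are irrelevant for the probe
  written <- fdWriteBuf stdOutput buf (fromIntegral n)
  print written
```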
either way, it does result in unexpected behavior for write invocations on OSX :(
Problem is that while we can add some hacks to `unix` and `base` etc. to paper over this bug by having the respective `write(2)` invocations truncate/clamp their arguments to 2·10^9-1 bytes iff on OSX, it's easy to miss some `write(2)` call-sites, and we obviously can't catch *all* `write` invocations, as I'm sure there's enough code out there that FFI-calls `write(2)` directly, since it's such an old and well-defined POSIX function that exists essentially everywhere...
Also, once Apple fixes this in their operating system, how do we detect that we don't need this workaround anymore? Writing an Autoconf test which tries to write a 2 GiB data buffer to some fd sounds like a bad thing to do... or shall this be done at runtime, by retrying with a clamped buffer size whenever we get an EINVAL for a bufsize >= 2 GiB? I.e.
```c
#include <unistd.h>
#include <errno.h>

ssize_t fake_write(int fd, const void *buf, size_t count)
{
    ssize_t written = write(fd, buf, count);
#if defined(__APPLE__)
    /* Work around OSX rejecting writes of >= 2 GiB with EINVAL:
       retry once with the count clamped to 2^31 - 1 bytes. */
    if (written < 0 && errno == EINVAL && count >= 2147483648ULL)
        written = write(fd, buf, 2147483647ULL);
#endif
    return written;
}
```
Honestly, that dynamic workaround might be the best way to go, though documenting why we have that workaround would probably be a good idea (so a hypothetical future maintainer can drop it).
The scary/sad part is that the same issue applies to the `read(2)` system call too, AFAIK! (though I've not tested that one yet)
I do agree that there's a bit of a damned-if-we-do, damned-if-we-don't situation in how to resolve this :/
I cannot reproduce this on macOS 11.3. I'll close this issue in a week or two, unless someone can confirm that the problem still exists.
Have you tested on 10.14 or 10.15? I think a lotta people are still on those
I have only one mac laptop :)
If the issue has been fixed at the platform level, I have little interest in patching it at the package level. We cannot be held responsible for upstream bugs. I'm nevertheless open to accepting a patch from an affected individual who is stuck on an older platform, but TBH I do not see it happening. It did not happen in three years.
True ;)
I'll try to check if it happens on my current computers. It's def the case that this package has recently gotten much love.
Hrm. Perhaps adding a comment noting that larger-than-2GB writes on older OS X might fail is the right middle ground? Idk.
It looks like the bug was fixed between https://opensource.apple.com/source/Libc/Libc-825.40.1/stdio/FreeBSD/fwrite.c.auto.html and https://opensource.apple.com/source/Libc/Libc-997.1.1/stdio/FreeBSD/fwrite.c.auto.html; there are no further changes to this file after that.
However, Libc-825 corresponds to OS X 10.8.4, which reached end of life in 2015, while the Python issues are reported for OS X 10.10 and 10.11. How can that be? There is certainly no issue on OS X 11.3.
CC @vdukhovni
This is awesome sleuthing! And that discrepancy is certainly concerning
I do agree that at a certain level we should fix Haskell code that's doing stupidly large write chunks (which would make this a hypothetical non-issue), or at least characterize/understand the breakage, even if it's ultimately just a warning note...
On a Big Sur 11.3.1 (latest) MacOS system, I can reproduce the reported issue with the `write(2)` system call, but interestingly it does not apply to `writev(2)`:
```c
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <sys/uio.h>
#include <inttypes.h>
#include <err.h>

#define NBYTES ((1ULL << 31) - 1)

int main(int argc, char *argv[])
{
    /* Byte count from argv[1], defaulting to 2^31 - 1. */
    ssize_t nb = (size_t) (argv[1] ? strtoimax(argv[1], NULL, 0) : NBYTES);
    ssize_t n;
    char *buf = malloc(nb);
    struct iovec v;

    v.iov_base = buf;
    v.iov_len = (size_t)nb;
    /* Default: writev(2); pass any second argument to use write(2) instead. */
    n = (argc < 3 || argv[2] == NULL) ? writev(1, &v, 1) : write(1, buf, nb);
    if (n != nb)
        err(1, "write");
    return 0;
}
```
Test runs show:
```
./wr $(( 2**31 )) write | wc -c
wr: write: Invalid argument
0
```
but with `writev` I get correct POSIX behaviour:
```
./wr $(( 2**32 )) | wc -c
4294967296
```
So, putting aside questions of what happens when users directly use FFI calls to `write(2)`, we can opt to always use `writev(2)` on MacOS and avoid the issue.
Which is not to say that applications should typically call `write(2)` with enormous buffers in one go. One would typically loop, writing something between 4K and 1MB at a time (usually one of 4K, 8K, 16K or 32K). This also helps with "unsafe" calls blocking capabilities in the RTS for unreasonably long. Attempts to write 2GB or more in one go are probably not a good idea.
I ran `sudo dtruss ...` to check whether the failure was in libc or at the system call layer, and it sure seems like the issue is in the kernel system-call interface, not the C library:
```
$ sudo dtruss ./wr $(( 2**31 )) write > /dev/null
open("/dev/dtracehelper\0", 0x2, 0x0) = 3 0
...
sysctlbyname(kern.osvariant_status, 0x15, 0x7FFEE4E50F80, 0x7FFEE4E50F78, 0x0) = 0 0
csops(0x85A8, 0x0, 0x7FFEE4E50FB4) = 0 0
write(0x1, "\0", 0x80000000) = -1 22
...
```
So fixes in the C library would not be sufficient; the kernel needs to support large `write(2)` calls.
It seems that perhaps testing with output to a pipe can be misleading; the same test seems to fail when the output is /dev/null or a file... :-( So `writev(2)` does not look like a viable workaround...
Even splitting the write into multiple iovecs of at most 2^30 bytes each fails once the total size exceeds 2^31 - 1. So the only workaround is an explicit loop doing multiple write calls... Applications flushing very large buffers in a single write call are in a state of sin...
@vdukhovni thanks for the investigation. I'm confused that you can reproduce the issue by calling `write` directly, while the Haskell reproducer above works fine on my machine. Does the Haskell snippet pass in your environment?
Yeah, I agree that writing gigabytes in one call is not quite sensible anyway. It's easy to switch `hPut` to loop over smaller chunks, but on the other hand at least some users (see #259) expect `hPut` to be an atomic operation.
Atomicity of large writes is fragile. Without `O_APPEND`, two processes might write at overlapping file offsets if they opened the file independently. Secondly, Linux prior to 3.14 also failed to be atomic:
```
BUGS
   According to POSIX.1-2008/SUSv4 Section XSI 2.9.7 ("Thread
   Interactions with Regular File Operations"):

       All of the following functions shall be atomic with respect to
       each other in the effects specified in POSIX.1-2008 when they
       operate on regular files or symbolic links: ...

   Among the APIs subsequently listed are write() and writev(2).  And
   among the effects that should be atomic across threads (and processes)
   are updates of the file offset.  However, on Linux before version
   3.14, this was not the case: if two processes that share an open file
   description (see open(2)) perform a write() (or writev(2)) at the same
   time, then the I/O operations were not atomic with respect to updating
   the file offset, with the result that the blocks of data output by the
   two processes might (incorrectly) overlap.  This problem was fixed in
   Linux 3.14.
```
Then of course there's also NFS, which may not provide the same semantics when the race is between multiple clients...
As for reproducing the issue in Haskell, I see the same problem:
```haskell
import System.IO
import qualified Data.ByteString as BS

main :: IO ()
main = do
  hSetBinaryMode stdout True
  bigBS <- pure $ BS.replicate (2^32) 1
  BS.hPut stdout bigBS
```
which compiled and run yields:
```
$ ./foo >/dev/null
foo: <stdout>: hPutBuf: invalid argument (Invalid argument)
```
All right, let's loop with smaller chunks. Should these chunks be as big as possible (2^31-1 bytes) to retain as much atomicity as possible, or is it better to make them reasonably sized at 64k or similar?
I'd go with 1MB, which should comfortably exceed most block sizes in ZFS or flash storage, avoiding partial block writes.
Which means 2048 writes for 2GB, but that's not unreasonable. If the loop is in the Haskell code calling into the FFI, then we even avoid tying up RTS capabilities for a long time with long-duration `unsafe` calls.
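A minimal sketch of what that chunked loop could look like at the `ByteString` level (the helper name `hPutChunked` and the exact constant are illustrative, not the actual patch):

```haskell
import System.IO (Handle)
import qualified Data.ByteString as BS

-- Illustrative 1 MiB chunk size, per the discussion above.
chunkSize :: Int
chunkSize = 1024 * 1024

-- Write a strict ByteString in chunks, keeping each underlying write(2)
-- far below the 2 GiB size that trips OSX.
hPutChunked :: Handle -> BS.ByteString -> IO ()
hPutChunked h bs
  | BS.null bs = pure ()
  | otherwise  = do
      let (front, rest) = BS.splitAt chunkSize bs
      BS.hPut h front
      hPutChunked h rest
```

Since `BS.splitAt` on a strict `ByteString` is O(1) slicing, the loop adds no copying; it just issues many moderate-sized writes instead of one huge one.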
> As for reproducing the issue in Haskell, I see the same problem
Surprisingly, I still cannot reproduce it.
Yep, splitting `hPut` writes into 1M chunks sounds good to me.
> Surprisingly, I still cannot reproduce it.
What version of MacOS (Darwin) are you testing on, and what CPU architecture? Please also test with the below dtrace(1) script:
```d
#pragma D option bufsize=8k
#pragma D option bufpolicy=fill

syscall::write:entry
/pid == $target/
{
    printf("write bytes = %d, ", arg2);
}

syscall::write:return
/pid == $target/
{
    printf("ret = %d\n", arg0);
}
```
which I've saved as `foo.d`, and then executed the compiled Haskell code:
```haskell
{-# LANGUAGE LambdaCase #-}
import System.IO
import System.Environment
import qualified Data.ByteString as BS

main :: IO ()
main = do
  n <- getArgs >>= \case
    [] -> return $ 2^32
    arg:_ -> readIO arg
  hSetBinaryMode stdout True
  let bs = BS.replicate n 0x38
  BS.hPut stdout bs
```
saved as `foo.hs`, by running (as `root`, since `dtrace(1)` requires privileges on MacOS):
```
# dtrace -o /dev/stderr -q -s foo.d -c "./foo" >/dev/null
dtrace: system integrity protection is on, some features will not be available
foo: <stdout>: hPutBuf: invalid argument (Invalid argument)
write bytes = 4294967296, ret = -1
```
Please report the results for `-c ./foo` (default 2^32 size) and also `-c "./foo 8192"` for comparison.
This is getting interesting! Thanks for helping distill the issue so wonderfully, both of you! (I'm now really curious :) )
Unrelatedly: most array batch primops in GHC should do chunking like that too, to make sure long-running primops don't block the GC! (This has been known for a while, but has never been prioritized.)
FWIW, a slightly enhanced `dtrace(1)` script can also return the `errno` value, confirming the `EINVAL`:
```d
#pragma D option bufsize=8k
#pragma D option bufpolicy=fill

syscall::write:entry
/pid == $target/
{
    printf("write bytes = %d, ", arg2);
}

syscall::write:return
/pid == $target/
{
    printf("ret = %d %d\n", arg0, errno);
}
```
Which is 22 on MacOS:
```
$ perl -le 'use Errno qw(:POSIX); $! = EINVAL; printf "%d %s\n", $!, $!;'
22 Invalid argument
```
and so I get:
```
# dtrace -o /dev/stderr -q -s foo.d -c "./foo $(( 2 ** 31 ))" >/dev/null
dtrace: system integrity protection is on, some features will not be available
foo: <stdout>: hPutBuf: invalid argument (Invalid argument)
write bytes = 2147483648, ret = -1 22
```
@vdukhovni I'm running macOS 11.3 on an Intel i5 CPU.
```
$ sudo dtrace -o /dev/stderr -q -s foo.d -c "./foo 8192" >/dev/null
dtrace: system integrity protection is on, some features will not be available
write bytes = 8192, ret = 8192

$ sudo dtrace -o /dev/stderr -q -s foo.d -c "./foo" >/dev/null
dtrace: system integrity protection is on, some features will not be available
write bytes = 2147479552, ret = 2147479552
write bytes = 2147479552, ret = 2147479552
write bytes = 8192, ret = 8192
```
This shows your "foo" program splitting the write into separate writes of 2^31 - 4096 (twice) and 8192 bytes. Are you running a different `ByteString` version that does that?
No, I'm running a stock version of `bytestring`. It looks to me like 2147479552 = 2^31 - 4096 corresponds to `MAXWRITE`, and the traced calls correspond to a loop over `__sfvwrite` in https://opensource.apple.com/source/Libc/Libc-997.1.1/stdio/FreeBSD/fwrite.c.auto.html, but this is a very uneducated guess.
But `fwrite(3)` is at the `<stdio.h>` layer; it is not the underlying system call. It is not clear how that affects Haskell's `ByteString` using `write(2)`.
Writing more than ~2GB at once to a file appears to fail on OSX. This may be related to a platform issue also encountered in this python ticket: https://bugs.python.org/issue24658
Repro:
The following code fails on OSX, emitting an exception:
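A minimal sketch of the kind of code in question (assuming `BS.writeFile` to a file named `testFile`, per the exception below, and a 2^32-byte buffer as in the later reproducers):

```haskell
import qualified Data.ByteString as BS

main :: IO ()
main = BS.writeFile "testFile" (BS.replicate (2 ^ 32) 1)  -- a ~4 GiB strict ByteString
```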
```
*** Exception: testFile: hPutBuf: invalid argument (Invalid argument)
```
Alt approaches:
- Manually creating a file handle and using `hPut` directly makes no difference
- Writing a lazy bytestring also makes no difference