EButlerIV opened this issue 6 years ago
@hvr this seems like a pretty gross bug, also might lurk in base too?
@hvr
seems like the max file chunk is the max of Int32, because `bigBS <- pure $ BS.replicate (2^31 - 1) 1` is the largest size where the write doesn't fail
I'm currently investigating the status of largefile support on OSX...
My conclusion so far is that it's definitely OSX being at fault here, as it doesn't appear to conform to expectations for a POSIX system:
The `write(2)` system call is defined at http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html
Even though `fpathconf(fd, _PC_FILESIZEBITS)` returns 64 (and `ssize_t`/`size_t` is 64 bits wide), `write(fd, buf, 2147483648)` returns -1 with `errno = EINVAL`. Issuing `write(fd, buf, 2147483647)` succeeds. Issuing `write(fd, buf, 1073741824)` twice succeeds as well (resulting in the full 2 GiB being written to the file).
If `write(2)` weren't able to write more than 2 GiB, it would be more appropriate to report a short write of fewer than 2 GiB rather than to report `EINVAL` (which is reserved for real file offset over/underflows).
Consequently, `write(2)` on OSX seems to behave as if `size_t` were 32 bits wide even though it's 64 bits wide. I consider this a bug in OSX.
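For reference, the same probe can be made from Haskell with the `unix` package (a minimal sketch; writing to stdout and the exact byte counts are just for illustration):

```haskell
import Foreign.Marshal.Alloc (mallocBytes)
import System.Posix.IO (stdOutput, fdWriteBuf)

main :: IO ()
main = do
  -- 2 GiB: on an affected OSX this fdWriteBuf call throws "invalid argument" (EINVAL);
  -- with 2^31 - 1 bytes it succeeds and returns 2147483647.
  let n = 2 ^ (31 :: Int)
  buf <- mallocBytes n                  -- buffer contents are irrelevant for the probe
  written <- fdWriteBuf stdOutput buf (fromIntegral n)
  print written
```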
either way, it does result in unexpected behavior for write invocations on OSX :(
Problem is that while we can add some hacks to `unix` and `base` etc. to paper over this bug by having the respective `write(2)` invocations truncate/clamp their arguments to 2·10^9-1 bytes iff on OSX, it's easy to miss some `write(2)` call-sites, and we obviously can't catch *all* `write` invocations, as I'm sure there's enough code out there that FFI-calls `write(2)` directly, since it's such an old and well-defined POSIX function that exists essentially everywhere...
Also, once Apple fixes this in their operating system, how do we detect that we don't need this workaround anymore? Writing an Autoconf test which tries to write a 2 GiB data buffer to some fd sounds like a bad thing to do... or shall this be done at runtime, by retrying with a clamped buffer size whenever we get an EINVAL for a bufsize >= 2 GiB? I.e.
```c
#include <unistd.h>
#include <errno.h>

ssize_t fake_write(int fd, const void *buf, size_t count)
{
    ssize_t written = write(fd, buf, count);
#if defined(__APPLE__)
    /* Work around OSX rejecting writes of >= 2 GiB with EINVAL:
       retry once with the count clamped to 2^31 - 1 bytes. */
    if (written < 0 && errno == EINVAL && count >= 2147483648ULL)
        written = write(fd, buf, 2147483647ULL);
#endif
    return written;
}
```
Honestly, that dynamic workaround might be the best way to go, though documenting why we have that workaround would probably be a good idea (so a hypothetical future maintainer can drop it).
The scary/sad part is that the same issue applies to the `read(2)` system call too, AFAIK! (though I've not tested that one yet)
I do agree that there's a bit of a damned-if-we-do, damned-if-we-don't situation in how to resolve this :/
I cannot reproduce this on macOS 11.3. I'll close this issue in a week or two, unless someone can confirm that the problem still exists.
Have you tested on 10.14 or 10.15? I think a lotta people are still on those
I have only one mac laptop :)
If the issue has been fixed at the platform level, I have little interest in patching it at the package level. We cannot be held responsible for upstream bugs. I'm nevertheless open to accepting a patch from an affected individual who is stuck on an older platform, but TBH I do not see it happening. It did not happen in three years.
True ;)
I'll try to check if it happens on my current computers. It's def the case that this package has recently gotten much love.
Hrm. Perhaps adding a comment noting that larger-than-2GB writes on older OS X might fail is the right middle ground? Idk.
It looks like the bug was fixed between https://opensource.apple.com/source/Libc/Libc-825.40.1/stdio/FreeBSD/fwrite.c.auto.html and https://opensource.apple.com/source/Libc/Libc-997.1.1/stdio/FreeBSD/fwrite.c.auto.html; there are no further changes to this file after that.
However, Libc-825 corresponds to OS X 10.8.4, which reached end of life in 2015, while the Python issues are reported for OS X 10.10 and 10.11. How can that be? There is certainly no issue on OS X 11.3.
CC @vdukhovni
This is awesome sleuthing! And that discrepancy is certainly concerning
I do agree that at a certain level we should fix Haskell code that's doing stupidly large write chunks (which would make this a hypothetical non-issue), or at least characterize/understand the breakage, even if it's ultimately just a warning note...
On a Big Sur 11.3.1 (latest) MacOS system, I can reproduce the reported issue with the `write(2)` system call, but interestingly it does not apply to `writev(2)`:
```c
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <sys/uio.h>
#include <inttypes.h>
#include <err.h>

#define NBYTES ((1ULL << 31) - 1)

int main(int argc, char *argv[])
{
    /* Byte count from argv[1], defaulting to 2^31 - 1. */
    ssize_t nb = (size_t) (argv[1] ? strtoimax(argv[1], NULL, 0) : NBYTES);
    ssize_t n;
    char *buf = malloc(nb);
    struct iovec v;

    v.iov_base = buf;
    v.iov_len = (size_t)nb;
    /* Default: writev(2); pass any second argument to use write(2) instead. */
    n = (argc < 3 || argv[2] == NULL) ? writev(1, &v, 1) : write(1, buf, nb);
    if (n != nb)
        err(1, "write");
    return 0;
}
```
Test runs show:
```
./wr $(( 2**31 )) write | wc -c
wr: write: Invalid argument
0
```
but with `writev` I get correct POSIX behaviour:
```
./wr $(( 2**32 )) | wc -c
4294967296
```
So, putting aside questions of what happens when users directly use FFI calls to `write(2)`, we can opt to always use `writev(2)` on MacOS and avoid the issue.
Which is not to say that applications should typically call `write(2)` with enormous buffers in one go. One would typically loop, writing something between 4K and 1MB at a time (usually one of 4K, 8K, 16K or 32K). This also helps with "unsafe" calls blocking capabilities in the RTS for unreasonably long. Attempts to write 2GB or more in one go are probably not a good idea.
I ran `sudo dtruss ...` to check whether the failure was in libc or at the system call layer, and it sure seems like the issue is in the kernel system-call interface, not the C library:
```
$ sudo dtruss ./wr $(( 2**31 )) write > /dev/null
open("/dev/dtracehelper\0", 0x2, 0x0) = 3 0
...
sysctlbyname(kern.osvariant_status, 0x15, 0x7FFEE4E50F80, 0x7FFEE4E50F78, 0x0) = 0 0
csops(0x85A8, 0x0, 0x7FFEE4E50FB4) = 0 0
write(0x1, "\0", 0x80000000) = -1 22
...
```
So fixes in the C library would not be sufficient; the kernel needs to support large `write(2)` calls.
It seems that perhaps testing with output to a pipe can be misleading; the same test seems to fail when the output is /dev/null or a file... :-( So `writev(2)` does not look like a viable workaround...
Even splitting the write into multiple iovecs of at most 2^30 bytes each fails once the total size exceeds 2^31 - 1. So the only workaround is an explicit loop doing multiple write calls... Applications flushing very large buffers in a single write call are in a state of sin...
@vdukhovni thanks for the investigation. I'm confused that you can reproduce the issue by calling `write` directly, while the Haskell reproducer above works fine on my machine. Does the Haskell snippet pass in your environment?
Yeah, I agree that writing gigabytes in one call is not quite sensible anyway. It's easy to switch `hPut` to loop over smaller chunks, but on the other hand at least some users (see #259) expect `hPut` to be an atomic operation.
Atomicity of large writes is fragile. Without `O_APPEND`, two processes might write at overlapping file offsets if they opened the file independently. Secondly, Linux prior to 3.14 also failed to be atomic:
```
BUGS
   According to POSIX.1-2008/SUSv4 Section XSI 2.9.7 ("Thread
   Interactions with Regular File Operations"):

       All of the following functions shall be atomic with respect to
       each other in the effects specified in POSIX.1-2008 when they
       operate on regular files or symbolic links: ...

   Among the APIs subsequently listed are write() and writev(2).  And
   among the effects that should be atomic across threads (and processes)
   are updates of the file offset.  However, on Linux before version
   3.14, this was not the case: if two processes that share an open file
   description (see open(2)) perform a write() (or writev(2)) at the same
   time, then the I/O operations were not atomic with respect to updating
   the file offset, with the result that the blocks of data output by the
   two processes might (incorrectly) overlap.  This problem was fixed in
   Linux 3.14.
```
Then of course there's also NFS, which may not provide the same semantics when the race is between multiple clients...
As for reproducing the issue in Haskell, I see the same problem:
```haskell
import System.IO
import qualified Data.ByteString as BS

main :: IO ()
main = do
  hSetBinaryMode stdout True
  bigBS <- pure $ BS.replicate (2^32) 1
  BS.hPut stdout bigBS
```
which compiled and run yields:
```
$ ./foo >/dev/null
foo: <stdout>: hPutBuf: invalid argument (Invalid argument)
```
All right, let's loop with smaller chunks. Should these chunks be as big as possible (2^31-1 bytes) to retain as much atomicity as possible, or is it better to make them reasonably sized at 64k or similar?
I'd go with 1MB, which should comfortably exceed most block sizes in ZFS or flash storage, avoiding partial block writes.
Which means 2048 writes for 2GB, but that's not unreasonable. If the loop is in the Haskell code calling into the FFI, then we even avoid tying up RTS capabilities for a long time with long-duration `unsafe` calls.
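A minimal sketch of what that chunked loop could look like at the `ByteString` level (the helper name `hPutChunked` and the exact constant are illustrative, not the actual patch):

```haskell
import System.IO (Handle)
import qualified Data.ByteString as BS

-- Illustrative 1 MiB chunk size, per the discussion above.
chunkSize :: Int
chunkSize = 1024 * 1024

-- Write a strict ByteString in chunks, keeping each underlying write(2)
-- far below the 2 GiB size that trips OSX.
hPutChunked :: Handle -> BS.ByteString -> IO ()
hPutChunked h bs
  | BS.null bs = pure ()
  | otherwise  = do
      let (front, rest) = BS.splitAt chunkSize bs
      BS.hPut h front
      hPutChunked h rest
```

Since `BS.splitAt` on a strict `ByteString` is O(1) slicing, the loop adds no copying; it just issues many moderate-sized writes instead of one huge one.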
> As for reproducing the issue in Haskell, I see the same problem
Surprisingly, I still cannot reproduce it.
Yep, splitting `hPut` writes into 1M chunks sounds good to me.
> Surprisingly, I still cannot reproduce it.
What version of MacOS (Darwin) are you testing on, and what CPU architecture? Please also test with the below dtrace(1) script:
```d
#pragma D option bufsize=8k
#pragma D option bufpolicy=fill

syscall::write:entry
/pid == $target/
{
    printf("write bytes = %d, ", arg2);
}

syscall::write:return
/pid == $target/
{
    printf("ret = %d\n", arg0);
}
```
which I've saved as `foo.d`, and then executed the compiled Haskell code:
```haskell
{-# LANGUAGE LambdaCase #-}
import System.IO
import System.Environment
import qualified Data.ByteString as BS

main :: IO ()
main = do
  n <- getArgs >>= \case
    [] -> return $ 2^32
    arg:_ -> readIO arg
  hSetBinaryMode stdout True
  let bs = BS.replicate n 0x38
  BS.hPut stdout bs
```
saved as `foo.hs`, by running (as `root`, since `dtrace(1)` requires privileges on MacOS):
```
# dtrace -o /dev/stderr -q -s foo.d -c "./foo" >/dev/null
dtrace: system integrity protection is on, some features will not be available
foo: <stdout>: hPutBuf: invalid argument (Invalid argument)
write bytes = 4294967296, ret = -1
```
Please report the results for `-c ./foo` (default 2^32 size) and also `-c "./foo 8192"` for comparison.
This is getting interesting! Thanks for helping distill the issue so wonderfully, both of you! (I'm now really curious :) )
Unrelatedly: most array batch primops in GHC should do chunking like that too, to make sure long-running primops don't block the GC! (This has been known for a while, but has never been prioritized.)
FWIW, a slightly enhanced `dtrace(1)` script can also return the `errno` value, confirming the `EINVAL`:
```d
#pragma D option bufsize=8k
#pragma D option bufpolicy=fill

syscall::write:entry
/pid == $target/
{
    printf("write bytes = %d, ", arg2);
}

syscall::write:return
/pid == $target/
{
    printf("ret = %d %d\n", arg0, errno);
}
```
Which is 22 on MacOS:
```
$ perl -le 'use Errno qw(:POSIX); $! = EINVAL; printf "%d %s\n", $!, $!;'
22 Invalid argument
```
and so I get:
```
# dtrace -o /dev/stderr -q -s foo.d -c "./foo $(( 2 ** 31 ))" >/dev/null
dtrace: system integrity protection is on, some features will not be available
foo: <stdout>: hPutBuf: invalid argument (Invalid argument)
write bytes = 2147483648, ret = -1 22
```
@vdukhovni I'm running macOS 11.3 on an Intel i5 CPU.
```
$ sudo dtrace -o /dev/stderr -q -s foo.d -c "./foo 8192" >/dev/null
dtrace: system integrity protection is on, some features will not be available
write bytes = 8192, ret = 8192

$ sudo dtrace -o /dev/stderr -q -s foo.d -c "./foo" >/dev/null
dtrace: system integrity protection is on, some features will not be available
write bytes = 2147479552, ret = 2147479552
write bytes = 2147479552, ret = 2147479552
write bytes = 8192, ret = 8192
```
This shows your "foo" program splitting the write into separate writes of 2^31 - 4096 (twice) and 8192 bytes. Are you running a different `ByteString` version that does that?
No, I'm running a stock version of `bytestring`. It looks to me like 2147479552 = 2^31 - 4096 corresponds to `MAXWRITE`, and the traced calls correspond to a loop over `__sfvwrite` in https://opensource.apple.com/source/Libc/Libc-997.1.1/stdio/FreeBSD/fwrite.c.auto.html, but this is a very uneducated guess.
But `fwrite(3)` is at the `<stdio.h>` layer; it is not the underlying system call. It is not clear how that affects Haskell's `ByteString` using `write(2)`.
Writing more than ~2GB at once to a file appears to fail on OSX. This may be related to a platform issue also encountered in this python ticket: https://bugs.python.org/issue24658
Repro:
The following code fails on OSX, emitting an exception:
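A minimal sketch of the kind of code in question (assuming `BS.writeFile` to a file named `testFile`, per the exception below, and a 2^32-byte buffer as in the later reproducers):

```haskell
import qualified Data.ByteString as BS

main :: IO ()
main = BS.writeFile "testFile" (BS.replicate (2 ^ 32) 1)  -- a ~4 GiB strict ByteString
```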
```
*** Exception: testFile: hPutBuf: invalid argument (Invalid argument)
```
Alt approaches:
- Manually creating a file handle and using `hPut` directly makes no difference
- Writing a lazy bytestring also makes no difference