WJWH opened this issue 4 years ago
One might say that, for large enough buffers, performance-sensitive applications should not be using strict bytestrings at all, and should use lazy (chunked) bytestrings or streams instead.
I am not sure that we should introduce automatic switching based on size, but we could perhaps leave the choice to the caller. If we were to provide a switch, the cutoff buffer size should probably be larger than with SHA and the like, because copying data is much cheaper than computing a cryptographic hash.
I agree that high-performance applications should probably use something other than large strict bytestrings, and the docs do already mention that it takes O(N) time. They don't mention that (for example) `cons` on a strict bytestring will block all other Haskell threads on the capability in a multithreaded program, which is probably not what most users would expect.
Apparently the difference between `safe` and `unsafe` calls is only a few hundred nanoseconds, so we could set the switchover value suitably high that the `safe` path would only be taken when the overhead is less than a few percent?
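To make the discussion concrete, here is a minimal sketch of the size-based dispatch being proposed, assuming hypothetical names (`memcpyDispatch`, `cutoff`) and a placeholder cutoff value; this is not the `bytestring` internals, just an illustration of importing the same C function both ways and branching on size.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
module CopyDispatch (memcpyDispatch) where

import Data.Functor (void)
import Data.Word (Word8)
import Foreign.C.Types (CSize (..))
import Foreign.Ptr (Ptr)

-- Unsafe call: no FFI transition overhead, but it blocks all other
-- Haskell threads on this capability for the duration of the copy.
foreign import ccall unsafe "string.h memcpy"
  c_memcpy_unsafe :: Ptr Word8 -> Ptr Word8 -> CSize -> IO (Ptr Word8)

-- Safe call: a few hundred nanoseconds of extra overhead per call,
-- but the runtime can keep running other Haskell threads meanwhile.
foreign import ccall safe "string.h memcpy"
  c_memcpy_safe :: Ptr Word8 -> Ptr Word8 -> CSize -> IO (Ptr Word8)

-- Placeholder cutoff: the thread discusses values from a few MB up
-- to >512 MB, and the right number is platform-dependent.
cutoff :: CSize
cutoff = 2 * 1024 * 1024

-- Small copies take the cheap unsafe path; large ones pay the
-- safe-call overhead so the capability stays responsive.
memcpyDispatch :: Ptr Word8 -> Ptr Word8 -> CSize -> IO ()
memcpyDispatch dst src n
  | n < cutoff = void (c_memcpy_unsafe dst src n)
  | otherwise  = void (c_memcpy_safe   dst src n)
```

This is also where the "extra branch on every dispatch" cost mentioned below would live: one comparison against `cutoff` per call.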
Right, I'm just hesitant to assume that the cutoff is comparable on all platforms, so would be fairly conservative with the buffer size (>= 128k? More?) to be sure that we're not too strongly biased to just X86_64 platforms.
Do you have measurements of how long it takes to copy N bytes for a few values of N on X86 and perhaps ARM? How big are the samples that motivated this issue?
Also what is the cost of moving all the other threads off the capability so they get a chance elsewhere while the copy is happening? How does that scale with the number of threads?
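A rough way to get ballpark numbers for the first question is to time `Data.ByteString.copy` (which performs a fresh `memcpy` of the whole buffer) for a few sizes. This is a sketch only: `getCPUTime` has coarse resolution and one-shot timings are noisy, so a benchmarking library such as criterion would give more trustworthy figures.

```haskell
module Main (main) where

import Control.Monad (forM_)
import qualified Data.ByteString as BS
import System.CPUTime (getCPUTime)
import Text.Printf (printf)

main :: IO ()
main =
  forM_ [64 * 1024, 1024 * 1024, 16 * 1024 * 1024] $ \n -> do
    let bs = BS.replicate n 0
    start <- BS.length bs `seq` getCPUTime       -- force the source first
    let copied = BS.copy bs                      -- fresh memcpy of n bytes
    end <- BS.length copied `seq` getCPUTime     -- force the copy before timing
    -- getCPUTime is in picoseconds; divide by 1e9 for milliseconds.
    printf "%9d bytes: %.3f ms\n" n
      (fromIntegral (end - start) / 1e9 :: Double)
```

Running the same harness on an ARM box would answer the cross-platform half of the question.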
I was thinking more >512 MB or something like that :sweat_smile:. The crypto cutoff is really very low since hashing is so CPU-intensive even for fairly small chunks of data; for memcmp and memcpy I'd expect the cutoff to be much higher. To me, it's not so much about squeezing every last bit of performance out of this, just preventing some worst-case scenarios when people are accidentally being stupid.
OK, so you're on board for a fairly conservative cutoff, perhaps not quite that high, but some multiple of a MB sounds about right. We'll have to see what others think of this idea. I'm not opposed, though it will cost an extra branch on every dispatch; I don't know whether that's ever an issue...
`memcpy` is used in several functions. I would like to see an analysis of which of these functions are mostly used with smaller inputs and which with larger ones, plus benchmarks of the performance cost of branching and of safe calls.
We should really add memcpy/memset/memmove primops into GHC. GHC already has thresholds to inline internal calls (see https://gitlab.haskell.org/ghc/ghc/-/blob/master/compiler/GHC/CmmToAsm/X86/CodeGen.hs#L2194 for memcpy on X86). It could insert unsafe/safe calls with new thresholds if needed.
> We should really add memcpy/memset/memmove primops into GHC.
Indeed, see https://github.com/haskell/bytestring/issues/274
In https://github.com/haskell/bytestring/blob/master/Data/ByteString/Internal.hs#L747, we import `memcpy` as an unsafe FFI call, which means that it runs "inline" and will block all other Haskell threads running on the same capability in the threaded runtime. If the ByteString being copied is very large, this might block the other Haskell threads for longer than is desired. In the crypto libraries, there are both safe and unsafe imports which are switched between depending on the size of the input. This trick might be useful here as well.
I'd be interested to hear your opinions on this; I can pick it up if it's something that's wanted.