WJWH opened this issue 4 years ago
One might say that, for large enough buffers, performance-sensitive applications should not be using strict bytestrings at all, and should use lazy (chunked) bytestrings or streams instead.
I am not sure that we should introduce automatic switching based on size, but we could perhaps leave the choice to the caller. If we were to provide a switch, the cutoff buffer size should probably be larger than with SHA and the like, because copying data is much cheaper than computing a cryptographic hash.
I agree that high-performance applications should probably use something other than large strict bytestrings, and the docs do already mention that it takes O(N) time. They don't mention that (for example) `cons` on a strict bytestring will block all other Haskell threads on the capability in a multithreaded program, which is probably not what most users would expect.
Apparently the difference between `safe` and `unsafe` calls is only a few hundred nanoseconds, so we could set the switchover value suitably high that the `safe` path would only be taken when the overhead is less than a few percent?
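To make the discussion concrete, here is a minimal sketch of the size-based dispatch being proposed, assuming hypothetical names (`memcpyDispatch`, `cutoff`) and a placeholder cutoff value; this is not the `bytestring` internals, just an illustration of importing the same C function both ways and branching on size.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
module CopyDispatch (memcpyDispatch) where

import Data.Functor (void)
import Data.Word (Word8)
import Foreign.C.Types (CSize (..))
import Foreign.Ptr (Ptr)

-- Unsafe call: no FFI transition overhead, but it blocks all other
-- Haskell threads on this capability for the duration of the copy.
foreign import ccall unsafe "string.h memcpy"
  c_memcpy_unsafe :: Ptr Word8 -> Ptr Word8 -> CSize -> IO (Ptr Word8)

-- Safe call: a few hundred nanoseconds of extra overhead per call,
-- but the runtime can keep running other Haskell threads meanwhile.
foreign import ccall safe "string.h memcpy"
  c_memcpy_safe :: Ptr Word8 -> Ptr Word8 -> CSize -> IO (Ptr Word8)

-- Placeholder cutoff: the thread discusses values from a few MB up
-- to >512 MB, and the right number is platform-dependent.
cutoff :: CSize
cutoff = 2 * 1024 * 1024

-- Small copies take the cheap unsafe path; large ones pay the
-- safe-call overhead so the capability stays responsive.
memcpyDispatch :: Ptr Word8 -> Ptr Word8 -> CSize -> IO ()
memcpyDispatch dst src n
  | n < cutoff = void (c_memcpy_unsafe dst src n)
  | otherwise  = void (c_memcpy_safe   dst src n)
```

This is also where the "extra branch on every dispatch" cost mentioned below would live: one comparison against `cutoff` per call.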
Right, I'm just hesitant to assume that the cutoff is comparable on all platforms, so would be fairly conservative with the buffer size (>= 128k? More?) to be sure that we're not too strongly biased to just X86_64 platforms.
Do you have measurements of how long it takes to copy N bytes for a few values of N on X86 and perhaps ARM? How big are the samples that motivated this issue?
Also what is the cost of moving all the other threads off the capability so they get a chance elsewhere while the copy is happening? How does that scale with the number of threads?
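A rough way to get ballpark numbers for the first question is to time `Data.ByteString.copy` (which performs a fresh `memcpy` of the whole buffer) for a few sizes. This is a sketch only: `getCPUTime` has coarse resolution and one-shot timings are noisy, so a benchmarking library such as criterion would give more trustworthy figures.

```haskell
module Main (main) where

import Control.Monad (forM_)
import qualified Data.ByteString as BS
import System.CPUTime (getCPUTime)
import Text.Printf (printf)

main :: IO ()
main =
  forM_ [64 * 1024, 1024 * 1024, 16 * 1024 * 1024] $ \n -> do
    let bs = BS.replicate n 0
    start <- BS.length bs `seq` getCPUTime       -- force the source first
    let copied = BS.copy bs                      -- fresh memcpy of n bytes
    end <- BS.length copied `seq` getCPUTime     -- force the copy before timing
    -- getCPUTime is in picoseconds; divide by 1e9 for milliseconds.
    printf "%9d bytes: %.3f ms\n" n
      (fromIntegral (end - start) / 1e9 :: Double)
```

Running the same harness on an ARM box would answer the cross-platform half of the question.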
I was thinking more >512 MB or something like that :sweat_smile:. The crypto cutoff is really very low since hashing is so CPU-intensive even for fairly small chunks of data; for memcmp and memcpy I'd expect the cutoff to be much higher. To me, it's not so much about squeezing every last bit of performance out of this, just preventing some worst-case scenarios when people are accidentally being stupid.
OK, so you're on board for a fairly conservative cutoff, perhaps not quite that high, but some multiple of a MB sounds about right. We'll have to see what others think of this idea. I'm not opposed, though it will cost an extra branch on every dispatch; I don't know whether that's ever an issue...
`memcpy` is used in several functions. I would like to see an analysis of which of these functions are mostly used with smaller inputs and which with larger ones, plus benchmarks of the performance cost of branching and of safe calls.
We should really add memcpy/memset/memmove primops into GHC. GHC already has thresholds to inline internal calls (see https://gitlab.haskell.org/ghc/ghc/-/blob/master/compiler/GHC/CmmToAsm/X86/CodeGen.hs#L2194 for memcpy on X86). It could insert unsafe/safe calls with new thresholds if needed.
> We should really add memcpy/memset/memmove primops into GHC.
Indeed, see https://github.com/haskell/bytestring/issues/274
In https://github.com/haskell/bytestring/blob/master/Data/ByteString/Internal.hs#L747, we import `memcpy` as an unsafe FFI call, which means that it runs "inline" and will block all other Haskell threads running on the same capability in the threaded runtime. If the ByteString being copied is very large, this might block the other Haskell threads for longer than is desired. In the crypto libraries, there are both safe and unsafe imports which are switched between depending on the size of the input. This trick might be useful here as well.
I'd be interested to hear your opinions on this; I can pick it up if it's something that's wanted.