haskell / bytestring

An efficient compact, immutable byte string type (both strict and lazy) suitable for binary or 8-bit character data.
http://hackage.haskell.org/package/bytestring
Other
291 stars 140 forks source link

Sliced interface for ShortByteString #541

Open oberblastmeister opened 2 years ago

oberblastmeister commented 2 years ago

I think we could take some inspiration from byteslice. This would also address https://github.com/haskell/bytestring/issues/193, by expanding the api for ByteArray based types instead of changing ByteString to use ByteArray.

The docs for ShortByteString say

It is suitable for use as an internal representation for code that needs to keep many short strings in memory, but it should not be used as an interchange type. That is, it should not generally be used in public APIs.

Having a sliced ShortByteString would allow using a ByteArray based type in public APIs, similar to ByteString.

Bodigrim commented 2 years ago

I'm reluctant to extend bytestring API even further, it's already difficult to maintain in sync and there is no particular reason for SlicedShortByteString to be here, since it is not used by other boot libraries.

I think pushing forward https://github.com/haskell/primitive/pull/354 and #410 is a more sustainable approach: when merged, byteslice will operate over slices of ShortByteString.

oberblastmeister commented 2 years ago

In my opinion implementing functions directly on ShortByteString is not optimal. Functions on ShortByteString can be implemented trivially by using slices and should have no overhead because of inlining and simplification. Also, the functions currently copy by default which I think should be more explicit, such as by copying the sliced variant. In theory we could move all the functions from ShortByteString to the sliced variant and have conversion functions for people that want to use ShortByteString, and it would be okay. For example the any function can be easily converted for slices

any :: (Word8 -> Bool) -> Slice -> Bool
any k = \Slice {ba, off, len} ->
  let
      w = indexWord8Array ba
      go !n | n >= len    = False
            | otherwise = k (w n) || go (n + 1)
  in go off

and now any for ShortByteString is easy

any :: (Word8 -> Bool) -> ShortByteString -> Bool
any f = Slice.any f . asSlice

So most of the code is just moved instead of duplicated, and I don't think it would be that much harder to maintain, but of course that is your decision. I think adding this type would help unify vector, bytestring, and text because ByteArray can be cheaply converted between them. This would also encourage people to use ByteArray based types instead of ByteString which always has to be pinned.

Bodigrim commented 2 years ago

@oberblastmeister that's understandable goal, but as long as ByteArray from base is shared by all core packages, someone can implement Slice outside of bytestring. I agree that this is likely to cause certain code duplication, but at the moment I do not have bandwidth to include Slice into bytestring itself, at least not without seeing a prior art elsewhere.

@sjakobi what do you think?

oberblastmeister commented 2 years ago

I went looking through filepath as that is the primary consumer of ShortByteString. A large amount of functions are ineficient from copying. The most egregious example is splitDirectories. It would be pretty funny if the new ByteArray based functions were slower than the String based ones. On Windows, it is, by $13 \mu s$! The String version is almost 2 times faster than the ByteArray version! On posix, the situation isn't much better, with splitDirectories taking $7 \mu s$. Windows functions seem to suffer much more from copying because of the larger path size. A quick analysis of splitDirectores gives 112 total copies. The benchmark uses a path with 28 segments, splitPath does 3 copies for each segment and dropTrailingPathSeparator does about 1 also. And it is even worse, splitPath copies the accumulator. This makes the effect of copies even larger, as we are basically getting $n^2$ copying behavior. With slices, there would be zero copies. Clearly, we need slices to improve performance.

Bodigrim commented 2 years ago

Well, Slice adds 16 byte overhead for each chunk, which in a typical use case of filepath is worse than copying content. Your benchmark with 28 segments looks a bit contrived to me, and FWIW splitDirectories can use (Int, Int, ShortByteString) internally to prevent quadratic behaviour, and copy only resulting chunks, not accumulator.

More generally, filepath can define data Slice = Slice Int Int ShortByteString and use it internally to optimize intermediate operations. I think this is the best balance of cost/benefit right now. If it works well, in future we can discuss this issue further.

clyring commented 2 years ago

If the cost of copying some bytearray is significant, it usually makes sense to tell the GC not to copy it, by pinning that bytearray. So I don't find the idea of sliced-unpinned-bytestring all that persuasive on the face of it. (Of course, ByteString is typically still a bit more memory-expensive than simple sliced-pinned-bytearrays would be because its representation involves an extra two-word PlainPtr heap object. But I hope to eventually improve GHC so that PlainPtr does not actually allocate a new heap object.)

But in many cases I expect the observed "copying cost" to actually be dominated by the allocation cost of the buffers in which the copies are placed. And unlike copying in itself this is unaffected by pinnedness. The allocation overhead for String is (with the obvious exceptions) truly horrible: So horrible that for typical or even somewhat-large filepath inputs it doesn't surprise me that $O(n)$ string vs $O(n^2)$ shortbytestring allocations the shortbytestrings can still perform comparably or better.

I've been thinking for a while about removing most of the near-duplication between our ByteString and ShortByteString code, by writing both with identical combinators at different types. I expect that doing so would make it trivial to add an unpinned-slices variant.

oberblastmeister commented 2 years ago

Your benchmark with 28 segments looks a bit contrived to me

It's from filepath.

More generally, filepath can define data Slice = Slice Int Int ShortByteString and use it internally to optimize intermediate operations. I think this is the best balance of cost/benefit right now. If it works well, in future we can discuss this issue further.

Sounds good.

I've been thinking for a while about removing most of the near-duplication between our ByteString and ShortByteString code, by writing both with identical combinators at different types. I expect that doing so would make it trivial to add an unpinned-slices variant.

How would you do this? Something like type classes?

clyring commented 2 years ago

I've been thinking for a while about removing most of the near-duplication between our ByteString and ShortByteString code, by writing both with identical combinators at different types. I expect that doing so would make it trivial to add an unpinned-slices variant.

How would you do this? Something like type classes?

Type classes can do this, but then un-specialized versions of every function will be generated even though they should never be used. Backpack is perhaps the "correct" tool for the job, but it's not available with the older versions of GHC and Cabal still supported by bytestring, and would also probably require import cycles to use in this way. So I had in mind to use CPP to share the implementations, if/when I get around to this.

hasufell commented 3 months ago

More generally, filepath can define data Slice = Slice Int Int ShortByteString and use it internally to optimize intermediate operations. I think this is the best balance of cost/benefit right now. If it works well, in future we can discuss this issue further.

Sounds good.

Yes, I've already thought about that. PRs welcome. But I don't really want to re-implement byteslice package.