haskell / vector

An efficient implementation of Int-indexed arrays (both mutable and immutable), with a powerful loop optimisation framework .
Other
366 stars 139 forks source link

vector-0.12 breaks assumed alignment of Storable vectors #152

Closed NicolasT closed 7 years ago

NicolasT commented 7 years ago

The changes introduced in 6ddbbe81c3f6b090a301534b4bb331bc71e93117 change the alignment of allocated vectors to the natural alignment of a contained element. Before version 0.12, the alignment was whatever mallocPlainForeignPtrBytes does. In my experience mallocPlainForeignPtrBytes plays by the rules of what malloc does, i.e. (on x86_64) allocate 16-byte aligned vectors (see https://www.gnu.org/software/libc/manual/html_node/Aligned-Memory-Blocks.html and the ABI), and also uses 16 byte alignment on x86 (I tested on ARM, Aarch64 and PPC64 in the past, and those also seem to use 16 byte aligned allocations, based on a QuickCheck test https://github.com/NicolasT/reedsolomon/blob/644c5bccb890e5c91cb66682cf8e212c6cf5cd60/test/Galois.lhs#L404)

Whilst this is wrong for elements which are larger than 16 bytes (which is what I assume the change wants to fix), this change now allows vectors of elements of less than 16 bytes to be allocated at a smaller alignment. As such, the alignment of Vector Word8 could be anything at all, whilst before it was 'guaranteed' to be 16 byte aligned, which causes breakage when such Vector Word8 is passed through FFI to some C function which uses SIMD aligned loads and stores (like my library does).

Pushing the 16 byte aligned requirement to a consumer of the library would somehow require to allocate a Vector Word16 and cast it to Vector Word8 (because in the end many storage applications work on bytes), which seems clumsy (at least) and isn't passed along allocating vector operations.

At some point in time I worked on some vector type which tracked minimal alignment as a type-level number, but that introduces a different API etc.

Given vector-0.12 potentially breaking any FFI using SIMD, I'd like to propose to change the allocation code to use max alignment 16 as the requested alignment of the allocation, to restore the previous behavior.

cartazio commented 7 years ago

so ... I'm not sure if i agree that this is strictly a breakage, because vanilla haskell code doesn't enforce any alignment of slicing guarantees.

that said, it is an issue that we dont have a composable way of guaranteeing this at the moment.

@NicolasT could you unbreak such codes by having a newtype

newtype CacheLineAligned a = CLA a 
instance Storable a => Storable (CacheLineAligned a ) where
    ...
   alignment  _ = max 16 (alignment (undefined :: a))
   ...
cartazio commented 7 years ago

the point is: that alignment guarantee wasnt specified anywhere as something we had to adhere to, So while i agree with your point that its a change, it wasn't an api contract but rather an accident.

otoh, @hvr @dolio @ekmett and other may have other useful thoughts on this.

but that said, having honest alignment requirements IS important, and before we were silently ignoring them, that this made certain codes using aligned simd C codes work before that dont currenlty isn't within the provenance of the current vector contract.

Perhaps it should be, but it certainly isnt today :)

NicolasT commented 7 years ago

@cartazio How would the CacheLineAligned trick work? If my code takes a Vector Word8 as input, and I then turn it into a Vector (CacheLineAligned Word8) through Vector.map CLA inputVector, inputs are likely to be copied, no?

In general, I think the 16 byte aligned contract is can be defended. One could argue any SIMD FFI code should be able to work with non-aligned inputs, which may work with a single input vector (just do byte-wise operations until you reach your alignment constraint), but not when you're dealing with multiple input/output vectors whose alignment difference between one another can be... anything. In C(++) you can have the 'Input must be X aligned' as an API contract, which means the caller should use posix_memalign or similar, but in Haskell there's no way to achieve this (using plain Vector).

In my library, I use 16 byte alignment as the 'default': as an example, in the AVX2 versions of my routines, which would require 32 byte aligned values for aligned loads, I fall back to unaligned load/store instructions by default. This doesn't really make much difference (on AVX2-enabled CPUs), and on recent CPUs aligned vs. non-aligned loads of 16 byte values (SSE2/.../AVX) shouldn't make much difference either, but on older CPUs it can.

As a temporary work-around I guess I could change my code such that the 'expected' alignment when using vector<0.12 is 16 bytes, otherwise 1 byte, but I think in the long run we need a better story/solution for this stuff.

I'm not overly familiar with Repa or Accelerate, but I bet if they do some kind of SIMD they may be impacted as well.

cartazio commented 7 years ago

You could do a smart wrapper that checks alignment and either just does a safe coerce if aligned or if not aligned does a copy. There is no way around this if you're only using aligned simd.

Point in fact: inner loops for modern high quality simd codes such as the Blis c library in general do copying to simplify alignment assumptions.

Likewise : the real issue i have with what I believe you are claiming is that we have no way to guarantee that any storable slice is cacheline sized and aligned. At which point you actually must have native cacheline sized short vectors as storable instances. Because otherwise slicing and stuff are broken period.

The change in vector was to more accurately reflect information in the storable type class.

Perhaps the real root of the issue is that storable doesn't provide both bulk and element wise alignment methods and you're saying those are conflated here?

On Tue, Jan 31, 2017 at 6:03 PM Nicolas Trangez notifications@github.com wrote:

@cartazio https://github.com/cartazio How would the CacheLineAligned trick work? If my code takes a Vector Word8 as input, and I then turn it into a Vector (CacheLineAligned Word8) through Vector.map CLA inputVector, inputs are likely to be copied, no?

In general, I think the 16 byte aligned contract is can be defended. One could argue any SIMD FFI code should be able to work with non-aligned inputs, which may work with a single input vector (just do byte-wise operations until you reach your alignment constraint), but not when you're dealing with multiple input/output vectors whose alignment difference between one another can be... anything. In C(++) you can have the 'Input must be X aligned' as an API contract, which means the caller should use posix_memalign or similar, but in Haskell there's no way to achieve this (using plain Vector).

In my library, I use 16 byte alignment as the 'default': as an example, in the AVX2 versions of my routines, which would require 32 byte aligned values for aligned loads, I fall back to unaligned load/store instructions by default. This doesn't really make much difference (on AVX2-enabled CPUs), and on recent CPUs aligned vs. non-aligned loads of 16 byte values (SSE2/.../AVX) shouldn't make much difference either, but on older CPUs it can.

As a temporary work-around I guess I could change my code such that the 'expected' alignment when using vector<0.12 is 16 bytes, otherwise 1 byte, but I think in the long run we need a better story/solution for this stuff.

I'm not overly familiar with Repa or Accelerate, but I bet if they do some kind of SIMD they may be impacted as well.

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/haskell/vector/issues/152#issuecomment-276522238, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAQwmu06d5-vmzngzSDooaOT1VwMczNks5rX73egaJpZM4LzFs2 .

NicolasT commented 7 years ago

It's really not that simple: using an alignment check and then using the appropriate load/store instructions may work, but once we go outside the realm of x86, we're toast: ARM/AArch64 Neon or PPC64 Altivec don't even have non-aligned loads and stores, as far as I know. So passing in an unaligned input would have a major performance impact.

Copying wouldn't work either: inputs to a Reed-Solomon encode/reconstruct function can easily be tens of MB in size. Copying those around has an easily observable performance impact.

In C code, one can put alignment of whatever input vectors as part of the contract, and assert on entry, then leave it to the calller to ensure correct inputs (again, using posix_memalign or whatever). In Haskell this is basically impossible now, except by going through a vector of some other type and coercing.

Indeed, I didn't take slices into account, and my current code would indeed fail (even segfault) on them. I didn't consider this before, because de facto using RS calculations on a slice is an uncommon thing to do.

I'm starting to think I should introduce some kind of 'aligned vector' in my library, and a function to convert some vector in an aligned one, which then copies if required. This feels very ugly though, since it doesn't give the programmer any way to reason about runtime performance when inputs are of type Vector Word8 (i.e. ByteStrings from files, network,... The usual when dealing with RS).

Not sure how to handle this inside this package, but I do believe an answer for 'guaranteed minimal alignment' should be found. In the meantime, keeping a minimal of 16 bytes alignment of freshly-allocated vectors seems like a reasonable thing to do, in line with plain malloc guarantees.

cartazio commented 7 years ago

I think we're both saying true things.

Does my storable newtype make sense for getting the best of both worlds for the near term?

On Thu, Feb 2, 2017 at 7:37 PM Nicolas Trangez notifications@github.com wrote:

It's really not that simple: using an alignment check and then using the appropriate load/store instructions may work, but once we go outside the realm of x86, we're toast: ARM/AArch64 Neon or PPC64 Altivec don't even have non-aligned loads and stores, as far as I know. So passing in an unaligned input would have a major performance impact.

Copying wouldn't work either: inputs to a Reed-Solomon encode/reconstruct function can easily be tens of MB in size. Copying those around has an easily observable performance impact.

In C code, one can put alignment of whatever input vectors as part of the contract, and assert on entry, then leave it to the calller to ensure correct inputs (again, using posix_memalign or whatever). In Haskell this is basically impossible now, except by going through a vector of some other type and coercing.

Indeed, I didn't take slices into account, and my current code would indeed fail (even segfault) on them. I didn't consider this before, because de facto using RS calculations on a slice is an uncommon thing to do.

I'm starting to think I should introduce some kind of 'aligned vector' in my library, and a function to convert some vector in an aligned one, which then copies if required. This feels very ugly though, since it doesn't give the programmer any way to reason about runtime performance when inputs are of type Vector Word8 (i.e. ByteStrings from files, network,... The usual when dealing with RS).

Not sure how to handle this inside this package, but I do believe an answer for 'guaranteed minimal alignment' should be found. In the meantime, keeping a minimal of 16 bytes alignment of freshly-allocated vectors seems like a reasonable thing to do, in line with plain malloc guarantees.

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/haskell/vector/issues/152#issuecomment-277130868, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAQwqhzYd3IxuWq3X-F--57wgIJa5Bkks5rYnbOgaJpZM4LzFs2 .

dolio commented 7 years ago

So, I think I agree with Carter that what you had before was no guarantee. mallocPlainForeignPtrBytes doesn't say anything about alignment, and it just calls newPinnedByteArray# which is a GHC primop. So it completely depends on how GHC chooses to manage its own memory (which it does mostly itself, independent of the system allocator). Even if you ran tests that indicate things were 16 byte aligned, GHC could just decide to break that in one release, or even on a particular architecture.

We could add a way of creating storable vectors with a particular alignment, but I imagine that wouldn't suit your needs, either, because you need all things that could create a vector to have your desired alignment, and you don't want to clone the entire API.

So, the best thing I can think of is also a newtype wrapper. You could beef up Carter's example if you want to get into some fancy types:

newtype Aligned (n :: Nat) t = A t

type family MinimumAlignment (t :: *) :: Nat
type instance MinimumAlignment Word8 = 1
...

instance KnownNat n => Storable (Aligned n t) where
    ...
    alignment = <some calculation involving n>

Then a storable vector of Aligned n t will be aligned to n bytes when created. And you could probably even arrange to get type errors if you to try to align to something less than MinimumAlignment t. Slicing would still be a problem, though, and I'm not sure how you'd eliminate it as a problem without moving alignment information to the vector, like you mentioned.

cartazio commented 7 years ago

as things stand, it seems to make sense to close this ticket, though I hope the discussion was fruitful for all involved. (if theres a concrete course of action thats different, lets zoom in on it in a fresh thread)