ogxd / gxhash

The fastest hashing algorithm 📈
https://docs.rs/gxhash
MIT License
799 stars 27 forks source link

Consider using simd_masked_load for the Read Beyond of Death trick #82

Open matthieu-m opened 5 months ago

matthieu-m commented 5 months ago

Reading past an array bounds is unsound

While you are correct that at the machine code level, one can read past an array bounds without invoking UB -- because at the machine code level, there is no array bounds, only linear memory -- this is not the case in higher level languages such as Rust, or even LLVM IR.

It appears this does not trigger any harmful optimization yet, but it may at any moment (especially while upgrading), so it would be best to look towards replacing the current implementation.

That's what intrinsics are for

It's so common for SIMD algorithms to wish to read beyond array bounds that the Rust programming language has included explicit support for it under the form of the simd_masked_load (and its counterpart, simd_masked_store) intrinsic.

Using the intrinsic guarantees that a Rust compiler, and its various backends, will correctly understand the developer's intent.

Unfortunately, documentation is sparse, so further testing (and discussion with developers) may be necessary to assert the precise safety guarantees of the intrinsic -- for example, whether it automatically handles reads beyond the end of an OS page, or not -- and double-check how good the generated code is.

Also unfortunately, it is an unstable intrinsic (requires nightly), which may make it unsuitable for use at the moment, though it could be enabled with a feature flag for nightly users' peace of mind.

Or use inline-assembly.

A last resort implementation would be to directly use inline assembly. The implementation is small enough, and with only 2 target instruction sets, that it may not prove unduly complex, nor too much of a maintenance burden.

mert-kurttutan commented 5 months ago

An alternative approach is to use dispatching load to load with constant length.

#[inline(always)]
pub unsafe fn get_partial_safe(data: *const State, len: usize) -> State {
    if len == 4 {
        let partial_vector = _mm_castps_si128(_mm_load_ss(data as usize as *const f32));
        return _mm_add_epi8(partial_vector, _mm_set1_epi8(4 as i8));
    }
    if len == 8 {
        let partial_vector = _mm_castpd_si128(_mm_load_sd(data as usize as *const f64));
        return _mm_add_epi8(partial_vector, _mm_set1_epi8(8 as i8));
    }
    if len == 16 {
        let partial_vector = _mm_loadu_si128(data);
        return _mm_add_epi8(partial_vector, _mm_set1_epi8(16 as i8));
    }

   # quick solution for compilation, this is incorrect for len other than 4,8,16
        let partial_vector = _mm_loadu_si128(data);
        return _mm_add_epi8(partial_vector, _mm_set1_epi8(16 as i8));
}

The code above works only for len 4,8,16 (also for 12 if you combine 4+8 in another branch).


This obviously has some cons. 
- The implementations for non trivial input sizes might be cumbersome. For instance the size 5 would be a combination of 4+1. I haven't done the work to make sure these can be done (efficiently).
- Another con is the code does not seem optimal if you don't want write/maintain branches for each input size (1,2,..,16).
- Branching might lead to performance cost. But, I did not do any benchmark study where we have different condition for branch prediction. I only used your benchmark to measure performance.

So I am throwing this rather as an idea in case it is valuable to you based on pros/cons I gave.

Edit: I find it appropriate to write this here rather than as additional issue, you can tell me to move it to another issue if you want.
ogxd commented 5 months ago

Hello!

Reading past an array bounds is unsound

It may be unsafe, but I don't think it is unsound, because of the check performed. I understand your concern though that it might not work in the future and that it might not work in some very specific conditions. However, it currently works well (unless proven otherwise) and is one of the reasons this algorithm is an order-of-magnitude faster than other non-cryptographic algorithms. For most applications any other widespread algorithm should do the job just fine and won't be a bottleneck. For other applications, it is an informed choice to rely on GxHash.

That said, if there is a way to be safer without compromising on performance, we'd gladly take it. Here I tried using simd_masked_load as you suggested, however, performance is much worse and bytecode is heavier (done on ARM, maybe this is not helping).

With read-beyond:

Lloh3:
        ldr q1, [x8, lCPI1_0@PAGEOFF]
        cmgt.16b v1, v0, v1
        ldr q2, [x0]
        and.16b v1, v2, v1

With simd_masked_load:

Lloh5:
        ldr q1, [x8, lCPI1_0@PAGEOFF]
        cmgt.16b v1, v0, v1
Lloh6:
        adrp x8, lCPI1_1@PAGE
Lloh7:
        ldr q2, [x8, lCPI1_1@PAGEOFF]
        and.16b v1, v1, v2
        ext.16b v2, v1, v1, #8
        zip1.16b v1, v1, v2
        addv.8h h1, v1
        fmov w8, s1
        movi.2d v1, #0000000000000000
        tbnz w8, #0, LBB1_65
        tbnz w8, #1, LBB1_66
        tbnz w8, #2, LBB1_67
        tbnz w8, #3, LBB1_68
        tbnz w8, #4, LBB1_69
        tbnz w8, #5, LBB1_70
        tbnz w8, #6, LBB1_71
        tbnz w8, #7, LBB1_72
        tbnz w8, #8, LBB1_73
        tbnz w8, #9, LBB1_74
        tbnz w8, #10, LBB1_75
        tbnz w8, #11, LBB1_76
        tbnz w8, #12, LBB1_77
        tbnz w8, #13, LBB1_78
        tbnz w8, #14, LBB1_79
        tbz w8, #15, LBB1_40
        add x8, x0, #15
        ld1.b { v1 }[15], [x8]
LBB1_40:

@mert-kurttutan Unfortunately this approach will inevitably lead to a much bigger bytecode than the URBD and a lot more branching. I doubt it can be nearly as performant once finalized to handle all sizes.

phip1611 commented 5 months ago

Edit: Sorry, my understanding of UB was wrong, but I learned more about it recently.

Old Post > It may be unsafe, but I don't think it is unsound, because of the check performed I also support that it is unsafe but not unsound. Otherwise, if the original claim in the issue is correct, it would make every "external ptr + length = type/slice" unsound, right? However, this is a typical use case used in many crates, [such as in `multiboot2`](https://github.com/rust-osdev/multiboot2/blob/439ee87d1bace2fae4fbd5ce68e9f0dea941d662/multiboot2/src/lib.rs#L225), thus must be known to rustc/llvm. I think this is valid memory management.
RalfJung commented 5 months ago

It is definitely unsound to load more bytes than what has been allocated / than the reference points to. Miri may even support enough Intel intrinsics to be able to tell you that. :)

ogxd commented 5 months ago

It would be interesting to try! However regarding soundness, from the Miri readme itself:

Moreover, Miri fundamentally cannot tell you whether your code is sound. Soundness is the property of never causing undefined behavior when invoked from arbitrary safe code, even in combination with other sound code.

RalfJung commented 5 months ago

Miri cannot tell you whether code is sound, but it can (sometimes) tell you when code is unsound - namely, when it detects UB.

itamarst commented 4 months ago

It sounds like the original motivation was that std::ptr::copy() was too slow; curious if you ever tried std::ptr::copy_nonoverlapping()? (https://github.com/rust-lang/rust/issues/97022 suggests a manual version that's supposedly faster than the latter, but then final comment says that the compiler does better now...)

RalfJung commented 4 months ago

I also support that it is unsafe but not unsound.

Unfortunately you are wrong. This is not ambiguous or open for debate, the code is undoubtedly unsound: one can use it to cause UB from safe code. If an allocation has size 5 and you are reading 8 bytes, that's UB. Checking page boundaries makes no difference. It doesn't matter what the hardware does, what matters are the rules of Rust, and they are fairly clear here: the list of UB includes

"Accessing (loading from or storing to) a place that is dangling or based on a misaligned pointer."

and later

"A reference/pointer is "dangling" if it is null or not all of the bytes it points to are part of the same live allocation (so in particular they all have to be part of some allocation)."

Otherwise, if the original claim in the issue is correct, it would make every "external ptr + length = type/slice" unsound, right? However, this is a typical use case used in many crates, such as in multiboot2, thus must be known to rustc/llvm.

I am not sure why you think this would mean other code is unsound. That's just code accessing memory that is allocated by the outside world, not by Rust. It is allocating specific outside memory that it knows the size of. In contrast, you wrote a safe API and thus you carry the responsibility of ensuring that it works with all possible things that safe code could give you.

All you know is there are, e.g., 4 bytes of memory here. If you now read 16 bytes that's unsound. The soundness requirements for reading 16 bytes are simple: make sure there's actually 16 bytes of allocated memory there, inside an allocation. Arguing with pages and page boundaries is a category error, these terms don't even exist in the Rust Abstract Machine so it is fundamentally impossible to make a valid soundness argument with them. The only thing that matters are the actual boundaries of the Rust-level allocation.

ogxd commented 3 weeks ago

As said above:

if there is a way to be safer without compromising on performance, we'd gladly take it

Otherwise, we'll keep the code as it is. As with anything in life, it's all about compromises. Here we chose to make the fastest hash function possible, unfortunately at the cost of safety guarantees (at least for now). I've edited the readme to better emphasize on this, mentioning the usage of unsafe code and the UB.

If anyone has ideas to make such trick "safer" without deteriorating performance, feel free to open a new issue 😃

RalfJung commented 3 weeks ago

I've edited the readme to better emphasize on this, mentioning the usage of unsafe code and the UB.

Thank you for adding that. However, saying that "some people" qualify this as UB is misrepresenting the fact that there is no controversy over whether this is UB. It's not like there is a hard-to-answer or subtle question here, this is completely unambiguous. I am one of the people defining what is and is not UB in Rust, but if you don't want to take my word for it, please come to our Zulip and solicit some further opinions.

Unsafe Rust is a unique feature in that it can seemingly be used to "implement" things that Rust doesn't actually support. But that's an illusion -- if you write code that is UB, you didn't implement a fast hash function, you implemented a time bomb. With great power come great responsibility, and that responsibility includes respecting the boundaries of what is and is not possible with unsafe Rust. I wish we had all features everyone wants all at once, but for obvious reasons, that's not how it works in the real world. The responsible response is to either help develop the missing feature, or use other features we offer (such as inline assembly).

I should also emphasize that the Rust community generally has a zero-tolerance stance on Undefined Behavior. This is for a good reason: Undefined Behavior anywhere in the program fundamentally undermines all the hard work everyone has done to rule out Undefined Behavior behavior everywhere else -- all the effort in developing the borrow checker, all the effort in structuring code to fit within safe Rust, all the effort in developing safe abstractions around unsafe Rust that can be used with confidence. A single bad apple can spoil the entire bunch, that is just the nature of UB. As some who spent thousands of hours on making Rust safer, it is disheartening to see decisions like this undermine all my efforts.

ogxd commented 3 weeks ago

Thank you for putting in all these efforts. I think your point is clear and I can understand why you feel like it can be "disheartening", however you must understand a few important things:

RalfJung commented 3 weeks ago

Well, you put the crate on crates.io, which is an invitation for others to use it -- and generally comes with the expectation of "no UB". Maybe one day we need to introduce a flag on crates.io so that crates can self-identify as "this isn't currently intended to be sound" (and cargo can be instructed to avoid such crates), but right now all we have here is an implicit social contract.

Thanks for reopening the issue; it is one thing to say "I know there's a soundness bug but I currently don't want to fix it" and another to close the issue, which conveys "there's not a bug here".

FWIW you are by far not the first to want this feature, see https://github.com/rust-lang/unsafe-code-guidelines/issues/2. I'd love to work on a way to do this in a UB-free way, but believe it or not, as a professor I have less time to work on things like that than I used to as a PhD student... and the way academia works, it's also hard to delegate this kind of work. A large part of what I do for Rust is a side project I pursue in my free time.

ogxd commented 3 weeks ago

I've renewed the experiment using simd_masked_load, but this time I pushed the changes so that I could see what the results are on X86 (because I use an ARM MacBook)

TL;DR; Performance is still worse on ARM (as seen here) but very good on X86! I do not have the X86 bytecode but this is already a step forward. I'm not sure how far we could go with this however since this required core_intrinsics feature which is "strongly discouraged".

x86

image

ARM

(notice the red line for small input sizes)

image

Here is the draft branch/PR: https://github.com/ogxd/gxhash/pull/98

hkratz commented 2 weeks ago

Instead of the intrinsic you could use core::simd::Simd<T, N>::load_or_default.