I need to do an oob vector load. How?

brson commented 6 years ago

As an optimization during a buffer search, I need (very want) to load that buffer into a SIMD vector, even when the buffer doesn't fit into the vector. E.g. I might have a 31-byte buffer that can be efficiently searched with a 32-byte wide AVX2 vector.

From a machine perspective, I don't see this as a problem, as long as the load doesn't extend beyond the current page; from LLVM's perspective this seems like UB.

I'd really like to be able to write this code in Rust and not have to use assembly.

Here's an example of this pattern:

    #[inline(always)]
    unsafe fn do_tail_clever(needle: u8, p: *const u8, len: isize,
                             i: isize, q: __m256i) -> Option<usize> {
        let rem = len - i;
        debug_assert!(rem < 32);

        // Check if the 32-byte load is within the current page
        let page_alignment = 4096;
        let page_mask = !(page_alignment - 1);
        let current_p = p.offset(i) as usize;
        let avx_read_end = current_p + 32;
        let next_page = (current_p & page_mask) + page_alignment;

        if likely(avx_read_end <= next_page) {
            let x = _mm256_loadu_si256(p.offset(i) as *const __m256i);
            let r = _mm256_cmpeq_epi8(x, q);
            let z = _mm256_movemask_epi8(r);
            let garbage_mask = {
                let ones = u32::max_value();
                let mask = ones << rem;
                let mask = !mask;
                mask as i32
            };
            let z = z & garbage_mask;
            if z != 0 {
                return off(i, z);
            }

            return None;
        }

        // Slow path
        do_tail_simple(needle, p, len, i, q)
    }

It loads beyond the array, does vector operations on it, then disregards the oob bytes with a mask.

I'm hopeful that there is some mechanism to tell LLVM to 'forget' what it knows about this pointer, 'fooling' the optimizer into not messing with it.

From the LLVM aliasing rules, there is some language that makes me hopeful:

An integer constant other than zero or a pointer value returned from a function not defined within LLVM may be associated with address ranges allocated through mechanisms other than those provided by LLVM. Such ranges shall not overlap with any ranges of addresses allocated by mechanisms provided by LLVM.

So there is a class of pointers that can operate on arbitrary memory (those that don't come from LLVM). That suggests to me that I could e.g. send my pointer through assembly or some other black-box function to 'clean it', maybe. On the other hand, calling into any function, or even into inline asm imposes extra instructions that more-or-less defeat the optimization (inline asm in LLVM seems to always spill registers). Though that sentence also says "such ranges shall not overlap with any ranges of addresses allocated by mechanisms provided by LLVM"

I'm not sure how much 'wiggle-room' there is. Is a malloc'd array "provided by LLVM"? What are the consequences of disobeying this "shall not"?

Even if there's no in-language solution and it is technically UB, I am hopeful that I can do this thing without LLVM messing with my codegen.

cc @nikomatsakis writing this here per your request.

RalfJung commented 3 years ago

I just think it is easier to solve these problems in isolation than trying to solve more problems at once.^^ That's why I'd prefer to keep cross-page accesses out of the discussion. shrug

chorman0773 commented 3 years ago

I just think it is easier to solve these problems in isolation than trying to solve more problems at once.

Possibly. In my opinion, the cross-page access problem isn't necessarily being solved directly, it's just being solved as a side-effect of solving the main problem, though I can see the opposite argument. In either case, the rule I proposed for read_volatile doesn't necessarily need the cross-page rule (and going accross pages could just be made into blanket UB). So that can be removed if we are completely adamant against solving the problem now (or if the proposed solution is deficient in some reasonable manner), and then what has been proposed can be used to direct future solutions if and when one is needed or desired. However, if it is a reasonable solution, I don't see why it can't be adopted now.

comex commented 1 year ago

(Two years later…)

This pattern came up as a concern in an LLVM discussion about changing uninitialized reads to return poison instead of undef:

https://discourse.llvm.org/t/rfc-load-instruction-uninitialized-memory-semantics/67481/4

JakobDegen commented 1 year ago

Briefly discussed in backlog bonanza: This is still open. Rust does not support it today, but it seems plausible to have in the language at some point

RalfJung commented 5 months ago

We actually now have an intrinsic that can do something like this: simd_maksed_load. However, you need to produce a mask that indicates which parts of the vector are in-bounds and which are not.

rust-lang / unsafe-code-guidelines

I need to do an oob vector load. How? #2