rust-lang / compiler-builtins

Porting `compiler-rt` intrinsics to Rust
https://github.com/rust-lang/rust/issues/35437

Several functions perform out-of-bounds memory accesses (which is UB) #559

Open RalfJung opened 9 months ago

RalfJung commented 9 months ago

I found this code which openly admits to doing OOB accesses:

https://github.com/rust-lang/compiler-builtins/blob/2a67ad74b77dd5c7dbd77a27156176136b0b606d/src/mem/impls.rs#L64-L70

This code also looks suspicious, e.g. if this is used to implement 2-byte atomic accesses then there is OOB here:

https://github.com/rust-lang/compiler-builtins/blob/56172fcd8bd045e38bbdf76697d1fcca1e965d6d/src/arm_linux.rs#L57-L89

I haven't done a thorough audit so there might be more. I found these by grepping for atomic_load_unordered.
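For illustration, the pattern at issue is a word-wise copy whose tail handling loads a full word that extends past the end of the source buffer. A minimal sketch of the conservative in-bounds alternative (a hypothetical helper, not the actual compiler-builtins code) simply copies the tail byte by byte:

```rust
// Hypothetical illustration only: instead of reading one whole `usize` word
// that may extend past the end of `src` (UB in Rust, even if the CPU's
// page-granular protection would tolerate it), copy the remaining
// `n < size_of::<usize>()` bytes one at a time, touching only memory that
// is actually in bounds.
unsafe fn copy_tail_in_bounds(dest: *mut u8, src: *const u8, n: usize) {
    for i in 0..n {
        *dest.add(i) = *src.add(i);
    }
}

fn main() {
    let src = [1u8, 2, 3];
    let mut dst = [0u8; 3];
    unsafe { copy_tail_in_bounds(dst.as_mut_ptr(), src.as_ptr(), 3) };
    assert_eq!(dst, [1, 2, 3]);
    println!("{:?}", dst);
}
```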

RalfJung commented 9 months ago

OOB accesses using Rust accesses (including atomic and volatile) are always UB.

The only way to do OOB accesses is using inline assembly. Then the inline asm block has to ensure that the access can never cause a program abort (so it should be on the same page as an address that we know we could access from Rust).

RalfJung commented 8 months ago

If I understand https://github.com/rust-lang/rust/pull/113923 correctly, compiler-builtins can now participate in LTO, so there's no way to make an argument like "this crate is part of the privileged runtime so it can violate some of the usual rules of UB that we apply to all other code". This is a soundness issue.

bjorn3 commented 8 months ago

AFAIK the calls to compiler-builtins intrinsics are inserted somewhere in the backend well after optimizations. In fact that PR needed to explicitly mark these functions as used to prevent LLVM from optimizing all these intrinsics away as dead code.

RalfJung commented 8 months ago

That sounds plausible for calls to the LLVM builtin (making that a libcall happens pretty late), but if the user directly calls the C function memcpy, why would that not be inlined?

bjorn3 commented 8 months ago

It is possible that LLVM will replace explicit memcpy calls with the llvm.memcpy intrinsic. I don't know whether this is guaranteed in practice, though, or whether it is LLVM itself that does it or Clang when compiling C code.

RalfJung commented 8 months ago

There now is a "masked load" intrinsic that can be used to do a large load that is partially OOB. Would that help to fix the OOB in this code without regressing performance?
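The llvm.masked.load intrinsic is not exposed on stable Rust, but its semantics can be modeled in scalar code. The sketch below (a hypothetical helper under that assumption) shows the key property: lanes whose mask is off are never dereferenced, so the load stays in bounds even when a full word would run past the end of the buffer:

```rust
// Scalar model of a masked load (hypothetical helper; the real
// llvm.masked.load intrinsic is not available on stable Rust). Bytes past
// `buf.len()` are "masked off": they are left as zero rather than read
// from memory, so nothing out of bounds is ever dereferenced.
fn masked_word_load(buf: &[u8]) -> usize {
    let mut word = 0usize;
    for (i, &b) in buf.iter().take(core::mem::size_of::<usize>()).enumerate() {
        word |= (b as usize) << (8 * i);
    }
    word
}

fn main() {
    // Only 3 bytes are in bounds; a full word load here would be OOB.
    let buf = [0x11u8, 0x22, 0x33];
    assert_eq!(masked_word_load(&buf), 0x33_2211);
    println!("{:#x}", masked_word_load(&buf));
}
```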

Lokathor commented 8 months ago

I think it would take enough time to set up the mask that you could likely just read the initial byte chunk in a loop and have it be just as fast.
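The byte-prefix approach described above can be sketched as follows (a hypothetical example, not the compiler-builtins implementation): consume leading bytes one at a time until the data is word-aligned, process aligned words in the middle, and finish with a scalar tail. Every access stays strictly in bounds:

```rust
use core::mem::size_of;

// Hypothetical example of prefix/word/tail structure: sum all bytes of a
// buffer using word-sized chunks in the middle, with scalar loops for the
// unaligned head and the short tail. No access is out of bounds.
fn sum_bytes(buf: &[u8]) -> u64 {
    let mut total = 0u64;
    let mut i = 0;
    // Scalar prefix: consume bytes until `buf[i..]` starts word-aligned.
    while i < buf.len() && (buf.as_ptr() as usize + i) % size_of::<usize>() != 0 {
        total += buf[i] as u64;
        i += 1;
    }
    // Word loop: process aligned, fully in-bounds words.
    while i + size_of::<usize>() <= buf.len() {
        let word = usize::from_ne_bytes(buf[i..i + size_of::<usize>()].try_into().unwrap());
        total += word.to_ne_bytes().iter().map(|&b| b as u64).sum::<u64>();
        i += size_of::<usize>();
    }
    // Scalar tail: the remaining few bytes, again strictly in bounds.
    while i < buf.len() {
        total += buf[i] as u64;
        i += 1;
    }
    total
}

fn main() {
    let buf: Vec<u8> = (0..=255u8).collect();
    assert_eq!(sum_bytes(&buf), 32640);
    println!("{}", sum_bytes(&buf));
}
```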

the8472 commented 7 months ago

Masked loads can still be useful for code-size reductions if they allow keeping only the vectorized loop without the pre/post scalar ones. Though that's most likely only worth it when proper masked-load instructions are available; on x86 that requires the x86-64-v4 target level.

vsukhoml commented 6 months ago

I think the major issue is a discrepancy between the notion of UB in Rust and in the CPU. From the CPU's standpoint it is perfectly OK to read a whole word from memory into a register or cache if permissions allow it. And for most (all?) CPUs, permissions can be set at minimum word granularity (e.g. 32-bit, 64-bit) on platforms with physical memory protection, or at page level. So the fact that it is OK to read one byte also means it is OK to read the whole word. Perhaps a target architecture flag could indicate when strict byte/half-word reads must be used, for architectures where this property doesn't hold (WebAssembly? I'm not sure whether it is enforced at byte level there).

As for UB - yes, it is illegal to use the extra bytes in the register beyond the intended ones, but if they are excluded from further computation, say by masking, it should not matter how they were read from memory. Under the hood the CPU has already read all the bytes in the cache line (if present), and if unaligned loads are supported it will issue multiple word reads on the bus and merge the result.

But the atomic case above is worse: since the atomic transaction cannot be done in a single operation whose atomicity is guaranteed by hardware, some kind of software synchronization (e.g. a mutex) should be used instead, because under real contention it is possible that only half of the word is updated and observed by another core/thread. Code that only makes a 'best effort' at this atomicity property can behave incorrectly. Perhaps atomics should be required to be aligned?
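The software-synchronization approach suggested here can be sketched with a lock-guarded value (a hypothetical model, not the arm_linux.rs code): when the hardware cannot perform an atomic read-modify-write at the required size or alignment, a lock ensures no observer ever sees a half-updated value:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical sketch: emulate an atomic u16 with a mutex. Every access
// goes through the lock, so updates are never observed half-done -- the
// property a hardware atomic would otherwise provide.
struct EmulatedAtomicU16(Mutex<u16>);

impl EmulatedAtomicU16 {
    fn new(v: u16) -> Self { Self(Mutex::new(v)) }
    fn fetch_add(&self, v: u16) -> u16 {
        let mut guard = self.0.lock().unwrap();
        let old = *guard;
        *guard = old.wrapping_add(v);
        old
    }
    fn load(&self) -> u16 { *self.0.lock().unwrap() }
}

fn main() {
    let a = Arc::new(EmulatedAtomicU16::new(0));
    let handles: Vec<_> = (0..4).map(|_| {
        let a = Arc::clone(&a);
        thread::spawn(move || for _ in 0..1000 { a.fetch_add(1); })
    }).collect();
    for h in handles { h.join().unwrap(); }
    assert_eq!(a.load(), 4000);
    println!("{}", a.load());
}
```

In a real runtime the lock itself would have to come from somewhere OS- or platform-specific (a futex, a spinlock with interrupts disabled, etc.), which is exactly the kind of dependency being discussed.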

RalfJung commented 6 months ago

CPU-level reasoning is simply void when you talk about Rust code (unless it's inline assembly). So the CPU standpoint is really entirely irrelevant here.

vsukhoml commented 6 months ago

For Rust code - yes. At the hardware level, reads of uninitialized memory happen all the time as an internal implementation detail; it is only at the LLVM/Rust level that this memory must not be used prior to initialization.

At the same time, for LLVM this starts to make sense at a lower level: what is the best way on specific hardware to read the required data from memory while avoiding bringing in uninitialized data? Techniques like masking and shifts seem fine to me. Yes, it can be more efficient to do two word reads and a couple of shifts than to do 4/8 byte reads and more shifts. But that also means LLVM would do better to expose this sort of intrinsic and lower it at its own level. Certain optimizations also become possible if the information is available at the IR level - e.g. when reading unaligned words from an array sequentially, you can reuse the result of the previous element's reads instead of reading two words again.

With atomic reads it is the same: if the hardware doesn't support an unaligned atomic operation, that operation should probably trigger a configurable warning about a suboptimal implementation, and LLVM can make a best effort to emulate it - but this should probably happen at the LLVM level, not the Rust level. Some implementations require support functions, and those support functions can be written in any language. If that is the intent of the code above, it is inherently wrong and should be rewritten to use explicit synchronization, which may itself be OS-specific.

RalfJung commented 6 months ago

I don't know what you are arguing for here. The code here is uncontroversially buggy. All these hardware-specific things you mention are entirely missing the point. This is true for Rust code and also for LLVM: LLVM IR is an abstract language with subtle (and often not clearly specified) semantics, and out-of-bounds reads are UB. You can't weasel your way out of that by waving your hands and saying something about hardware. Please read this blog post that I already linked above to understand why this point is so important.

techniques like masking, shifts seems fine for me

Then you are wrong. They are not fine.

There is an alternative universe in which Rust and LLVM are specified in a way that allows these techniques, but we do not live in that universe. (And doing so would have other downsides; discussing them is entirely off-topic here.) It doesn't matter whether they "seem fine"; what matters is what the involved languages (Rust, LLVM IR) document in their specification (which reflects what they assume in their analysis and transformation passes). And what they document is unambiguous: out-of-bounds reads and writes are UB.

Lokathor commented 6 months ago

So, switching the discussion a bit to finding a path forward: Should we simply rewrite the current code to have correctness at the possible expense of performance, and then people can get back that performance in later PRs if they can show that correctness is always maintained by each PR's changes, or what? I guess this is a T-libs discussion to have. Can we nominate the issue to come up at a meeting?

vsukhoml commented 6 months ago

I don't see a point of disagreement. I totally agree with the UB treatment of uninitialized memory at the IR level and higher, and Rust code should not be OK with any out-of-bounds reads.

Then you are wrong. They are not fine.

They are performed at the hardware level all the time. While it is totally fine for LLVM to declare axioms about UB and uninitialized memory for the higher levels, these operations are still performed under the hood. What I'm saying is that at the level where IR is translated to machine code these operations may appear, and sometimes what the generated code calls might be written in a language higher than assembler. E.g. imagine a platform where only word reads exist, but you need to deal with bytes. You'd have to emulate byte operations somehow. Now add uninitialized padding to the equation. What is the proper solution here? LLVM might be better off providing intrinsics, and hooks to implement them where OS support is needed - very similar to how certain operations are handled by LLVM now. There should be some low-level, safe way to get unaligned data that is suitable for the particular platform and safe for the higher levels. But "don't bring UB to the IR"...
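The word-only-machine scenario described above can be sketched as follows (a hypothetical helper): a byte store becomes an aligned word load, a mask, and a word store. The crucial difference from the OOB pattern in this issue is that the containing word lies entirely inside the allocation, so the read-modify-write stays within Rust's rules:

```rust
// Hypothetical sketch: emulate a byte store on a machine that can only do
// word accesses. The whole containing word is in bounds, so the
// load/mask/store sequence is a legal (if slow) implementation strategy.
fn store_byte_via_word(words: &mut [u32], byte_index: usize, value: u8) {
    let word_index = byte_index / 4;
    let shift = 8 * (byte_index % 4); // little-endian byte position
    let mask = 0xFFu32 << shift;
    // Read-modify-write the whole (in-bounds) word.
    words[word_index] = (words[word_index] & !mask) | ((value as u32) << shift);
}

fn main() {
    let mut words = [0u32; 2];
    store_byte_via_word(&mut words, 5, 0xAB);
    assert_eq!(words, [0, 0x0000_AB00]);
    println!("{:#010x}", words[1]);
}
```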

vsukhoml commented 6 months ago

For the code above, I'd think the correct implementation would use conditional compilation to provide different paths: one for platforms that don't care about alignment, e.g. x86-64, and one for platforms that need special implementations, potentially relying on the OS for synchronization/mutexes, to be correct. It would be nice if Rust exposed intrinsics for explicit aligned and unaligned memory access on top of atomic loads (https://llvm.org/docs/LangRef.html#load-instruction) with different alignment specifications, though I'm struggling to find a way to use them. Many solutions I can think of just kick the can down the road: to implement an unaligned atomic as an LLVM intrinsic, you end up calling a builtin function - and how do you implement that function? So either unaligned atomics are supported by the platform or they are not, in which case you need some other implementation and potentially compile-time warnings. The same applies if you want an atomic type of a size the hardware doesn't support.
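The conditional-compilation split proposed here might look like the following sketch (the function name and the target split are hypothetical). One path uses a plain unaligned load on targets where that is cheap and well supported; the other assembles the value from byte loads. Note that both paths only touch in-bounds memory; neither reproduces the OOB pattern from this issue:

```rust
use core::ptr;

// Hypothetical sketch of a cfg-dispatched unaligned load. On x86, unaligned
// loads are cheap, so a single read_unaligned suffices; elsewhere, fall back
// to four byte loads with no alignment assumptions. Caller must guarantee
// that `p..p+4` is in bounds.
unsafe fn load_u32_unaligned(p: *const u8) -> u32 {
    #[cfg(any(target_arch = "x86_64", target_arch = "x86"))]
    {
        // x86 tolerates unaligned accesses; this compiles to a plain load.
        ptr::read_unaligned(p as *const u32)
    }
    #[cfg(not(any(target_arch = "x86_64", target_arch = "x86")))]
    {
        // Conservative path: assemble the value from individual byte loads.
        u32::from_ne_bytes([*p, *p.add(1), *p.add(2), *p.add(3)])
    }
}

fn main() {
    let bytes = [0u8, 0x44, 0x33, 0x22, 0x11];
    // Load from an odd (unaligned) offset; both cfg paths give this result.
    let v = unsafe { load_u32_unaligned(bytes.as_ptr().add(1)) };
    assert_eq!(v, u32::from_ne_bytes([0x44, 0x33, 0x22, 0x11]));
    println!("{:#010x}", v);
}
```

(In reality `ptr::read_unaligned` is portable and correct on every target; the cfg split above only illustrates how one might choose between codegen strategies per platform.)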

RalfJung commented 6 months ago

Can we nominate the issue to come up at a meeting?

I don't think we can do that in this repo; we need to file an issue in the main Rust repo for that.