dheaton-arm opened 3 months ago
These APIs as-is seem very dangerous to use from Rust, see the discussion here and in https://github.com/rust-lang/miri/pull/3787. The docs for these intrinsics should provide guidance for how to use them correctly in Rust.
Given that LLVM exposes intrinsics for each of these operations directly, is there as much concern with it tracking these changes incorrectly? I would expect LLVM to expose intrinsics that function correctly within its own model.
Though, certainly, I could see an argument that these should perhaps operate on a TBIBox (or similar) in Rust, differing from the ACLE.
I would expect LLVM to expose intrinsics that function correctly within its own model.
That is, unfortunately, not a valid assumption. LLVM is often internally incoherent. For instance, they also expose operations for non-temporal stores but those do not behave properly in the LLVM memory model (Cc https://github.com/llvm/llvm-project/issues/64521). They also expose intrinsics to set the floating-point status register, but it would be UB to actually change that register to anything else (unless special attributes are set on the surrounding code).
LLVM often leaves it to their users to figure out which parts of LLVM can be used together correctly, and which cannot.
Cc @rust-lang/opsem
I am in particular concerned about docs like this
```rust
/// SAFETY: The pointer provided by this intrinsic will be invalid until the memory
/// has been appropriately tagged with `__arm_mte_set_tag`. If using that intrinsic
/// on the provided pointer is itself invalid, then it will be permanently invalid
/// and Undefined Behavior to dereference it.
pub unsafe fn __arm_mte_create_random_tag<T>(src: *const T, mask: u64) -> *const T;
```
We have to make sure codegen backends understand what is going on here -- I am not sure where the provenance updates and realloc are happening.
I also don't really understand what `__arm_mte_set_tag` does -- it says it sets a tag for a 16-byte chunk of memory, but what does that mean? Does it mean that all accesses to this chunk must be done with a pointer that has that tag?
Does it mean that all accesses to this chunk must be done with a pointer that has that tag?
Yes, that's correct. The idea behind MTE is that we store a 4-bit tag in the top byte of every virtual memory address - that part is handled by the kernel. By default all of the tags are 0000, so all preexisting code already works with MTE.
In this example, `__arm_mte_create_random_tag` would use the `irg` instruction and generate a random 4-bit tag for the given pointer. `__arm_mte_set_tag` then takes a tagged pointer and uses the `stg` instruction to actually tag the memory address that the pointer points to with the tag.
Once tagged, if MTE is enabled for the process/thread, every access to that memory address has to be done with a pointer tagged with a matching tag. If the two don't match, the hardware will fault and the process will get a SIGSEGV.
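The matching rule described above can be illustrated with a small software model (plain integer arithmetic only -- the names `with_tag`, `tag_of`, and `access_ok` are invented for illustration; on real hardware the `irg`/`stg` instructions and the load/store check do all of this, and the tag sits in bits 59:56 of the pointer):

```rust
// Toy model of MTE tag checking: a 4-bit tag lives in bits 59:56 of the
// 64-bit pointer, and an access succeeds only if the pointer's tag matches
// the tag stored for the 16-byte memory granule it targets.
const TAG_SHIFT: u32 = 56;
const TAG_MASK: u64 = 0xf << TAG_SHIFT;

/// Place `tag` into the tag bits of `addr` (models `irg` picking a tag).
fn with_tag(addr: u64, tag: u8) -> u64 {
    (addr & !TAG_MASK) | ((tag as u64 & 0xf) << TAG_SHIFT)
}

/// Extract the tag bits of a pointer value.
fn tag_of(addr: u64) -> u8 {
    ((addr & TAG_MASK) >> TAG_SHIFT) as u8
}

/// Models the hardware check: a mismatch would raise SIGSEGV.
fn access_ok(ptr: u64, memory_tag: u8) -> bool {
    tag_of(ptr) == memory_tag
}

fn main() {
    let addr = 0x6000_024d_c030u64; // untagged (tag 0000), like all pre-MTE pointers
    let tagged = with_tag(addr, 0b1010); // after __arm_mte_create_random_tag
    let memory_tag = 0b1010; // after __arm_mte_set_tag on that chunk

    assert!(access_ok(tagged, memory_tag)); // matching tag: access allowed
    assert!(!access_ok(addr, memory_tag)); // stale untagged pointer: would fault
    assert_eq!(tagged & !TAG_MASK, addr); // the low "address" bits are unchanged
}
```

Note how the untagged pointer still carries the same low bits, yet an access through it would fault -- which is exactly what makes the question of "is the tag part of the address?" interesting for the memory model.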
If it can be made to play nice with Rust, MTE could potentially be useful for extra pointer provenance checks. E.g. if we could guarantee that adjacent allocations will always have different memory tags, we could preserve provenance information across pointer-usize-pointer casts and still be able to tell the difference between two pointers with the same address but different provenance. For example:
```rust
// imagine we have tagged pointers `a` and `b` to adjacent memory locations
// from different allocations
let c: usize = a.wrapping_add(1) as usize;
let d: usize = b as usize;
if c == d {
    // unreachable: even if the address of a+1 is the same as b's,
    // the tags will not match
}
// convert c and d back into pointers - same addresses, different tags
let c_ptr = c as *const i32;
let d_ptr = d as *const i32;
unsafe { *d_ptr }; // tag matches - dereference allowed
unsafe { *c_ptr }; // tag does not match - hardware fault
```
The general problem is that in Rust (and in C), you can't just offset a pointer by `tag << 56` and use it and expect that to make sense -- even if the hardware ignores the top byte, the compiler "sees" that you have offset this pointer way out of bounds, and that is UB. I left some links above in my earlier post to the discussion about this.
Yes certainly, there's more to the story than just the hardware. The point is that if compilers can be convinced to play along, there are a lot of interesting use cases, benefits and implications for e.g. strict provenance that could be gained.
The tricky bit is figuring out how to do that. Arguably, I'd say that if a compiler sees `tag << 56` as offsetting out of bounds, that is an issue with the compiler. The top byte is not part of the address on any 64-bit architecture I can think of, so changing those bits is not actually offsetting or moving the pointer. It still points to the same address because those bits do not determine the address.
For instance, if I print out the address of some Box I have, I get this: `0x6000024dc030` - 48 bits. Obviously the remaining bits are still there if I convert it to usize, but they won't be displayed if I print it out, because they're not part of the address.
It still points to the same address because those bytes do not determine the address.
Hardware doesn't get to just re-define what the "address" is. Rust (and C) have defined the address to be the entire representation of the pointer. I don't see any easy way to change this, the assumption is baked very deep into LLVM. (But maybe it's not so hard to change in LLVM -- I don't know, I am not an expert on LLVM internals. That would be a conversation to be had with the LLVM people. For the purpose of this discussion I will assume LLVM stays unchanged, until someone points at a design that makes LLVM `getelementptr` and other in-bounds rules work coherently with these intrinsics.)
Arguably I'd say that this is a case of hardware engineers not thinking about the fact that people generally write code in "high-level" languages (like C or Rust), not in assembly...
Hardware can't just unilaterally change the rules and expect that to make sense. There are multiple complicated abstractions interacting here (the ISA and the surface language), and they have to be carefully designed in tandem. Unfortunately one part was now designed and shipped in silicon without considering the other part, so making the entire story work out nicely will be tricky.
For instance, if I print out the address of some Box I have, I get this 0x6000024dc030 - 48 bits. Obviously the remaining bits are still there if I convert it to usize, but they won't be displayed if I print it out because it's not part of the address.
If you print the pointer that is returned from `__arm_mte_create_random_tag`, it will print all the bits, not just the low 48 bits. (Unless someone changed this recently, but then we should revert that change as it would be wrong. Or at least, it would disconnect the printed value from the actual underlying reality, which involves all the bits. This is not something you can just change by fiat by changing what gets printed, that's not how language design works.)
Hardware doesn't get to just re-define what the "address" is. Rust (and C) have defined the address to be the entire representation of the pointer
True, but then I'd also say that it's the OS that should be defining what an address is. To take Linux as an example: Linux does not consider the top bits part of the address. The top bits are either `0*` for userspace addresses or `f*` for kernelspace addresses; they're not actually used beyond that. If Rust and C both think that the whole 64 bits are the address, that is not really a correct assumption.
Agreed on the LLVM side, I'm not an LLVM expert either but I will try to consult some experts about all this.
Or at least, it would disconnect the printed value from the actual underlying reality, which involves all the bits. This is not something you can just change by fiat by changing what gets printed, that's not how language design works.)
Oh yes for sure. This was not a "this proves my point" example, just a somewhat silly illustration, you're obviously right.
The fundamental question here is indeed the one of "what constitutes an address". I agree with the broader approach of programming for the abstract machine rather than a real one, but in designing the abstract machine we also need to consider what it actually runs on. If the abstract machine runs on an OS that doesn't use the top byte, which in turn runs on an architecture that doesn't use the top byte, then pretending that the top byte is part of the address for the purposes of the abstract machine seems a little pointless.
Even more so since these mechanisms are already widely used in practice. Every heap allocation on every Android phone that's been updated in the last couple of years already keeps a tag in the top byte: https://source.android.com/docs/security/test/tagged-pointers
I'd say the language gets to define what values in the language mean. :shrug: Anyway it's kind of moot to discuss who is "supposed to" define this, the fact is that LLVM (and likely GCC) have defined this, and there are very good reasons for defining it the way they do that make it hard to change. We can disagree on whether we think this was a mistake or not, but it is the status quo.
Every heap allocation on every Android phone that's been updated in the last couple of years already keeps a tag in the top byte:
If the tag is set by `malloc`, then everything is fine. As far as Rust is concerned, it is now part of the address and can never change for the lifetime of this allocation.
It is only changing the tag of an already created allocation that causes problems.
The top byte is not part of the address on any 64 bit architecture I can think of, so changing those bytes is not actually offsetting or moving the pointer. It still points to the same address because those bytes do not determine the address.
(getting a bit off topic here, but) For a concrete example of an architecture where this is not the case: this is not allowed on x86_64. Trying to access such a non-canonical address in almost any way causes a fault (which will likely be surfaced by the OS as a SIGSEGV or similar, if not bare-metal).
3.3.7.1 Canonical Addressing In 64-bit mode, an address is considered to be in canonical form if address bits 63 through to the most-significant implemented bit by the microarchitecture are set to either all ones or all zeros. Intel 64 architecture defines a 64-bit linear address. Implementations can support less. The first implementation of IA-32 processors with Intel 64 architecture supports a 48-bit linear address. This means a canonical address must have bits 63 through 48 set to zeros or ones (depending on whether bit 47 is a zero or one). Although implementations may not use all 64 bits of the linear address, they should check bits 63 through the most-significant implemented bit to see if the address is in canonical form. If a linear-memory reference is not in canonical form, the implementation should generate an exception. In most cases, a general-protection exception (#GP) is generated. However, in the case of explicit or implied stack references, a stack fault (#SS) is generated.
(from the Intel® 64 and IA-32 Architectures Software Developer’s Manual)
You can tag pointers using these high bits on x86_64 (if you know your target machine has a small enough virtual address space), but the tag must be removed before using the pointer, and tagging as such is still a (wrapping) offset in Rust semantics IIUC.
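That masking discipline can be sketched with Rust's strict-provenance pointer API (a hedged illustration: `map_addr` keeps the original pointer's provenance, so unlike an integer round trip the untagged pointer stays well-defined to dereference; the 48-bit split and the zero high bits are assumptions about a typical 64-bit Linux userspace target, and the helper names are invented):

```rust
// Pack a tag into the unused high bits of a pointer, and strip it before any
// use. On a target with a <= 48-bit virtual address space the high bits of a
// userspace address are zero, so the mask recovers the original address.
const TAG_SHIFT: u32 = 48;
const ADDR_MASK: usize = (1 << TAG_SHIFT) - 1; // assumes a 64-bit target

fn tag_ptr<T>(p: *const T, tag: usize) -> *const T {
    // map_addr preserves provenance while replacing the address bits
    p.map_addr(|a| a | (tag << TAG_SHIFT))
}

fn untag_ptr<T>(p: *const T) -> *const T {
    p.map_addr(|a| a & ADDR_MASK)
}

fn main() {
    let x = 42i32;
    let p = &x as *const i32;

    let tagged = tag_ptr(p, 0xAB);
    // The tagged pointer must NOT be dereferenced directly: to the compiler
    // (and to x86_64 hardware) it is out of bounds / non-canonical.
    assert_eq!(tagged.addr() >> TAG_SHIFT, 0xAB);

    // ...but after masking the tag off, it is usable again.
    let clean = untag_ptr(tagged);
    assert_eq!(clean, p);
    assert_eq!(unsafe { *clean }, 42);
}
```

The key point matching the paragraph above: the tagging itself is just a (wrapping) offset in Rust semantics, so the tag must always be removed before the pointer is used for an access.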
For a concrete example of an architecture where this is not the case: This is not allowed on x86_64, trying to access such a non-canonical address in almost any way causes a fault
That is completely beside the point I was making. You're right that canonical addressing is a thing, but it does not make the top byte part of the address on x86_64. By default aarch64 behaves the same as x86_64 -- that's why TBI is its own architecture feature that has to be enabled; non-TBI aarch64 will also trap if you try using a tagged pointer. The point I was making is that the top byte is not part of the address, which is just as true for x86_64.
There are slight differences, but x86_64 has its own TBI variants too, and having those enabled makes the canonicality check you quoted behave differently than it normally does.
What I'm saying is that from the OS & arch side 64-bit addresses look more like this:
```
  63 - 56    55 - 48    47 - 40    39 - 32    31 - 24    23 - 16    15 - 8     7 - 0
 00000000 | 00000000 | 00000000 | 00000000 | 00001111 | 11111010 | 10111101 | 11001111
[not address][maybe address][------------------ actual address --------------------]
```
And so pretending from the language side that all 64 bits constitute an address is simply not correct as soon as your code runs on an actual machine. The top byte would only need to be used for addressing if we wanted to address more than 65536 TiB of memory, which is unlikely to happen anytime soon, to say the least.
The point I was making is that the top byte is not part of the address, which is true for x86_64 just as much.
I don't agree -- if setting that byte to the wrong value leads to a segfault, I would say it surely is part of the address. Unless you have what I would consider a somewhat odd definition of "address"... but as I said it's moot. All 64 bits are treated entirely uniformly. They must all have the exact right value to make the access valid. Whether you call the highest bits "not address but must be zero" or "part of the address" makes no difference at all, so let's not waste time debating that point.
The kind of pointer tagging where all accesses to a heap allocation use the exact same high bits is completely compatible with this. For Rust, those high bits are "part of the address"; we can invent new terminology for this if you insist, but it doesn't make a difference.
The kind of pointer tagging where the allocation "moves around" the 64-bit address space (because the high bits change) is not compatible with the LLVM and Rust memory model. It needs to be exposed with a `realloc`-like operation, which does both -- return a new pointer and logically move the data to that new location, invalidating all previously used pointers.
if setting that byte to the wrong value leads to a segfault, I would say it surely is part of the address
Except that whether it does or does not is up to the system the code runs on. If you have TBI/UAI/LAM enabled, you can set it to whatever you want and the hardware/OS will not care, because the actual address part of the pointer has not changed.
I suppose my issue here is that coming at this from the assumption that an address is 64 bits quickly leads to contradictions and behaviour that makes no sense. The address space of the vast majority of systems is 256 TiB. If I set bit 56 to 1, I get an 'address' which would be in the 64th PiB of memory. That is simply outside the address space. You cannot access that memory address because such an address does not exist, the OS doesn't have it and the CPU will fault if you try.
If the abstraction of a 64-bit address space were true, you'd be able to take an address like `0x0000ffffffffffff` and then add 1 to access the next chunk of memory. You cannot do that, because you have now run out of the address space and the pointer you have is no longer an address of anything that's actually in memory. You can either interpret it as invalid altogether, or interpret it as an address plus metadata, but an address it is certainly not, in any practical sense of the word.
They need to be exposed with a realloc-like operation, which does both -- return a new pointer and logically move the data to that new location, and invalidate all previously used pointers.
We can do the realloc hack for sure, I'm just trying to explore this a bit more because it seems to me that it is just that - an ugly hack to paper over the compiler incorrectly modelling the platforms the code actually runs on. I think talking about the allocation "moving around" when you change the high bits is just inaccurate because there is no address space there to move around in.
Essentially, if your OS provides 256 TiB of virtual memory but the correctness of your compiler relies on the assumption that a given allocation has been allocated in the 64th PiB of virtual memory, I just think that assumption is wrong. I understand why it's there, I understand it's much easier to assume that the leading 0s are actually just part of the address instead of blank metadata, but doing so leads to problems like this as soon as the hardware & the OS try to make use of those bits. Which is what those bits are there for.
you'd be able to take an address like 0x0000ffffffffffff and then add 1 to access the next chunk of memory
That would be true if that memory was allocated. But it's not. `0x0000ffffffffffff + 1` behaves just like every other address that is not currently allocated. Just because on Linux that address will never be allocated (as far as we think today) is not sufficient justification for treating it fundamentally differently.
But as I keep saying, this is a pointless attempt at re-defining certain terms without changing any of the fundamental facts. The underlying problem is: having distinct addresses (or whatever you want to call the 64-bit thing that is the input to a load/store operation) all access the same memory changes some fundamental properties of memory. Ignoring 4 bits of the 64-bit address is basically equivalent to having the same pages mapped 2^4 times in different parts of memory, and changing the tag of a pointer is equivalent to doing pointer arithmetic between these different "mirrors".

If compilers were written under the assumption that all memory can have such mirrors, that would make them worse at their job of optimizing code for the common case where no such mirrors exist. Therefore basically all optimizing compilers make the very reasonable assumption that the memory they work on is mapped only once, and special care is needed if you violate that assumption. Which mechanism you use to violate it (mmap'ing the same page multiple times, or instructing the hardware to ignore some bits of the "address") is entirely irrelevant.
We can do the realloc hack for sure, I'm just trying to explore this a bit more because it seems to me that it is just that - an ugly hack to paper over the compiler incorrectly modelling the platforms the code actually runs on.
I would say the ugly hack here is on the hardware side, by having it ignore parts of the input. But I guess we won't come to an agreement on this and it doesn't really matter for this discussion anyway. :shrug:
I think talking about the allocation "moving around" when you change the high bits is just inaccurate because there is no address space there to move around in.
If you remap the same physical pages elsewhere in virtual memory, do they "move"? You could argue either way. This is a similar situation. I can see your perspective, but please don't insist on it being the only perspective.
Essentially, if your OS provides 256 TiB of virtual memory but the correctness of your compiler relies on the assumption that a given allocation has been allocated in the 64th PiB of virtual memory, I just think that assumption is wrong.
That's not the assumption compilers are making. See the paragraph above for what the actual assumption is. (I touched on this before when I said that the key thing is that the bits must all be fixed, not that they must be 0.)
OSes change how much virtual memory they provide -- Linux switched from 48-bit to 56-bit at some point in the not-too-distant past. It's a good thing that we didn't hard-code any assumption like that into our compilers.
Which is what those bits are there for.
No, I don't think you can just unilaterally claim "ownership" of those bits here.
I can see your perspective, but please don't insist on it being the only perspective.
For sure, I can see your perspective as well! I don't think one is particularly more valid than the other, it just depends on whether we start off from the hardware & OS side or from the language side of things. I think we get the best results by doing both and meeting somewhere in the middle, which is what these kinds of discussions are great for :)
No, I don't think you can just unilaterally claim "ownership" of those bits here.
I mean that in the sense that hardware across multiple architectures was designed in a way that does not make use of those bits by default, with the aim of using them for something in the future. The industry is trending towards increasing memory safety and such, memory tagging extensions that use those bits (or at least allow using those bits) are already present on many architectures as mentioned before and run on countless devices. Meaning, I don't think those extensions are going anywhere and if anything they'll only be used more and not less. Nobody is now going to expand the actual address space to the full 64 bits because that'd break all the memory tagging use cases. If anything at some point down the line we'll get 128 bit architectures with once again unused top bits for tagging and metadata. For all intents and purposes those bits are there for the tags. Our options here are to try and ignore it, or to try and make it work well with languages & compilers.
For the sake of argument & from a purely practical standpoint, what's stopping Rust (or I guess more specifically LLVM) from adopting this alternative view of what a memory address is? It seems to me that all the aliasing and mirroring problems you list are only problems if the compiler accounts for all 64 bits, as opposed to effectively masking out the top ones before considering it as an address. If the compiler did that, then suddenly it's not "two mirrored addresses" but "the same address" (maybe with a tag but irrelevant) which matches the underlying platform the code will run on much better.
Just in case it's not clear from the tone of the discussion, I do agree with you on what the current approach of making TBI/MTE work within the current Rust/LLVM memory model should be. Just trying to explore if there's more that could be done but that's more of an academic discussion rather than an actual immediate proposal :)
For the sake of argument & from a purely practical standpoint, what's stopping Rust (or I guess more specifically LLVM) from adopting this alternative view of what a memory address is?
LLVM currently assumes that if you do `ptr.wrapping_offset(foo)`, then the result is a different pointer (unless the offset in bytes is 0). This is very useful for alias analysis to figure out which pointer accesses might conflict with which other pointer accesses. If `ptr.wrapping_offset(1 << 56)` suddenly returns a pointer to the same memory as `ptr`, we make alias analysis decidedly weaker.
I don't know how big the performance impact of this loss of alias information would be, but it would surely be non-trivial even to figure out all the places where the compiler makes this assumption.
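Nothing MTE-specific is needed to see the assumption at work: in today's semantics, the offset pointer is simply a different pointer (a minimal sketch, assuming a 64-bit target):

```rust
fn main() {
    let x = 0i32;
    let p = &x as *const i32;
    // Offsetting by 1 << 56 bytes: to the compiler this is just a very large
    // (wrapping) offset, so alias analysis may assume `p` and `q` never refer
    // to the same object, and e.g. cache reads through `p` across writes
    // through `q`.
    let q = p.wrapping_byte_offset(1isize << 56);
    assert_ne!(p as usize, q as usize);
    // If TBI hardware ignored the top byte, `q` would nonetheless access the
    // same memory as `p` -- exactly the mismatch described above.
}
```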
It's also quite bad that the underlying behavior here becomes so non-portable; generally it is a goal of Rust to make program semantics consistent across targets. That is one reason why we don't expose the x86 or ARM concurrency models, but instead have our own language-level concurrency model (specifically the one from C++) -- people generally don't want to write a version of their concurrency algorithms for each architecture. But here we'd have to say something like "if you offset your pointer by `1 << 56` bytes and access it, then sometimes this is UB and sometimes this behaves exactly like `ptr`"... that's pretty bad from a specification perspective, and would not be a fun model to program against. (And no, we can't say "using a wrong tag causes a trap"; we have to make it UB, or we lose even more optimizations as we could no longer reorder `load` operations with each other.)
To me as a language person, a realloc-like API actually seems like a pretty nice way to expose these hardware features. I guess what is and is not a hack is in the eye of the beholder. ;)
Yeah that makes sense, this being incompatible with alias analysis as it currently stands is pretty unfortunate but most likely not really fixable in practice, as you said, who knows how many assumptions compilers make about this and where.
I suppose in practice not being able to support "full TBI" is probably not that much of an issue. Hardly any use-cases will want to change the tag after the allocation, and for the FFI-related ones that do we can provide the TBIBox to do reallocs and make things work under the hood without messing with the memory model.
If we eventually want to make more extensive use of pointer tagging in Rust (like for pointer provenance checks), we can always look into tagging pointers when the memory is allocated in the same way that Android use-cases currently do, then it's still fine in the current memory model as you said.
Thanks for the discussion, I at least found it very informative! :)
for the FFI-related ones that do we can provide the TBIBox to do reallocs and make things work under the hood without messing with the memory model.
API-wise I think I'd prefer if we had a raw pointer API for these reallocs exposed as a primitive, and then potentially TbiBox built on top of that (or that could already be done in a user crate).
raw pointer API for these reallocs exposed as a primitive
Good idea for sure, agreed!
Feature gate: `#![feature(stdarch_aarch64_mte)]`
This is a tracking issue for AArch64 MTE memory tagging intrinsics.
Public API
Steps / History
Unresolved Questions