AArch64 support status and issues

yuyichao commented 2 years ago

rr status

~~Testing on my M1 MBA, there are currently < 30 test failures out of 1311 (40 with syscallbuf, see below).~~ The main missing piece from within rr is the syscallbuf. It's actually quite tricky to implement in a way that satisfies all the requirement we have on x86 (in particular, that it should work without a valid stack....). I have a write up and a WIP implementation on this and I'll post a draft PR here after some more clean ups. All tests passes on apple-m1, neoverse-n1/v1, cortex-a77. Syscallbuf is implemented.
Supported hardware

Currently, we support arm-neoverse-n1 and apple-m1. It seems that most of the recent arm cores up to cortex-a78 should also be supported without much issue (a55, a65, a65ae, a75-a78). I assume the upcoming apple-m2 should also work fine as well assuming it's apple-a15 based.
Kernel features required

x86 currently implements three features (that I can tell) that isn't generally implementable on aarch64 without additional kernel support.
1. Unbound CPU. This should work on aarch64 if there's a single PMU type, (or we bound it to the cores with the same PMU type). Supporting migration between PMU types would likely require kernel support due to the need for interrupt. I kind of doubt they are willing to add this but someone else with more kernel experience should bring this up...
2. CPUID. The traditional way on aarch64 to figure out processor features and IDs is AUXV and procfs/sysfs. These should all be handled well from RR since these are normal kernel software interfaces. Recent kernel versions, however, support emulating the mrs instructions that reads the EL1 cpuid registers and AFAICT doesn't include a way for ptracer to catch it yet.
3. Time register. Like RDTSC on x86, aarch64 has system registers like CNTVCT_EL0 that can be used as counters. (There are a few other related ones as well). There doesn't seem to be a way to trap on these from userspace ATM but at least from the architecture manual for this there should be a way for the kernel to trap this.
SVE/armv9-a

SVE has a feature that I have always been worrying regarding predictability ever since it comes out. To make it easier to vectorize code with complex loop termination condition, SVE has introduced the first fault (FF) and non-fault (NF) versions of the load instructions. When accessing invalid memory with these, instead of producing a fault, these simply set a mask indicating the fault. Clever use of this would then allow vectorization of string functions (e.g. strlen) since one can perform out-of-bound read without any visible consequences.

The issue I saw with this is that it depends on the OS paging. Even if a page is mapped from the userspace point of view, it may not actually be mapped and depend on how the kernel feel like being lazy or not. This was previously completely transparent to the userspace but now with the SVE instructions, one can in principle observe these and it is therefore something that rr has to keep track of/manage.

It also seems that this could be worse. While we can in principle track and record what the kernel does. The arm ISA document says that
```
Implementation may suppress NF load for any reason
```
(Search for MemSingleNF). The exact behavior here is of course implementation dependent and it's of course possible that the vendors are quite reasonable here. However, that's something that at least need to be tested.

This is relevant for any processor with SVE. The fujitsu-a64fx is probably the one with the highest hope of being able to run rr at the moment (Their PMU document doesn't mention the counter we use but the numbering of the rest agrees with the ARM PMU document so I think one need to just check if the ones we use are implemented...). This is likely going to matter more in the future since SVE and SVE2 are part of the armv9-a requirement and all future ARM processors starting from a510/a710, including neoverse-n2 and neoverse-v1 will have them (neoverse-n2 and v1 are not armv9 but n2 has SVE and v1 has SVE/SVE2). It's also perceivable that a distro would release a new version for armv9-a and binaries in it could be compiled with SVE turned out at compile time so masking off the feature may or may not work at that time...

khuey commented 2 years ago

* Kernel features required
  x86 currently implements three features (that I can tell) that isn't generally implementable on aarch64 without additional kernel support.

  1. Unbound CPU. This should work on aarch64 if there's a single PMU type, (or we bound it to the cores with the same PMU type). Supporting migration between PMU types would likely require kernel support due to the need for interrupt. I kind of doubt they are willing to add this but someone else with more kernel experience should bring this up...

This is not important IMO.

  2. CPUID. The traditional way on aarch64 to figure out processor features and IDs is AUXV and procfs/sysfs. These should all be handled well from RR since these are normal kernel software interfaces. Recent kernel versions, however, support emulating the `msr` instructions that reads the EL1 cpuid registers and AFAICT doesn't include a way for ptracer to catch it yet.

The mrs emulation code in arch/arm64/kernel/cpufeature.c needs to obey ARCH_SET_CPUID. This is going to be a pain because they made us put it in arch_prctl on x86 and arm64 doesn't have arch_prctl.

  3. Time register. Like RDTSC on x86, aarch64 has system registers like CNTVCT_EL0 that can be used as counters. (There are a few other related ones as well). There doesn't seem to be a way to trap on these  from userspace ATM but at least from the architecture manual for this there should be a way for the kernel to trap this.

Similarly, cntvct_read_handler should obey PR_SET_TSC.

* SVE/armv9-a
  SVE has a feature that I have always been worrying regarding predictability ever since it comes out. To make it easier to vectorize code with complex loop termination condition, SVE has introduced the first fault (FF) and non-fault (NF) versions of the load instructions. When accessing invalid memory with these, instead of producing a fault, these simply set a mask indicating the fault. Clever use of this would then allow vectorization of string functions (e.g. strlen) since one can perform out-of-bound read without any visible consequences.

Ugh, so they made page faults user visible? What a nightmare.

yuyichao commented 2 years ago

Unbound CPU.

This is not important IMO.

This is mainly annoy when you got randomly pinned to a E core and then got stuck there for the rest of the run even if the P core are free to use... I guess some option/default to prefer higher performance core might offset this issue.

and arm64 doesn't have arch_prctl.

So looking at the document for arch_prctl, is it a thing on 32bit x86 before 4.12? It seems that the document still says it's x64 only, even though it also says that SET|GET_CPUID works on x86...

Similarly, cntvct_read_handler should obey PR_SET_TSC.

Huh, is reading these values always trapping on linux? The ARM description of this register seems to have a branch that doesn't trap in EL0 and I thought it must be what's the kernel is doing for better performance...

Ugh, so they made page faults user visible? What a nightmare.

I don't think I understand the full impact of this myself but for the record, I did ask about this right after SVE came out ..... I'm honestly not sure if the kernel or the hardware part would be harder to deal with...

yuyichao commented 2 years ago

is it a thing on 32bit x86 before 4.12

So it does seem that you added it to x86-32 (https://github.com/torvalds/linux/commit/79170fda313ed5be2394f87aa2a00d597f8ed4a1, and given the large syscall number I guess I should have guessed...) so at least it should be somewhat similar in this regard?

khuey commented 2 years ago

Unbound CPU.

This is not important IMO.

This is mainly annoy when you got randomly pinned to a E core and then got stuck there for the rest of the run even if the P core are free to use... I guess some option/default to prefer higher performance core might offset this issue.

Perhaps we should only ever pin to P cores?

and arm64 doesn't have arch_prctl.

So looking at the document for arch_prctl, is it a thing on 32bit x86 before 4.12? It seems that the document still says it's x64 only, even though it also says that SET|GET_CPUID works on x86...

Yeah, the manpages are wrong about it being amd64 only.

Similarly, cntvct_read_handler should obey PR_SET_TSC.

Huh, is reading these values always trapping on linux? The ARM description of this register seems to have a branch that doesn't trap in EL0 and I thought it must be what's the kernel is doing for better performance...

I believe so.

Ugh, so they made page faults user visible? What a nightmare.

I don't think I understand the full impact of this myself but for the record, I did ask about this right after SVE came out ..... I'm honestly not sure if the kernel or the hardware part would be harder to deal with...

I don't see how we could deal with it. Recreating the precise state of what is paged in or out is not possible.

yuyichao commented 2 years ago

Perhaps we should only ever pin to P cores?

But then you do need to figure out what to do when there's one cortex-x1, three cortex-a78 and 4 cortex-a55.. = = ..... For running parallel tests (rr, or else) it is also not particularly nice..

I don't see how we could deal with it. Recreating the precise state of what is paged in or out is not possible.

I assume this info is available in the kernel so at least it is in principle possible with some kernel patch. Even without it, I think we could ask the kernel to pin all the pages (e.g. with mlock) though that would require fairly heavy instrumentation on all the mmap, mprotect, madvice calls etc and would require some care to not pinning all the COW zero pages. I do feel like this is something that the virtualization people care about (and in some sense what I just mentioned is essentially doing userspace page management...) so there's a chance that we could benefit from whatever they've added to the kernel...

rocallahan commented 2 years ago

I don't think mlocking everything is feasible.

I assume this info is available in the kernel so at least it is in principle possible with some kernel patch.

How would we fix this in the kernel? It sounds to me like these SVE instructions don't actually generate page faults?

yuyichao commented 2 years ago

How would we fix this in the kernel

I mean to make sure the paging for the recording and replaying are identical.

rocallahan commented 2 years ago

I actually don't understand how these instructions are supposed to work in practice, if they never trigger page-in. Do you have to try the non-faulting instruction and if you don't get any valid data, retry with a faulting instruction?

rocallahan commented 2 years ago

Or do they trigger faults for the first byte but not the rest, or something like that?

yuyichao commented 2 years ago

I have never got my hand on actual hardware so I'm not 100% sure, but my understanding is that you would use them in the following pattern.

Do a first-fault load
If there is a fault on any but the first element, instead of raising an exception, the bits in a mask register will be cleared, say it'll be [1, 0, 0, 0] if the first one is valid and the rest faults
Do the rest of the vectorized loop using this mask as the base and forward the loop counter accordingly
Check loop termination condition and loop again.

In the example above, on the next iteration, the first element would be the second element in the previous loop, and it will trigger a real fault and the kernel will do whatever it need to do to handle that, either page in some memory or send a signal.

There is also non-fault instructions and although I haven't really seen any explicit document on how they should be used, I assume they'll be used for loop unrolling, i.e. you can load more than one SVE registers per loop. The first one would use first-fault and the rest would use non-fault.

yuyichao commented 2 years ago

Assuming that the fault doesn't happen often, I feel like the simplest way to deal with this is binary instrumentation (which can also be used to record ll-sc)... For nf or ff SVE load, we could simply replace it with a normal load and catch the segfault.

It might also be possible to just record the value of ffr (the register that returns which element has faulted) but doing that without a usable pointer to memory is going to be difficult. From the code examples ARM has posted in various presentations, it seems that there may be many SVE loops that contains virtually no temporary GP registers that we can overwrite.

khuey commented 2 years ago

Assuming that the fault doesn't happen often, I feel like the simplest way to deal with this is binary instrumentation (which can also be used to record ll-sc)... For nf or ff SVE load, we could simply replace it with a normal load and catch the segfault.

Adding binary instrumentation would be a radical change to rr's architecture and one that I don't think we would take.

rocallahan commented 2 years ago

We've made a lot of tradeoffs to avoid requiring full binary instrumentation during recording. That has benefited us by giving us lower single-threaded recording overhead, and a simpler and more maintainable design that doesn't require work for every new instruction as architectures evolve. I think robustly handling all the stuff rr currently handles (e.g. signals, sandboxes, exotic clone() options) while binary instrumentation runs in the recorded tracees would also be pretty complex.

Performing full binary instrumentation during recording is not crazy --- UndoDB does it AFAIK --- and would let us choose very different tradeoffs, but this means ultimately you'd want a very different design. E.g. the way we handle CPUID and RDTSC, the way we handle syscalls, maybe even the way we (don't) handle multiple cores would probably all end up in a very different place. I think we'd probably want to rearchitect rr from the ground up, perhaps reusing some of the existing code. It would be a fun project to work on but it's not something I want to work on right now.

Keno commented 2 years ago

Re the SVE thing. Can we talk to ARM about documenting a mode where those non-/(first-) faulting instructions are turned into regular loads? I believe all chips that supports these SVE instructions have hypervisor accessible patch registers that can change the instructions. It might require some convincing, but is should be technically possible.

yuyichao commented 2 years ago

I think we'd probably want to rearchitect rr from the ground up, perhaps reusing some of the existing code.

I was mainly thinking if it's possible to do that with minimum refactoring. I was hoping that it these should have minimum interaction with the rest of the code but I'm of course not sure...

I believe all chips that supports these SVE instructions have hypervisor accessible patch registers that can change the instructions

You mean some registers that changes specific instructions? Or is it something more generic?

Keno commented 2 years ago

You mean some registers that changes specific instructions?

Yes:

https://git.lumina-sensum.com/LuminaSensum/arm-trusted-firmware/blob/942013e1dd57429432cd71cfe121d702e3c52465/lib/cpus/aarch64/neoverse_n1.S#L53-L61

yuyichao commented 2 years ago

K, so I assume pretty much the chicken bits. Though these particular ones seems to be documented as should be enabled before mmu…..

Theoretically, if there’s a way to have the hardware trap any instruction that we can’t handle (e.g. stxr) or under a condition that we can’t handle (e.g. ldff) then it should be totally fine of course. I’ve just personally never have any experience of convincing multiple vendors to be on board to implement a new feature so far ….

Manouchehri commented 2 years ago

@yuyichao On your M1, what is your operating system setup? Linux on bare metal, or in a VM under macOS?

Keno commented 2 years ago

Only bare metal is supported. Apple does not expose the performance counters in VMs.

GitMensch commented 1 year ago

The website says:

requires a reasonably modern x86 CPU or certain ARM CPUs (Apple M1+)

So - are all/some of the issues mentioned here solved now or are those only related to "other ARM CPUs"?

rocallahan commented 1 year ago

Let's leave this open to track these ARM features we might need in the future.

DemiMarie commented 8 months ago

Only bare metal is supported. Apple does not expose the performance counters in VMs.

Is bare metal reasonably usable for browsers? rr allows sandbox escapes, so I don’t know if it is reasonable to use it for a browser that is going to be accessing untrusted web content, which is the usual case.

DemiMarie commented 8 months ago

@rocallahan I can say that in my experience (developer of @QubesOS), there are times when I wanted to use rr, but I was never able to do so because of the performance counter requirement. Xen doesn’t expose performance counters in VMs, and it is a type 1 hypervisor so everything is a VM. Furthermore, rr allows sandbox escapes, so using it for web browsers accessing untrusted web content is ill-advised outside of a test system.

rocallahan commented 8 months ago

This really belongs in another thread. Filed https://github.com/rr-debugger/rr/issues/3705

victorldn commented 2 months ago

May I ask whether there is any ongoing interest SVE-support enablement? Or have the issues associated with handling first-faulting loads made this a no-go?

DemiMarie commented 2 months ago

I think binary instrumentation and mlockall() are the only viable options.

rr-debugger / rr

AArch64 support status and issues #3234