Closed thixotropist closed 5 months ago
Let's start with the hardest long-term topic, then fill in with the easier material when we hit the inevitable brick wall.
The next version of GCC, gcc-14, apparently adds support for both RISCV vector intrinsic functions and for auto-vectorization of loops. That suggests we are likely to see vector extension functions appearing in many more places starting mid to late 2024. Ghidra's decompiler simply stops analysis of a function when it hits its first vector instruction, so the first goal involves finding vector instruction exemplars in the wild. The site https://github.com/riscv-non-isa/rvv-intrinsic-doc/tree/main/examples looks promising for this; `memcpy` and `strncpy` would be good immediate goals.

That sounds easy enough, but the RISCV vector intrinsic functions are built on all possible vector instructions multiplied by all possible vector configurations, apparently resulting in some 28,000 builtin function signatures. That's so many that there is no C header file holding them: a generated `riscv_vector.h` is compiled into an intermediate representation when the compiler is built, then injected directly into a running GCC-14 instance if and when `riscv_vector.h` is imported.
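For a concrete sense of what those intrinsics look like in source form, here is a minimal sketch of a vectorized byte copy in the style of the rvv-intrinsic-doc examples. The function name `copy_bytes` is invented for illustration, and a scalar fallback is added so the sketch also compiles where RVV intrinsics are unavailable:

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__riscv_v_intrinsic)
#include <riscv_vector.h>
#endif

/* Hedged sketch: a strip-mined memcpy-style loop using RVV intrinsics.
 * Each pass asks vsetvl how many elements fit, then does one vector
 * load and one vector store of that many bytes. */
void copy_bytes(uint8_t *dst, const uint8_t *src, size_t n) {
#if defined(__riscv_v_intrinsic)
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e8m8(n);          /* elements this pass */
        vuint8m8_t v = __riscv_vle8_v_u8m8(src, vl); /* vector load        */
        __riscv_vse8_v_u8m8(dst, v, vl);             /* vector store       */
        src += vl; dst += vl; n -= vl;
    }
#else
    /* scalar fallback for non-RVV builds */
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
#endif
}
```

Each of those intrinsic calls maps to roughly one instruction, which is why a single source line fans out into thousands of builtin signatures across element widths and LMUL settings.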
A complete Ghidra solution might involve generating a large set of vector modes, tracking changes to those modes through logic branches, then picking the matching vector intrinsic function pcode operation for each vector instruction. That's far too complicated a solution for today.
Instead, let's simply look to add vector pcode semantics to RISCV sleigh files as we find vector instructions in exemplars. We won't try to completely capture vector state, just give a human user enough context to identify basic vector operations within larger programs.
This project will have a branching-tree structure, as we explore where instruction set extensions might appear in contexts relevant to Ghidra. There will be plenty of backtracking, so we need good living documents on where we find RISCV instruction set extensions and what Ghidra can do to help understand them.
The last few commits added a few riscv vector intrinsics, the kinds one might find inside a libc or matrix math library. Now we want to add a bit more specificity, looking for exemplars that a Ghidra user might run across.
For example: vectorized `memcpy` and `strcpy`. Will Ghidra still be helpful in identifying 'buffer blasting' vulnerabilities, or will analysis fail when it reaches the first vector load instruction?

Another set of exemplars depends on gcc-14 or later releases: riscv binaries compiled with autovectorization enabled.
The gcc testsuite files under `gcc/testsuite/gcc.target/riscv/rvv/autovec/gather-scatter` would make decent examples. Note that this requires compilation flags like `-march=rv64gcv_zvfh -mabi=lp64d -O3 --param riscv-autovec-preference=scalable -fno-vect-cost-model -ffast-math`. These additional exemplars should probably get deferred until closer to the gcc-14 release, maybe mid 2024.
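For the `strcpy` question above, the source pattern of interest is the classic unbounded copy loop. A minimal sketch follows; the function name is hypothetical, and whether gcc-14 actually vectorizes it depends on flags and on its early-break vectorization, which leans on fault-only-first loads such as `vle8ff.v`:

```c
#include <stddef.h>

/* Hedged sketch: the unbounded copy loop behind "buffer blasting" bugs.
 * The loop has no bound on dst - that missing check is exactly what we
 * hope Ghidra can still surface after vectorization. */
char *copy_until_nul(char *dst, const char *src) {
    char *d = dst;
    while ((*d++ = *src++) != '\0')
        ;   /* copy bytes until the NUL terminator, with no length limit */
    return dst;
}
```

The interesting test is whether the decompiler output for the vectorized form still makes the absence of a bounds check visible to a human reviewer.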
We still need to find riscv vector exemplars that make sense in network appliances. Vector implementations of AES-GCM would be useful. Vector algorithms for hash table and tree map lookups would be especially nice, as they could be used in sessionization of inbound IP and MPLS packets. Vector instructions might improve throughput of some operations by a factor of 2 or 3, but they won't automagically fix memory bandwidth limits inherent in a riscv system. Vector solutions will likely increase latency while enabling higher throughput, a tradeoff that depends on the application.
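To make the sessionization idea concrete, here is a hedged sketch of the scalar baseline such vector lookups would compete with: an FNV-1a hash over an IPv4 5-tuple. The `flow_key` and `flow_hash` names are invented for illustration; callers should zero the struct before filling it so padding bytes hash deterministically:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flow key for sessionizing inbound packets. */
struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* Scalar FNV-1a over the raw key bytes - the kind of per-packet hash
 * a vector or carry-less-multiply extension might accelerate. */
uint64_t flow_hash(const struct flow_key *k) {
    const uint8_t *p = (const uint8_t *)k;
    uint64_t h = 1469598103934665603ULL;      /* FNV-1a offset basis */
    for (size_t i = 0; i < sizeof *k; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;                /* FNV-1a 64-bit prime */
    }
    return h;
}
```

A vectorized hash table probe would batch several such keys per lookup, which is where the latency-for-throughput tradeoff noted above shows up.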
Commit 1da6a56c5 adds some of the THead vendor-specific ISA extensions described in https://github.com/T-head-Semi/thead-extension-spec/releases/download/2.0.0/xthead-2022-09-05-2.0.0.pdf and implemented in binutils 2.41.
Consider adding 32 bit RISCV exemplars. This may include:
It would be especially nice to find an exemplar showing how 64 bit and 32 bit cores might communicate with one another.
What will Ghidra do with vectorized loops on processors with RISCV-64 vector extensions? We can find out by adding exemplars drawn from the unreleased GCC-14 RISCV-64 vectorization test suite. For that we need a toolchain and platform based on what we expect to see in mid 2024, when GCC-14 should be released. This platform will mutate quickly, based on the development tip of GCC, binutils, and glibc - all cast into a Bazel 7.0 build environment.
Add some rust binaries, compiled with both llvm and gcc, and ideally for both x86_64 and riscv-64 processors. An initial exercise might involve matching rust strings with logging and assertion calls, for instance:

```rust
assert_eq!(
    expected_count,
    2,
    "Resource counter should be two at this point"
);
log::info!("Done with primitives");
log::info!(
    "bad_result.is_err(): {:?}",
    bad_result.is_err()
);
```
The gccrs rust compiler can't handle macros from std yet, so defer rust exemplars.
The gcc-14 developmental toolchain is in place. We probably want to add simple C exemplars in groups to show:
The path forward for Ghidra imports should track the path taken by GCC toolchains. There isn't much point in getting too far ahead of what the compiler typically does. This suggests using `-O2` instead of something more detailed, and using `-march=x86-64-v3` instead of `-march=sapphirerapids` if we need an Intel reference.

`memcpy` is much more likely to be inlined than `strncmp`, so we should be able to recognize the 6 or so different instruction sequences generated by gcc when it expands its `cpymem` RTL instruction. So the next set of exemplars will be built around GCC testsuite code validating `memcpy`-like operations, showing the two types of C source code most likely to be translated into `cpymem` RTL instructions and the 6 or so instruction patterns generated with the various RISCV `march` common options. The result should be a human-readable document helping Ghidra users recognize inlined `cpymem` expansions.
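As a sketch of the source patterns involved, here is our guess at two C forms gcc commonly lowers to its `cpymem` RTL pattern and then inlines rather than calling out to libc; the struct and function names are hypothetical:

```c
#include <string.h>

/* Hypothetical fixed-layout struct, 64 bytes with no padding. */
struct packet {
    unsigned char hdr[16];
    unsigned char payload[48];
};

/* (1) memcpy with a compile-time-constant length. */
void copy_fixed(unsigned char *dst, const unsigned char *src) {
    memcpy(dst, src, 64);   /* length known at compile time, so gcc
                               typically expands it inline */
}

/* (2) whole-struct assignment, which gcc expands the same way. */
void copy_struct(struct packet *dst, const struct packet *src) {
    *dst = *src;            /* no memcpy call appears in the source */
}
```

In the disassembly, neither function need contain a `memcpy` call symbol, which is precisely why a reference document on the inlined expansion patterns would help Ghidra users.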
The next set of exemplars covers short term, medium term, and longer term issues. We want to understand how soon extension-based optimizations will be a problem for Ghidra, and how much time it might take to work up semantic hints that may address those problems.
`vset` instructions followed by vector load and store instructions, possibly within a short loop. These vector instructions show more diversity than one might expect, possibly to replicate scalar alignment exception handling. Ghidra users likely don't need to care about that diversity.

If possible, we should develop each of these binary exemplars for vanilla gcc-14 `-O2` optimization, gcc-14 vector + bit manipulation optimization extensions, and gcc-14 vector + bit manipulation + THead extensions.
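For readers unfamiliar with that `vset` loop shape, a scalar C rendition of the strip-mined structure may help; `CHUNK` is an invented stand-in for the hardware-selected element count a `vsetvli` instruction would return each iteration:

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK 16  /* stand-in for the hardware vector length */

/* Hedged sketch: the loop structure behind a vset + vector load/store
 * sequence. Each pass handles up to CHUNK bytes; the inner loop plays
 * the role of one vector load plus one vector store. */
void strip_mined_copy(uint8_t *dst, const uint8_t *src, size_t n) {
    while (n > 0) {
        size_t vl = n < CHUNK ? n : CHUNK;  /* what vsetvli would return */
        for (size_t i = 0; i < vl; i++)     /* one vle8.v / vse8.v pair  */
            dst[i] = src[i];
        dst += vl; src += vl; n -= vl;
    }
}
```

The tail iteration needs no special-case code because the final `vl` simply shrinks, which is one reason the emitted vector loops look more uniform than their scalar unrolled counterparts.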
Close this as we have enough exemplars to go forward. Tests show some minor gaps in the Ghidra 11 isa_ext branch.
Determine the next set of exemplars - and design questions - to tackle. The meta-issue involves guessing where new Ghidra capabilities in analyzing RISCV-64 evolution will make a difference, say in 2026 when higher performance RISCV-64 cores appear in network appliances. We'll continue to prioritize network over machine learning and general server extensions, so things like half-precision floating point will get less attention than topics like fast hashing or advanced management of many-core access to IO memory and cache hierarchies. Exemplars that resemble a RISCV-64 alternative to Xtensa or Cavium Network Processing Unit designs are good, if we can find any.
Possible topics to explore next include:

- `libssl`
- `libc`, for example implementing faster versions of `memmove` and `strncpy`.

Possible priorities include: