thixotropist / ghidra_import_tests

Experimental framework for testing Ghidra binary import support

Identify the next set of RISCV-64 exemplars #6

Closed thixotropist closed 5 months ago

thixotropist commented 9 months ago

Determine the next set of exemplars - and design questions - to tackle next. The meta-issue involves guessing where new Ghidra capabilities for analyzing evolving RISCV-64 code will make a difference, say in 2026 when higher performance RISCV-64 cores appear in network appliances. We'll continue to prioritize network over machine learning and general server extensions, so things like half-precision floating point will get less attention than topics like fast-hashing or advanced management of many-core access to IO memory and cache hierarchies. Exemplars that resemble a RISCV-64 alternative to Xtensa or Cavium Network Processing Unit designs are good, if we can find any.

Possible topics to explore next include:

Possible priorities include:

thixotropist commented 9 months ago

Let's start with the hardest long-term topic, then fill in with the easier material when we hit the inevitable brick wall.

The next version of GCC, gcc-14, apparently adds support for both RISCV intrinsic functions and for auto-vectorization of loops. That suggests we are likely to see vector extension functions appearing in many more places starting mid to late 2024. Ghidra's decompiler simply stops analysis of a function when it hits its first vector instruction, so the first goal involves:

  1. identify a sample set of GCC-14 vector intrinsic functions likely to be found in the wild. The site https://github.com/riscv-non-isa/rvv-intrinsic-doc/tree/main/examples looks promising for this.
  2. import a GCC-14 toolchain sufficient to compile these intrinsic examples into object files Ghidra can import
  3. add semantics to the Ghidra RISCV sleigh files to give the user a hint about what the object code is doing. Recognizing the vector version of memcpy and strncpy would be good immediate goals.
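As a concrete target for steps 1 and 2, here is a minimal sketch of a vector memcpy in the style of the rvv-intrinsic-doc examples. The function name memcpy_vec and the scalar fallback are assumptions added for illustration. The fallback exists only so the same file builds on hosts without the vector extension, and the vector path is taken only when the compiler defines __riscv_vector.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__riscv_vector)
#include <riscv_vector.h>
#endif

/* Copy n bytes from source to destination; the regions must not overlap.
 * memcpy_vec is a hypothetical name used for this sketch. */
void *memcpy_vec(void *restrict destination, const void *restrict source, size_t n)
{
    uint8_t *dst = destination;
    const uint8_t *src = source;
#if defined(__riscv_vector)
    /* Strip-mined loop: vsetvl reports how many bytes (vl) this pass handles. */
    for (size_t vl; n > 0; n -= vl, src += vl, dst += vl) {
        vl = __riscv_vsetvl_e8m8(n);
        vuint8m8_t chunk = __riscv_vle8_v_u8m8(src, vl);
        __riscv_vse8_v_u8m8(dst, chunk, vl);
    }
#else
    /* Scalar fallback (an assumption of this sketch) for non-vector hosts. */
    while (n-- > 0) {
        *dst++ = *src++;
    }
#endif
    return destination;
}
```

Built with a vector-enabled -march, the loop body should reduce to a vsetvli, a unit-stride vector load, and a unit-stride vector store, which is the pattern we would like Ghidra to present as something recognizably memcpy-like.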

That sounds easy enough, but the RISCV vector intrinsic functions are built on all possible vector instructions multiplied by all possible vector configurations, apparently resulting in some 28,000 builtin function signatures. That's so many that there is no C header file holding them - a generated riscv_vector.h is compiled into an intermediate representation when the compiler is built, then injected directly into a running GCC-14 instance if and when riscv_vector.h is included.

A complete Ghidra solution might involve generating a large set of vector modes, tracking changes to those modes through logic branches, then picking the matching vector intrinsic function pcode operation for each vector instruction. That's far too complicated a solution for today.

Instead, let's simply look to add vector pcode semantics to RISCV sleigh files as we find vector instructions in exemplars. We won't try to completely capture vector state, just give a human user enough context to identify basic vector operations within larger programs.

This project will have a branching-tree structure, as we explore the places where instruction set extensions might appear in contexts relevant to Ghidra. There will be plenty of backtracking, so we need good living documents on where we find RISCV instruction set extensions and what Ghidra can do to help understand them.

thixotropist commented 9 months ago

The last few commits added a few riscv vector intrinsics, the kinds one might find inside of a libc or matrix math library. Now we want to add a bit more specificity, looking for exemplars that a Ghidra user might run across.

For example:

thixotropist commented 8 months ago

Another set of exemplars depends on gcc-14 or later releases - riscv binaries compiled with autovectorization enabled. The gcc testsuite files under gcc/testsuite/gcc.target/riscv/rvv/autovec/gather-scatter would make decent examples. Note that this requires compilation flags like -march=rv64gcv_zvfh -mabi=lp64d -O3 --param riscv-autovec-preference=scalable -fno-vect-cost-model -ffast-math. These additional exemplars should probably get deferred until closer to gcc-14 release, maybe mid 2024.
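For orientation, a minimal sketch of the kind of loop the gather-scatter tests exercise. These two functions are written for illustration and are not copied from the GCC testsuite; the names gather_i32 and scatter_i32 are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Gather: each output element is loaded through an index table. */
void gather_i32(int32_t *restrict dst, const int32_t *restrict src,
                const int32_t *restrict idx, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[idx[i]];
    }
}

/* Scatter: each input element is stored through an index table. */
void scatter_i32(int32_t *restrict dst, const int32_t *restrict src,
                 const int32_t *restrict idx, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[idx[i]] = src[i];
    }
}
```

Compiled as plain scalar code these are ordinary indexed loops. With the autovectorization flags listed above, the expectation is that the compiler replaces them with indexed vector loads and stores, which is where Ghidra's handling becomes interesting.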

We still need to find riscv vector exemplars that make sense in network appliances. Vector implementations of AES-GCM would be useful. Vector algorithms for hash table and tree map lookups would be especially nice, as they could be used in sessionization of inbound IP and MPLS packets. Vector instructions might improve throughput of some operations by a factor of 2 or 3, but they won't automagically fix memory bandwidth limits inherent in a riscv system. Vector solutions will likely increase latency while enabling higher throughput, a tradeoff that depends on the application.
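Until a real network exemplar turns up, a placeholder of roughly the right shape might be a batched hash step like the one below. This is a sketch under assumptions, not a found exemplar: the function name hash_batch, the 32-bit key, and the multiplicative constant are all chosen for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Map a batch of 32-bit keys to bucket indexes with a multiplicative hash.
 * table_bits must be in the range 1..31; the table has 2^table_bits buckets.
 * The constant is a commonly used odd multiplier, picked here as an assumption. */
void hash_batch(uint32_t *restrict bucket, const uint32_t *restrict key,
                size_t n, unsigned table_bits)
{
    for (size_t i = 0; i < n; i++) {
        bucket[i] = (key[i] * 2654435761u) >> (32u - table_bits);
    }
}
```

Each iteration is independent, so the loop is a plausible candidate for autovectorization. The bucket lookup that follows would be a gather, which is where the memory bandwidth caveat above applies.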

thixotropist commented 8 months ago

Commit 1da6a56c5 adds some of the THead vendor-specific ISA extensions described in https://github.com/T-head-Semi/thead-extension-spec/releases/download/2.0.0/xthead-2022-09-05-2.0.0.pdf and implemented in binutils 2.41.

thixotropist commented 8 months ago

Consider adding 32 bit RISCV exemplars. This may include:

It would be especially nice to find an exemplar showing how 64 bit and 32 bit cores might communicate with one another.

thixotropist commented 6 months ago

What will Ghidra do with vectorized loops on processors with RISCV-64 vector extensions? We can find out by adding exemplars drawn from the unreleased GCC-14 RISCV-64 vectorization test suite. For that we need a toolchain and platform based on what we expect to see in mid 2024, when GCC-14 should be released. This platform will mutate quickly, based on the development tip of GCC, binutils, and glibc - all cast into a Bazel 7.0 build environment.

thixotropist commented 6 months ago

Add some rust binaries, compiled with both llvm and gcc, and ideally for both x86_64 and riscv-64 processors. An initial exercise might involve matching rust strings with logging and assertion calls, for instance:

assert_eq!(
        expected_count,
        2,
        "Resource counter should be two at this point"
);
log::info!("Done with primitives");
log::info!(
        "bad_result.is_err(): {:?}",
        bad_result.is_err()
);

thixotropist commented 6 months ago

The gccrs rust compiler can't handle macros from std yet, so defer rust exemplars.

The gcc-14 developmental toolchain is in place. We probably want to add simple C exemplars in groups to show:

thixotropist commented 6 months ago

The path forward for Ghidra imports should track the path taken by GCC toolchains. There isn't much point in getting too far ahead of what the compiler typically does. This suggests:

So the next set of exemplars will be built around GCC testsuite code validating memcpy-like operations, showing the two types of C source code most likely to be translated into cpymem RTL instructions and the 6 or so instruction patterns generated with the various RISCV march common options. The result should be a human-readable document helping Ghidra users recognize inlined cpymem expansions.
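A minimal sketch of what such source might look like, assuming the two source forms are an explicit fixed-size memcpy call and a structure assignment. That pairing is an assumption made for illustration, as are the struct packet_hdr type and the function names.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-size record standing in for a real header type. */
struct packet_hdr {
    uint8_t bytes[64];
};

/* Form 1: explicit memcpy with a compile-time constant length. */
void copy_fixed(void *restrict dst, const void *restrict src)
{
    memcpy(dst, src, sizeof(struct packet_hdr));
}

/* Form 2: structure assignment, which the compiler lowers to a block copy. */
void copy_struct(struct packet_hdr *restrict dst,
                 const struct packet_hdr *restrict src)
{
    *dst = *src;
}
```

Compiling both functions once per march variant and comparing the disassembly side by side would give the instruction patterns the document needs to catalog.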

thixotropist commented 5 months ago

The next set of exemplars cover short term, medium term, and longer term issues. We want to understand how soon extension-based optimizations will be a problem for Ghidra, and how much time it might take to work up semantic hints that may address those problems.

If possible, we should develop each of these binary exemplars for a vanilla gcc-14 -O2 optimization, gcc-14 vector + bit manipulation optimization extensions, and a gcc-14 vector + bit manipulation + THead extensions.

thixotropist commented 5 months ago

Close this as we have enough exemplars to go forward. Tests show some minor gaps in the Ghidra 11 isa_ext branch.