Recast project goals and scope

This project seeks a set of Ghidra import regression tests that validate sensible behavior when executable binaries are imported into new versions of Ghidra. It has morphed somewhat into generating newer executable binaries that Ghidra cannot currently import sensibly, but might in some future Ghidra release.

Some of those newer binaries might never be handled sensibly by Ghidra's more advanced features, such as decompilation into compilable C and dynamic analysis/emulation.

Two examples from the RISC-V processor space:

Vector instruction set extensions allow optimizing compilers to significantly restructure loops and memory accesses to take advantage of greatly expanded register space, parallel execution, and loop unrolling. For Vector Length Agnostic architectures like RISC-V, this means element type and length information is carried in context registers rather than encoded into individual instructions. The QEMU emulator uses about 25K lines of code to track that context and emulate vector instructions. These semantics are not likely to be integrated into Ghidra's RISC-V Sleigh definition any time soon. Type inference of vector instruction results is likely to be especially difficult, making it impossible to select the proper RISC-V vector intrinsic function to present in the Ghidra decompiler window.

Much RISC-V processor development appears to be aimed at Machine Learning or image recognition/surveillance applications, where 16-bit floating point operations, with and without saturation, are common. Ghidra has no current support for emulating either of the two common 16-bit floating point formats currently supported by GCC.
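
To make the second example concrete, here is a minimal Python sketch showing why an emulator has to know which 16-bit format is in play. It assumes the two formats meant above are IEEE 754 binary16 and bfloat16; the helper names are hypothetical and nothing here is Ghidra code.

```python
import struct


def decode_binary16(bits: int) -> float:
    """Interpret 16 bits as IEEE 754 binary16: 1 sign, 5 exponent, 10 fraction bits."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def decode_bfloat16(bits: int) -> float:
    """Interpret 16 bits as bfloat16: 1 sign, 8 exponent, 7 fraction bits.

    A bfloat16 value is the upper half of a binary32 value, so shifting it
    into the top 16 bits of a 32-bit word gives an exactly equal binary32.
    """
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


raw = 0x3C00  # one bit pattern, two very different values
print(decode_binary16(raw))  # 1.0
print(decode_bfloat16(raw))  # 0.0078125
```

The same register contents decode to 1.0 or to 2^-7 depending on the assumed format, so an emulator or decompiler that guesses wrong produces plausible-looking but incorrect values.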

Let's try recasting this project to explore feasible Ghidra integration tests. Assume we have two 64-bit RISC-V executables built from DPDK and Whisper.cpp sources and compiled for the Sophgo 2380 processor. What features - and feature tests - would we need to add to Ghidra to support static analysis? If we need dynamic analysis, do we look to Ghidra for that, or do we rely on a 64-bit RISC-V VM or QEMU?

Start with a subjective Ghidra integration test, where we have a training case study analyzing a misbehaving RISC-V network appliance built with the latest GCC toolchain for the latest microarchitectures. A successful Ghidra feature will make that case study easier to follow, for a modest development effort.

For example, adding Sleigh definitions and user pcode operations for vector instructions allows the disassembler and decompiler to extend static analysis to more functions. This can give the user a much clearer perspective on internal operations.

Those vector instructions are also likely to confuse the user's perspective, as they often replace multiple simple scalar operations with fewer but more complex vector instructions. Can Ghidra do something to help reduce that confusion, or do we rely on user training aids to recognize common vector instruction sequences?
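
One candidate training aid is a side-by-side model of a scalar loop and its strip-mined vector form. The Python sketch below is an illustration under stated assumptions, not compiler output: the function names are hypothetical, `vsetvli` is modeled with the simplest rule the vector specification permits (vl = min(AVL, VLMAX)), VLEN defaults to an assumed 128 bits, and integer overflow is ignored. The assembly mnemonics in the comments show the kind of sequence a user would need to recognize.

```python
def vsetvli(avl: int, sew: int, lmul: int, vlen: int = 128) -> int:
    """Model vsetvli: pick the element count for the next loop pass.

    VLMAX = VLEN / SEW * LMUL. SEW and LMUL live in the vtype context
    register, not in the later vector instructions themselves.
    """
    vlmax = (vlen // sew) * lmul
    return min(avl, vlmax)


def scalar_add(a, b):
    """The loop as written in C: one element per iteration."""
    return [a[i] + b[i] for i in range(len(a))]


def vector_add(a, b, sew=32, lmul=1, vlen=128):
    """The same loop after vectorization: vl elements per iteration."""
    out, trace = [], []
    i, remaining = 0, len(a)
    while remaining > 0:
        vl = vsetvli(remaining, sew, lmul, vlen)    # vsetvli t0, a0, e32, m1, ta, ma
        va = a[i:i + vl]                            # vle32.v v8, (a1)
        vb = b[i:i + vl]                            # vle32.v v9, (a2)
        out.extend(x + y for x, y in zip(va, vb))   # vadd.vv v8, v8, v9 ; vse32.v v8, (a3)
        trace.append(vl)
        i += vl                                     # pointer bumps
        remaining -= vl                             # sub a0, a0, t0 ; bnez a0, loop
    return out, trace


a = list(range(10))
b = [100] * 10
result, trace = vector_add(a, b)
print(result == scalar_add(a, b))  # True
print(trace)                       # [4, 4, 2]
```

The trace shows the final pass handling a short tail with no separate scalar cleanup loop, which is one of the patterns that makes vectorized code look unfamiliar next to its source.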

We will start with a RISC-V network application built on a DPDK framework, believed to be similar to the dpdk-ip_pipeline example. The inputs to Ghidra will be reference builds of dpdk-ip_pipeline with and without symbols stripped, plus a real-time snapshot of dpdk-ip_pipeline after initialization within a RISC-V emulation environment. That work probably needs a separate project repository.
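
As a rough sketch of how the two reference builds might be fed to Ghidra, the Python below assembles `analyzeHeadless` command lines without running them. The install directory, binary paths, project names, and the post-script name `ExportFunctionStats.py` are all hypothetical placeholders. Only the launcher location (`support/analyzeHeadless`) and the `-import`, `-postScript`, and `-deleteProject` options come from Ghidra's headless analyzer. The emulator snapshot is left out because its format has not been decided.

```python
from pathlib import Path


def headless_command(ghidra_home, project_dir, project_name, binary, post_script):
    """Build (but do not run) a Ghidra headless import command line."""
    launcher = Path(ghidra_home) / "support" / "analyzeHeadless"
    return [
        str(launcher),
        str(project_dir),            # where the throwaway project is created
        project_name,
        "-import", str(binary),
        "-postScript", post_script,  # hypothetical script that records results
        "-deleteProject",            # discard the project after the script runs
    ]


# Hypothetical file names for the unstripped and stripped reference builds.
inputs = {
    "with_symbols": "builds/dpdk-ip_pipeline",
    "stripped": "builds/dpdk-ip_pipeline.stripped",
}

commands = {
    label: headless_command(
        "/opt/ghidra", "/tmp/ghidra_projects", f"ip_pipeline_{label}",
        path, "ExportFunctionStats.py",
    )
    for label, path in inputs.items()
}

for label, cmd in commands.items():
    print(label, " ".join(cmd))
```

Each list can be passed to `subprocess.run` once real paths are filled in. Comparing the post-script output of the stripped build against the unstripped one would give a first measure of how much the analysis depends on symbols.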

thixotropist / ghidra_import_tests
