Open DrChat opened 2 years ago
hmm, i was thinking of a bit of a different approach: as you've noted, `Operand` and `Opcode` generally are very similar, so i think we could actually replace explicitly writing out the `Opcode` enum with a bit of codegen that also handles laying out the string arrays over in `*/display.rs`. that, i think, would also let us assign explicit values to the `Opcode` variants so `yaxpeax-x86` could have data structures laid out like...
const MNEMONICS: &'static [&'static str] = &[
    "add",
    "sub",
    "aaa", // only 16- and 32-bit `Opcode` reference this
    "aas", // only 16- and 32-bit `Opcode` reference this
    "movsx", // only 64-bit `Opcode` references this
    "mov", // used in all modes
];
mod real_mode {
    enum Opcode {
        ADD = 0,
        SUB = 1,
        AAA = 2,
        AAS = 3,
        MOV = 5,
        ...
    }
}
mod long_mode {
    enum Opcode {
        ADD = 0,
        SUB = 1,
        MOVSX = 4,
        MOV = 5,
        ...
    }
}
mod quasi_x86_name_pending {
    // note that _this_ `Opcode` has the same integer values for each variant, so a conversion to this opcode can be just a transmute
    enum Opcode {
        ADD = 0,
        SUB = 1,
        AAA = 2,
        AAS = 3,
        MOVSX = 4,
        MOV = 5,
        ...
    }
}
where this could get generated from a table like
add=all,
sub=all,
aaa=16,32
aas=16,32
movsx=64
mov=all,
...
spitballing, i really haven't thought about the table layout in particular. this could let us generate the `Colorize` impl too, which is just kinda gross to maintain.
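A minimal sketch of what generating from such a table could look like, assuming the `name=modes` line format spitballed above. `Row`, `parse_table`, and `emit_enum` are hypothetical names made up for illustration, and the choice of `#[repr(u16)]` is an assumption. The row index is used as the discriminant, so it also indexes the shared mnemonic array.

```rust
/// One row of the (hypothetical) opcode table: a mnemonic and the modes it exists in.
struct Row {
    mnemonic: String,
    modes: Vec<u8>, // any of 16, 32, 64
}

/// Parse lines like `add=all,` or `aaa=16,32`. A trailing comma is tolerated.
fn parse_table(src: &str) -> Vec<Row> {
    src.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .filter_map(|line| {
            let (name, modes) = line.trim_end_matches(',').split_once('=')?;
            let modes: Vec<u8> = if modes.trim() == "all" {
                vec![16, 32, 64]
            } else {
                modes
                    .split(',')
                    .filter_map(|m| m.trim().parse::<u8>().ok())
                    .collect()
            };
            Some(Row { mnemonic: name.trim().to_string(), modes })
        })
        .collect()
}

/// Emit the source text of an `Opcode` enum for one mode.
/// Each variant's discriminant is its row index in the full table,
/// so the same opcode gets the same value in every mode.
fn emit_enum(rows: &[Row], mode: u8) -> String {
    let mut out = String::from("#[repr(u16)]\npub enum Opcode {\n");
    for (index, row) in rows.iter().enumerate() {
        if row.modes.contains(&mode) {
            out.push_str(&format!("    {} = {},\n", row.mnemonic.to_uppercase(), index));
        }
    }
    out.push_str("}\n");
    out
}
```

Calling `emit_enum(&rows, 64)` on the table above would skip `aaa` and `aas` but still give `MOV` the value 5, which is what leaves the "holes" in the per-mode enums.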
this is trickier for `Operand` since i don't think we can guarantee layout-compatibility for those. but with something automated handling `Opcode` and a bit of elbow grease around `Operand`, i think then we could have a
mod quasi_x86_name_pending {
    /// an "arch" for a pseudo-x86 - a best-effort superset of 16-, 32-, and 64-bit x86
    pub struct Arch;
    impl yaxpeax_arch::Arch for Arch {
        // same idea as other modes, but with the superset versions of `Opcode` and `Operand`
    }
    struct SupersetDecoderNamePending {
        x86_16: yaxpeax_x86::real_mode::InstDecoder,
        x86_32: yaxpeax_x86::protected_mode::InstDecoder,
        x86_64: yaxpeax_x86::long_mode::InstDecoder,
        current_mode: EnumToSelectWhichDecoder
    }
    impl Decoder<Arch> for SupersetDecoderNamePending {
        fn decode<...>(&self, words: ...) -> Result<Instruction, DecodeError> {
            match self.current_mode {
                x86_16 => self.x86_16.decode(words).map(|inst| inst.into_superset_form())
                ...
            }
        }
    }
}
this would require functions to transform an arch-specific instruction into the common-x86 form, but that fills almost the same niche as your `X86Instruction` trait (though a bit differently). i think it fits the same for uses like yours, but avoids kinda-shared kinda-distinct data for uses where distinguishing between modes is more desired?
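To make the dispatch idea concrete, here is a self-contained toy version. Everything in it is a stand-in: the modules, `Mode`, `SupersetDecoder`, and the one-byte decoders are hypothetical and are not the real `yaxpeax-x86` types. The toy decoders only know `nop` (0x90) and `aaa` (0x37, which is not valid in 64-bit mode).

```rust
// Toy stand-ins only: these are not the real yaxpeax-x86 types or decoders.
mod real_mode {
    pub enum Opcode { NOP, AAA }
    pub struct Instruction { pub opcode: Opcode }

    pub fn decode(words: &[u8]) -> Option<Instruction> {
        match *words.first()? {
            0x90 => Some(Instruction { opcode: Opcode::NOP }),
            0x37 => Some(Instruction { opcode: Opcode::AAA }),
            _ => None,
        }
    }
}

mod long_mode {
    pub enum Opcode { NOP }
    pub struct Instruction { pub opcode: Opcode }

    pub fn decode(words: &[u8]) -> Option<Instruction> {
        match *words.first()? {
            0x90 => Some(Instruction { opcode: Opcode::NOP }),
            _ => None, // 0x37 (`aaa`) does not exist in 64-bit mode
        }
    }
}

mod generic {
    #[derive(Debug, PartialEq)]
    pub enum Opcode { NOP, AAA }
    pub struct Instruction { pub opcode: Opcode }
}

// Per-mode conversions into the common form, written as plain matches.
impl real_mode::Instruction {
    pub fn into_superset_form(self) -> generic::Instruction {
        let opcode = match self.opcode {
            real_mode::Opcode::NOP => generic::Opcode::NOP,
            real_mode::Opcode::AAA => generic::Opcode::AAA,
        };
        generic::Instruction { opcode }
    }
}

impl long_mode::Instruction {
    pub fn into_superset_form(self) -> generic::Instruction {
        let opcode = match self.opcode {
            long_mode::Opcode::NOP => generic::Opcode::NOP,
        };
        generic::Instruction { opcode }
    }
}

pub enum Mode { Real, Long }

pub struct SupersetDecoder { pub current_mode: Mode }

impl SupersetDecoder {
    /// Decode with whichever mode is selected, then lift into the generic form.
    pub fn decode(&self, words: &[u8]) -> Option<generic::Instruction> {
        match self.current_mode {
            Mode::Real => real_mode::decode(words).map(|inst| inst.into_superset_form()),
            Mode::Long => long_mode::decode(words).map(|inst| inst.into_superset_form()),
        }
    }
}
```

Callers that care about the mode keep using the mode-specific decoders directly; only the "just try your best" path pays for the conversion.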
If we are going to do some codegen, I'd highly recommend using a standard format like json (if possible) so others can use that data as well.
Hmm - such a table that links `Opcode` to the `MNEMONICS` array could likely be generated with a procedural macro.
It'd be somewhat difficult to generate the mode-specific tables without resorting to a custom solution (such as a `build.rs` script or a proc-macro crate).
I do like the idea of having all `Opcode` enums be represented by an integral value that is the same for every unique instruction.
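As a small illustration of why a shared integral value is handy, the sketch below reuses the `MNEMONICS` layout from earlier in the thread and lets each mode's `Opcode` index into it directly. The `name` helper and the `#[repr(u16)]` choice are assumptions for the example, not existing API.

```rust
// One mnemonic table shared by every mode; the index is the opcode's value.
const MNEMONICS: &[&str] = &["add", "sub", "aaa", "aas", "movsx", "mov"];

mod real_mode {
    #[derive(Clone, Copy)]
    #[repr(u16)]
    pub enum Opcode { ADD = 0, SUB = 1, AAA = 2, AAS = 3, MOV = 5 }

    impl Opcode {
        /// Look the mnemonic up by discriminant (hypothetical helper).
        pub fn name(self) -> &'static str {
            super::MNEMONICS[self as usize]
        }
    }
}

mod long_mode {
    #[derive(Clone, Copy)]
    #[repr(u16)]
    pub enum Opcode { ADD = 0, SUB = 1, MOVSX = 4, MOV = 5 }

    impl Opcode {
        /// Same lookup, same table: no per-mode string array needed.
        pub fn name(self) -> &'static str {
            super::MNEMONICS[self as usize]
        }
    }
}
```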
Let me think some more and see if I can't work my way towards what you've suggested.
Just to write this idea down to save for later - we could probably implement `Opcode` subsets with a special proc-macro:
#[repr(usize)] // Required to define the layout
enum Opcode {
    ADD,
    AAA,
    AAS,
    SUB,
    MOV,
    MOVSX,
}
mod long_mode {
    #[superset="super::Opcode"]
    #[repr(usize)] // Required to define the layout
    enum Opcode {
        ADD,
        SUB,
        MOV,
        MOVSX,
        INC, // ERROR: Enum variant is not specified in superset enum (as an example)
    }
    // Generated by proc-macro
    impl Opcode {
        // Takes `self` by value: transmuting `&self` would reinterpret the reference, not the enum.
        pub fn to_superset(self) -> super::Opcode {
            // SAFETY: relies on the macro giving each variant the same discriminant as its superset variant.
            unsafe { core::mem::transmute(self) }
        }
        // `enum` is a keyword, so the parameter is named `opcode` instead.
        pub fn from_superset(opcode: super::Opcode) -> Option<Self> {
            todo!()
        }
    }
}
Such a macro would define subset variants to be equivalent to their superset variants (for 1:1 conversion or direct casting in the case of going from a subset to a superset).
ah! i was wondering if you'd made progress on this or put it aside. is there already a proc macro for `superset` or would you have to write that too? the tricky thing here is that if there are holes in the subset enums you'd need to either pick matching underlying values for all variants or make `to_superset` a bit more complicated (otherwise the transmute might map f.ex `long_mode::Opcode::INC` to `::Opcode::MOV`! no good)
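The mismatch can be shown without any `unsafe`. The sketch below uses the same variant orders as the enums above, with implicit discriminants, and round-trips through the raw integer the way a transmute effectively would. `Superset`, `LongMode`, and both functions are hypothetical names for this example; `INC` is left out because it has no superset counterpart here.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
#[repr(usize)]
enum Superset { ADD, AAA, AAS, SUB, MOV, MOVSX } // implicit values 0..=5

#[derive(Debug, Clone, Copy, PartialEq)]
#[repr(usize)]
enum LongMode { ADD, SUB, MOV, MOVSX } // implicit values 0..=3

/// Checked mapping from a raw discriminant back to a `Superset` variant.
fn superset_from_raw(raw: usize) -> Option<Superset> {
    match raw {
        0 => Some(Superset::ADD),
        1 => Some(Superset::AAA),
        2 => Some(Superset::AAS),
        3 => Some(Superset::SUB),
        4 => Some(Superset::MOV),
        5 => Some(Superset::MOVSX),
        _ => None,
    }
}

/// What a discriminant-preserving conversion does when values are NOT aligned:
/// it keeps the integer, not the meaning.
fn naive_to_superset(op: LongMode) -> Option<Superset> {
    superset_from_raw(op as usize)
}
```

Here `naive_to_superset(LongMode::SUB)` comes back as `Superset::AAA`, because both happen to be value 1. Only `ADD` survives the trip, which is why the subset enums need explicit matching values.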
my thought was to list out the whole deal in a table (json like @i509VCB mentioned would make sense) and generate off of that, with the light benefit that we wouldn't have ~6k lines of enum variants anymore :sunglasses:
anyway, if you're planning on putting this down, i might give that idea a try in the next few weeks.
I made more progress - but in the interest of expediency, I've only made progress that directly impacts my project (changes here). And you raise a good point - my thought was to make the subset enums declare values that are equivalent to their superset variants, i.e.
#[superset="super::Opcode"]
#[repr(usize)]
enum Opcode {
    ADD = super::Opcode::ADD as usize, // discriminants must be integers, hence the cast
    // ...
}
Done implicitly by the macro, of course. I may put some time towards it, but definitely feel free to give it a shot if you're feeling it as well!
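Written out by hand, the expansion such a macro might produce could look roughly like this. It is a sketch under assumptions: no `superset` macro exists in this thread, so the discriminants and both conversion functions are spelled out manually, and the variant set is the small one used above.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
#[repr(usize)]
pub enum Opcode { ADD, AAA, AAS, SUB, MOV, MOVSX }

pub mod long_mode {
    #[derive(Debug, Clone, Copy, PartialEq)]
    #[repr(usize)]
    pub enum Opcode {
        // Each value is copied from the superset, so holes are preserved.
        ADD = super::Opcode::ADD as usize,
        SUB = super::Opcode::SUB as usize,
        MOV = super::Opcode::MOV as usize,
        MOVSX = super::Opcode::MOVSX as usize,
    }

    impl Opcode {
        pub fn to_superset(self) -> super::Opcode {
            // SAFETY: both enums are #[repr(usize)] and every discriminant of
            // this enum is, by construction, a discriminant of the superset.
            unsafe { core::mem::transmute::<Opcode, super::Opcode>(self) }
        }

        /// Going the other way has to be checked: not every superset
        /// opcode exists in this mode.
        pub fn from_superset(opcode: super::Opcode) -> Option<Self> {
            match opcode {
                super::Opcode::ADD => Some(Opcode::ADD),
                super::Opcode::SUB => Some(Opcode::SUB),
                super::Opcode::MOV => Some(Opcode::MOV),
                super::Opcode::MOVSX => Some(Opcode::MOVSX),
                _ => None,
            }
        }
    }
}
```

A plain `match` in `to_superset` would avoid `unsafe` entirely; the transmute is kept here only because it is the approach being discussed.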
in case you're still watching this, i did finally give this a shot - https://github.com/iximeow/yaxpeax-x86/commit/354df90573693ca70de72705b6a77b4e02b53f01 is the current (still not a full change set) approach. this adds a new `x86_generic` where the specific modes can be converted up to the generic one. then for `Opcode`, the most verbose of all this stuff, `Display`, `Colorize`, mnemonics, etc, are implemented in terms of the generic `Opcode`. it also comes with (currently architecture-specific, not sure it has to be) codegen for the "decoding as if this is a specific microarchitecture" feature. that being duplicated for each mode is a pretty substantial portion of the code that's in that diff.
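A rough picture of "implemented in terms of the generic `Opcode`", without claiming this matches the linked commit: the mnemonic lookup and `Display` live once on the generic enum, and a mode-specific enum just converts up and delegates. The variant lists and the `name` helper are made up for the example.

```rust
use core::fmt;

mod x86_generic {
    #[derive(Clone, Copy)]
    pub enum Opcode { ADD, AAA, MOVSX }

    impl Opcode {
        /// The only place mnemonics are spelled out.
        pub fn name(self) -> &'static str {
            match self {
                Opcode::ADD => "add",
                Opcode::AAA => "aaa",
                Opcode::MOVSX => "movsx",
            }
        }
    }
}

impl fmt::Display for x86_generic::Opcode {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.name())
    }
}

mod long_mode {
    #[derive(Clone, Copy)]
    pub enum Opcode { ADD, MOVSX }
}

// Convert the mode-specific opcode up to the generic one.
impl From<long_mode::Opcode> for x86_generic::Opcode {
    fn from(op: long_mode::Opcode) -> Self {
        match op {
            long_mode::Opcode::ADD => x86_generic::Opcode::ADD,
            long_mode::Opcode::MOVSX => x86_generic::Opcode::MOVSX,
        }
    }
}

// The mode-specific Display is a one-line delegation.
impl fmt::Display for long_mode::Opcode {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        fmt::Display::fmt(&x86_generic::Opcode::from(*self), f)
    }
}
```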
(then there's a fair question of "why generate it with python instead of a proc macro or build.rs?", and the answer is a moral opposition to build-time codegen if it's not necessary. debugging a proc macro is really annoying and i don't like asking people to run build.rs scripts. so, generate when it's updated and commit it. very gopher brain of me. sorry to the rustaceans.)
there's a bit more on top of this commit that i've yet to get to a point i want to push, but i'm convinced that this gets us to a point where `yaxpeax-x86` has a useful generic "just try your best" mode without being too much overhead.
i also have a sneaking suspicion that even with the extra source lines, this might reduce the total resulting size of the compiled crate with more than one architecture included. with `Opcode` unified as it is, there might even be a good chance to unify some of the decode tables.
This is a follow-up to #19.
These enums and structures are mostly identical across all three processor modes, and it is useful to combine these for writing code that is generic to all three modes. In order to access these common fields, a new trait `X86Instruction` (open for naming suggestions) has been added to provide access to these fields.
The trait is kind of janky to use as of now: you must declare the bound with a `where` clause: