(finally) land several often-applicable decode optimizations

PR on github for visibility, but i'll be merging this shortly after clicking Create pull request.

changes described in the changelog:

* optimizations (mostly code motion) for hot codepaths
  - large `match`-based decode tables have been outlined to 256-entry arrays.
    this makes for slightly nicer inlining in `read_with_annotations`.
  - vex/evex decoding in 64-bit decoding now shares more code. this seems to
    aid code cache friendliness when prefixes must be read.
  - added a fast path for operand reading for the more-likely cases of
    [64-bit]: {0x66,rex}{<opcode>,0x0f-<opcode>}
    [32-bit]: {0x66}{<opcode>,0x0f-<opcode>}
    [16-bit]: {0x66}{<opcode>,0x0f-<opcode>}

    in particular, this avoids checking for instruction length overflows and
    some bounds checks when we aren't handling a pessimal case of many-prefixed
    instructions. if an instruction has multiple prefixes, decoders fall back
    to normal read-in-a-loop-until-length-limit-reached decoding.
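
to make the fast-path/fallback split above concrete, here is a minimal sketch for the 64-bit case only. it is not the code in this PR: `Decoded`, `Path`, `decode_fast`, `decode_slow` and `decode` are hypothetical names, it stops at the opcode byte, and it ignores vex/evex entirely. the idea it shows is one up-front length check covering the longest fast-path shape (`66 rex 0f <opcode>`), with anything that still looks like a prefix handed to a loop that checks the length limit on every byte.

```rust
// hypothetical sketch of a prefix fast path with a loop fallback (64-bit only).
// all names here are invented for illustration, not yaxpeax-x86's API.
const MAX_INSTRUCTION_LEN: usize = 15;

#[derive(Debug, PartialEq, Eq)]
enum Path { Fast, Slow }

#[derive(Debug, PartialEq, Eq)]
struct Decoded {
    path: Path,
    operand_size_override: bool, // saw 0x66
    rex: Option<u8>,             // rex byte directly before the opcode, if any
    two_byte: bool,              // opcode was reached through the 0x0f escape
    opcode: u8,
    len: usize,                  // bytes consumed through the opcode byte
}

fn is_legacy_prefix(b: u8) -> bool {
    matches!(b, 0x26 | 0x2e | 0x36 | 0x3e | 0x64 | 0x65 | 0x66 | 0x67 | 0xf0 | 0xf2 | 0xf3)
}

fn is_rex(b: u8) -> bool { b & 0xf0 == 0x40 }

// handles only {0x66,rex}{<opcode>,0x0f-<opcode>}; returns None for anything else.
fn decode_fast(bytes: &[u8]) -> Option<Decoded> {
    // the only length check: four bytes cover `66 rex 0f <opcode>`.
    let window = match bytes {
        [a, b, c, d, ..] => [*a, *b, *c, *d],
        _ => return None,
    };
    let mut i = 0;
    let operand_size_override = window[i] == 0x66;
    if operand_size_override { i += 1; }
    let rex = if is_rex(window[i]) { i += 1; Some(window[i - 1]) } else { None };
    // another prefix here means the many-prefixed case: leave it to the loop.
    if is_legacy_prefix(window[i]) || is_rex(window[i]) { return None; }
    let two_byte = window[i] == 0x0f;
    if two_byte { i += 1; }
    Some(Decoded { path: Path::Fast, operand_size_override, rex, two_byte, opcode: window[i], len: i + 1 })
}

// read-in-a-loop-until-length-limit-reached: checks bounds and length per byte.
fn decode_slow(bytes: &[u8]) -> Option<Decoded> {
    let mut i = 0;
    let mut operand_size_override = false;
    let mut rex = None;
    loop {
        if i >= MAX_INSTRUCTION_LEN { return None; }
        let b = *bytes.get(i)?;
        if is_legacy_prefix(b) {
            operand_size_override |= b == 0x66;
            rex = None; // a rex that is not directly before the opcode is ignored
        } else if is_rex(b) {
            rex = Some(b);
        } else {
            break;
        }
        i += 1;
    }
    let two_byte = *bytes.get(i)? == 0x0f;
    if two_byte {
        i += 1;
        if i >= MAX_INSTRUCTION_LEN { return None; }
    }
    let opcode = *bytes.get(i)?;
    Some(Decoded { path: Path::Slow, operand_size_override, rex, two_byte, opcode, len: i + 1 })
}

fn decode(bytes: &[u8]) -> Option<Decoded> {
    decode_fast(bytes).or_else(|| decode_slow(bytes))
}
```

in this sketch `decode(&[0x66, 0x48, 0x0f, 0x6e, 0xc0])` stays on the fast path, while `decode(&[0xf3, 0x48, 0x0f, 0xb8, 0xc1])` falls back to the loop because its first byte is a prefix other than 0x66. buffers shorter than four bytes also take the loop, which is a simplification of the sketch rather than a claim about the real decoder.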

i'd actually found these were useful optimizations for the 64-bit decoder early in the year, but they became increasingly encumbered with "one more thing" to the point that i never landed them. i'd also, in the process, forgotten to actually publish yaxpeax-x86 1.1.5. so i'm cleaning up this long-outstanding work, and will merge, then publish shortly after.

i'm not actually sure if these optimizations help as much in the 32-bit or 16-bit decoders. the LUT-for-bank-lookup change almost certainly does not. others, like a fast path to bypass the decode loop, probably do help a bit. i have not measured these and do not plan to. my priority for 32-bit and 16-bit decoders is to keep them substantially similar to the 64-bit decoder, as i'm optimistic this substantially-similar code can be written with less... almost-duplication.
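
for anyone unfamiliar with the `match`-to-array change from the changelog, the general shape of that transformation is roughly the following. this is a toy sketch, not the tables in this PR: `OperandCode`, `lookup_match`, `OPCODE_TABLE` and `lookup_table` are hypothetical, and the handful of opcodes listed are only there to give the `match` some arms.

```rust
// hypothetical sketch: a large `match` outlined into a 256-entry array.
// the enum and the opcode subset are illustrative, not yaxpeax-x86's tables.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum OperandCode {
    Nothing,  // no operands to read
    ModRm,    // operands come from a modrm byte
    Imm8,     // one immediate byte
    Unlisted, // not covered by this toy table
}

// "before": the decode table is a match. when this gets inlined into a caller,
// the whole decision tree or jump table comes with it.
const fn lookup_match(opcode: u8) -> OperandCode {
    match opcode {
        0x00..=0x03 => OperandCode::ModRm, // add with modrm operands
        0x04 => OperandCode::Imm8,         // add al, imm8
        0x90 => OperandCode::Nothing,      // nop
        0xc3 => OperandCode::Nothing,      // ret
        _ => OperandCode::Unlisted,
    }
}

// build the array at compile time from the match, so the two cannot drift apart.
const fn build_table() -> [OperandCode; 256] {
    let mut table = [OperandCode::Unlisted; 256];
    let mut i = 0;
    while i < 256 {
        table[i] = lookup_match(i as u8);
        i += 1;
    }
    table
}

static OPCODE_TABLE: [OperandCode; 256] = build_table();

// "after": the hot path is a single indexed load. a u8 index into a 256-entry
// array is always in bounds, so no bounds check is needed.
fn lookup_table(opcode: u8) -> OperandCode {
    OPCODE_TABLE[opcode as usize]
}
```

the point of the sketch is only that the call site shrinks to a load from a static, which is friendlier to inline than a large `match`. whether that is a win depends on the caller, which is consistent with not expecting much from it in the 32-bit and 16-bit decoders.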

iximeow / yaxpeax-x86

(finally) land several often-applicable decode optimizations #27