Disassembly view showing wrong lines

chostelet commented 6 years ago

Hello, I'm having some issues with the Disassembly view as show in attached screen shots. It looks like this view is enforcing word-alignment per processor architecture, and does not accept/expect (pseudo-)instructions with odd addresses. First example is showing a Data EEPROM segment (byte-sized, byte-aligned, odd number of bytes). The last line is spurious.

Second example, probably more obvious) shows the same memory segment but with last byte (at even address) being selected. Disassembly is wrong (repeated lines again)

Third example is showing an other memory segment (the configuration fuses, byte-sized, byte-aligned) for which, depending on the byte being selected in Memory View, the disassembly view is wrong too.

Selecting even-address byte:

Selecting odd-address byte.

I fear the Disassembly view is looking (for once and only) at the processor alignment (Architecture -> InstructionBitSize?) while, on contrary, some loaded segments can represent byte-aligned datum and others segments can represent word-aligned datum.

As discussed before, the Architecture should be amended to address the need for information describing the processor's memory spaces.

chostelet commented 6 years ago

Update: For the PIC18, I resolved this display issue by setting 'InstructionBitSize=8; ' in Architecture. However I foudn an other issue with the Memory View update. Shall open an other issue tpic.

I leave this current issue for possible discussion around the need for characterizing the processor memory spaces to get clever raw format loader(s), memory viewer and disassembly viewer tailored to memory segments, etc...

uxmal commented 6 years ago

To paraphrase Fermat: "I have some comments on this topic, but typing them on a smartphone is too tedious. " I hope to be back in town on Sunday if the Siberian winds cooperate.

uxmal commented 6 years ago

Thanks for taking the time to look at this. Here are some of my thoughts.

The "stutter" at the end of the segment may be caused the PIC18 disassembler not behaving according to "spec". Given that the "spec" is not explcitly written anywhere, this may be an easy thing to fix. I will reach out to @chostelet separately to see if we can make this work correctly.

Rather than extending IProcessorArchitecture, it seems more appropriate to let the different loaders indicate the byte granularities / alignments of each loaded segment. The the raw image loader falls back to a granularity specified by the IProcessorArchitecture. This way, if you have an arcane file format, the image loader can generate ImageSegments with the requisite granularities. I think this should address your immediate concerns.

However, this issue dovetails in with a separate but related issue I've been mulling over for a while. I have in my "Zoo" of binaries an example where the executable file is a real-mode MS-DOS binary, but in which there are "islands" of code which are written in a custom FORTH-like language. In those islands, not only is the IProcessorArchitecture completely different (FORTH instead of x86) but also the granularity of the "islands" is different from that of the main executable.

As an additional example, the vast majority of C64 games were written as a small BASIC stubs that could look something like:

10 SYS 2063

where 2063 is the (decimal) address of the 6510 machine code of the actual game. Currently Reko has no way of handling transitions from the CBM-64 BASIC "architecture" to 6510.

In a similar vein, the RTOSDemo sample in the Reko regression suite is an ELF image that specifies the ARM architecture, but most of the exported functions symbols have an odd LSB which means that they should be treated as ARM THUMB functions. RTOSDemo currently is failing massively because Reko's inability to handle this.

To get this working properly for the use cases we have identified, (at least) the following things have to occur:

Like today, Program has an Architecture property which indicates the default architecture of the current decompiled program if nothing else is specified.
Program needs a Dictionary<string,IProcessorArchitecture> that keeps track of all architectures Reko is aware of in the binary. The string is a short yet human readable name. In the ARM example above, you'd see one dictionary entry labelled "ARM Thumb" and the other "ARM"; the CBM-64 example would be "CBM-64 Basic" / "MOS 6502". The user should be able to add arbitrary new architectures as needed. This dictionary needs to saved in the project file format.
Each ImageSegment (which are created by ImageLoaders) should have an Architecture property which, when specified, should override the Architecture property of Program.
Each ImageSegment should in addition have a Granularity property which specifies the granularity in bits.
Each ImageSymbol (which are also created by the ImageLoaders) should also have an Architecture property which, when specified, should override the Architecture of the image segment the symbol is located in. This would allow the ELF ARM loader to specify the ARM THUMB for symbols whose addresses are odd-numbered.
The ImageMapItem class should also have a Granularity and Architecture property. When ImageMapItems are created these properties get their values from the ImageSegment, but may be overwritten as a result of scanning ImageSymbols with different architectures.
When the user right-clicks on a chunk of memory in the memory viewer and selects Mark as Procedure there should be a way to override the default Architecture in Program. Perhaps it should pop up a new "Mark as Procedure" dialog which uses defaults taken from the ImageSegment that was clicked on but which allows the user to override these at will.
The Reko.Scanning classes must respect this scheme of overrides. Specifically
- The recursive parser uses the Architecture specified by its 'seeds' (the ImageSymbols), which may not necessarily be the same architecture as the one in Program.
- The shingle scanner respectes the Architecture specified by the ImageSegment unless overridden by an ImageSymbol or a user-specified procedure.
- Any user-specified procedures must use the values entered by the user in the Mark as Procedure dialog.
A mechanism for rewriters to indicate to the Scanner that it should transition to a different architecture.
Memory viewers and dumpers must respect granularities and architectures.

This is a rough draft, but clearly the amount of work involved is large enough that I'd want to embark on it on a separate branch and perhaps track it as a project. I will leave off here and recover from jet-lag, but I encourage discussion on the topic.

chostelet commented 6 years ago

Hello As noted above, I've resolved the case of PIC18 by stating it was byte-aligned architecture as well as creating an (invalid) unaligned instruction in case the user selects an odd address.

Now just to add some more tricky things to the discussion on memory segment characteristics, I'd like to highlight that for the PIC processors, Microchip is providing two items for each segment. The Location size and the Word size. Here are the characteristics of the PIC18 and PIC16:

  <ArchDef name="18xxxx">
      <CodeMem          locsize="2" wordsize="2"/>
      <ExtCodeMem       locsize="2" wordsize="2"/>
      <BackgroundDebugMem   locsize="2" wordsize="2"/>
      <TestMem          locsize="2" wordsize="2"/>
      <UserIDMem        locsize="1" wordsize="1"/>
      <ConfigFuseMem        locsize="1" wordsize="1"/>
      <ConfigWORMMem        locsize="1" wordsize="1"/>
      <DeviceIDMem      locsize="1" wordsize="1"/>
      <DataMem          locsize="1" wordsize="1"/>
  </ArchDef>

  <ArchDef name="16xxxx">
      <CodeMem          locsize="1" wordsize="2"/>
      <CalDataMem       locsize="1" wordsize="2"/>
      <BackgroundDebugMem   locsize="1" wordsize="2"/>
      <TestMem          locsize="1" wordsize="2"/>
      <UserIDMem        locsize="1" wordsize="2"/>
      <ConfigFuseMem        locsize="2" wordsize="2"/>
      <DeviceIDMem      locsize="1" wordsize="2"/>
      <DataMem          locsize="1" wordsize="1"/>
  </ArchDef>

This means, for example with the PIC18, while the Code segment contains 2-byte datum (instructions), the adresses are considered byte-addresses. Contrary to the PIC16 , whose Code segment contains 2-byte datum (in fact 12 or 14 bit-wide datum) and the adresses are considered word-addresses. This has an influence both on the way the memory view should present the data as well as on the disassembly output - sequential addresses in a Code segment are incremented by 1, even if this Code segment contains words.

uxmal commented 6 years ago

If I understand this correctly, it's similar to ancient 36-bit machines (in the days before the IBM 360 dictated that the smallest addressable memory unit is an 8-bit byte/octet). As an example, the base model PDP-10 had an address space of 2^18 words, each of which was 36 bits. To read or write bytes -- which on these machines were not octets but could be anything from 1 to 36 bits! -- you had to use special instructions. The DPB (deposit byte) expression of the Reko IR is named after one of those special instructions.

Currently the ImageReader and ImageWriter classes assume that all memory is byte addressable; the majority of architectures you're likely to encounter are indeed byte addressable. To handle scenarios like PIC16 or PDP-10, where the granularity of memory is different, I suggest for now to create separate subclasses Pic16ImageReader and Pdp10ImageReader, which can map addresses to 14-bit / 36-bit data units respectively. ImageReaders and ImageWriters are currently created from the IProcessorArchitecture because they have to respect the endianness of the processor. Perhaps the IProcessorArchitecture.CreateImageReader and CreateImageWriter methods will need a new granularity parameter.

To render memory areas properly, the MemoryControl and the Dumper class will have to define an interaction model so that they respect the granularity of the memory areas they are viewing.

chostelet commented 6 years ago

Not sure if this is similar to PDP-10 as even if only 14 bits are significant to the PIC16, the location/word sizes remain multiple of 8-bit bytes. Simply the high order two bits of each program word are ignored. In peculiar there is no packing of 14-bit words into multiple bytes in the Intel Hex image. A 14-bit word from program region is represented as 2 bytes (little endian). Again this discussion comes around the IProcessorArchitecture interface which would need to provide those granularity (location/word size) information, and more generally the structural memory information of the target processor.

When I'll have more leisure time, I'll have a look at MemoryControl and Dumper classes to see they can have PIC16/PIC18 flavored versions. For now I'm starting to implement the PIC16 to see how far we get commonalities with PIC18.

uxmal commented 6 years ago

Think of the ImageReader/Writers as abstraction layers where calling code is decoupled from how exactly the bits are stored. A Pic16ImageReader could store 14-bit words in little-endian format consecutive pairs of bytes, and the caller wouldn't be the wiser. This class could expose a special ReadLeUInt14() if you want to be absolutely sure that the top bits are clear, or the calling code (Pic16Disassembler and MemoryControl + Dumper) calls the existing ReadLeUInt16() method.

The point is, there should hopefully be no need for MemoryControl and Dumper to be specialized for each architecture. Instead, I think the IProcessorArchitecture to provide a Dictionary of possible valid representation formatter methods -- of which one is the default -- and then have the MemoryControl or Dumper call the corresponding formatter with the granularity set on the ImageSegment.

chostelet commented 6 years ago

I'll have indeed to think further! In that case, this PICImageReader/Writer will have to be "informed" of the characteristics (traits) of the memory areas (addresses range) it is reading/writing. As shown above all but the last entry (DataMem) are descriptors of program memory of those PIC. A binary image (IntelHEX) could then contain any of those segments. Not sure the decision of interpreting those traits ("should I return a byte, a 12-bit word, a 14-bit, what else?") could be at this low level.

uxmal / reko

Disassembly view showing wrong lines #576