Open chostelet opened 6 years ago
Update: For the PIC18, I resolved this display issue by setting 'InstructionBitSize=8; ' in Architecture. However I foudn an other issue with the Memory View update. Shall open an other issue tpic.
I leave this current issue for possible discussion around the need for characterizing the processor memory spaces to get clever raw format loader(s), memory viewer and disassembly viewer tailored to memory segments, etc...
To paraphrase Fermat: "I have some comments on this topic, but typing them on a smartphone is too tedious. " I hope to be back in town on Sunday if the Siberian winds cooperate.
Thanks for taking the time to look at this. Here are some of my thoughts.
The "stutter" at the end of the segment may be caused the PIC18 disassembler not behaving according to "spec". Given that the "spec" is not explcitly written anywhere, this may be an easy thing to fix. I will reach out to @chostelet separately to see if we can make this work correctly.
Rather than extending IProcessorArchitecture
, it seems more appropriate to let the different loaders indicate the byte granularities / alignments of each loaded segment. The the raw image loader falls back to a granularity specified by the IProcessorArchitecture
. This way, if you have an arcane file format, the image loader can generate ImageSegments
with the requisite granularities. I think this should address your immediate concerns.
However, this issue dovetails in with a separate but related issue I've been mulling over for a while. I have in my "Zoo" of binaries an example where the executable file is a real-mode MS-DOS binary, but in which there are "islands" of code which are written in a custom FORTH-like language. In those islands, not only is the IProcessorArchitecture
completely different (FORTH instead of x86) but also the granularity of the "islands" is different from that of the main executable.
As an additional example, the vast majority of C64 games were written as a small BASIC stubs that could look something like:
10 SYS 2063
where 2063 is the (decimal) address of the 6510 machine code of the actual game. Currently Reko has no way of handling transitions from the CBM-64 BASIC "architecture" to 6510.
In a similar vein, the RTOSDemo
sample in the Reko regression suite is an ELF image that specifies the
ARM architecture, but most of the exported functions symbols have an odd LSB which means that they should be treated as ARM THUMB functions. RTOSDemo currently is failing massively because Reko's inability to handle this.
To get this working properly for the use cases we have identified, (at least) the following things have to occur:
Program
has an Architecture
property which indicates the default architecture of the current decompiled program if nothing else is specified.Program
needs a Dictionary<string,IProcessorArchitecture>
that keeps track of all architectures Reko is aware of in the binary. The string
is a short yet human readable name. In the ARM example above, you'd see one dictionary entry labelled "ARM Thumb" and the other "ARM"; the CBM-64 example would be "CBM-64 Basic" / "MOS 6502". The user should be able to add arbitrary new architectures as needed. This dictionary needs to saved in the project file format.ImageSegment
(which are created by ImageLoader
s) should have an Architecture
property which, when specified, should override the Architecture
property of Program
. ImageSegment
should in addition have a Granularity
property which specifies the granularity in bits. ImageSymbol
(which are also created by the ImageLoader
s) should also have an Architecture
property which, when specified, should override the Architecture
of the image segment the symbol is located in. This would allow the ELF ARM loader to specify the ARM THUMB for symbols whose addresses are odd-numbered.ImageMapItem
class should also have a Granularity
and Architecture
property. When ImageMapItem
s are created these properties get their values from the ImageSegment
, but may be overwritten as a result of scanning ImageSymbol
s with different architectures.Mark as Procedure
there should be a way to override the default Architecture
in Program
. Perhaps it should pop up a new "Mark as Procedure" dialog which uses defaults taken from the ImageSegment
that was clicked on but which allows the user to override these at will.Reko.Scanning
classes must respect this scheme of overrides. Specifically
Architecture
specified by its 'seeds'
(the ImageSymbol
s), which may not necessarily be the same architecture
as the one in Program
.Architecture
specified by the ImageSegment
unless overridden
by an ImageSymbol
or a user-specified procedure.Mark as Procedure
dialog.Scanner
that it should transition to a different architecture. This is a rough draft, but clearly the amount of work involved is large enough that I'd want to embark on it on a separate branch and perhaps track it as a project. I will leave off here and recover from jet-lag, but I encourage discussion on the topic.
Hello As noted above, I've resolved the case of PIC18 by stating it was byte-aligned architecture as well as creating an (invalid) unaligned instruction in case the user selects an odd address.
Now just to add some more tricky things to the discussion on memory segment characteristics, I'd like to highlight that for the PIC processors, Microchip is providing two items for each segment. The Location size and the Word size. Here are the characteristics of the PIC18 and PIC16:
<ArchDef name="18xxxx">
<CodeMem locsize="2" wordsize="2"/>
<ExtCodeMem locsize="2" wordsize="2"/>
<BackgroundDebugMem locsize="2" wordsize="2"/>
<TestMem locsize="2" wordsize="2"/>
<UserIDMem locsize="1" wordsize="1"/>
<ConfigFuseMem locsize="1" wordsize="1"/>
<ConfigWORMMem locsize="1" wordsize="1"/>
<DeviceIDMem locsize="1" wordsize="1"/>
<DataMem locsize="1" wordsize="1"/>
</ArchDef>
<ArchDef name="16xxxx">
<CodeMem locsize="1" wordsize="2"/>
<CalDataMem locsize="1" wordsize="2"/>
<BackgroundDebugMem locsize="1" wordsize="2"/>
<TestMem locsize="1" wordsize="2"/>
<UserIDMem locsize="1" wordsize="2"/>
<ConfigFuseMem locsize="2" wordsize="2"/>
<DeviceIDMem locsize="1" wordsize="2"/>
<DataMem locsize="1" wordsize="1"/>
</ArchDef>
This means, for example with the PIC18, while the Code segment contains 2-byte datum (instructions), the adresses are considered byte-addresses. Contrary to the PIC16 , whose Code segment contains 2-byte datum (in fact 12 or 14 bit-wide datum) and the adresses are considered word-addresses. This has an influence both on the way the memory view should present the data as well as on the disassembly output - sequential addresses in a Code segment are incremented by 1, even if this Code segment contains words.
If I understand this correctly, it's similar to ancient 36-bit machines (in the days before the IBM 360 dictated that the smallest addressable memory unit is an 8-bit byte/octet). As an example, the base model PDP-10 had an address space of 2^18 words, each of which was 36 bits. To read or write bytes -- which on these machines were not octets but could be anything from 1 to 36 bits! -- you had to use special instructions. The DPB
(deposit byte) expression of the Reko IR is named after one of those special instructions.
Currently the ImageReader
and ImageWriter
classes assume that all memory is byte addressable; the majority of architectures you're likely to encounter are indeed byte addressable. To handle scenarios like PIC16 or PDP-10, where the granularity of memory is different, I suggest for now to create separate subclasses Pic16ImageReader
and Pdp10ImageReader
, which can map addresses to 14-bit / 36-bit data units respectively. ImageReaders
and ImageWriters
are currently created from the IProcessorArchitecture
because they have to respect the endianness of the processor. Perhaps the IProcessorArchitecture.CreateImageReader
and CreateImageWriter
methods will need a new granularity
parameter.
To render memory areas properly, the MemoryControl
and the Dumper
class will have to define an interaction model so that they respect the granularity of the memory areas they are viewing.
Not sure if this is similar to PDP-10 as even if only 14 bits are significant to the PIC16, the location/word sizes remain multiple of 8-bit bytes. Simply the high order two bits of each program word are ignored. In peculiar there is no packing of 14-bit words into multiple bytes in the Intel Hex image. A 14-bit word from program region is represented as 2 bytes (little endian). Again this discussion comes around the IProcessorArchitecture interface which would need to provide those granularity (location/word size) information, and more generally the structural memory information of the target processor.
When I'll have more leisure time, I'll have a look at MemoryControl and Dumper classes to see they can have PIC16/PIC18 flavored versions. For now I'm starting to implement the PIC16 to see how far we get commonalities with PIC18.
Think of the ImageReader/Writer
s as abstraction layers where calling code is decoupled from how exactly the bits are stored. A Pic16ImageReader
could store 14-bit words in little-endian format consecutive pairs of bytes, and the caller wouldn't be the wiser. This class could expose a special ReadLeUInt14()
if you want to be absolutely sure that the top bits are clear, or the calling code (Pic16Disassembler
and MemoryControl
+ Dumper
) calls the existing ReadLeUInt16()
method.
The point is, there should hopefully be no need for MemoryControl
and Dumper
to be specialized for each architecture. Instead, I think the IProcessorArchitecture
to provide a Dictionary
of possible valid representation formatter methods -- of which one is the default -- and then have the MemoryControl
or Dumper
call the corresponding formatter with the granularity set on the ImageSegment
.
I'll have indeed to think further! In that case, this PICImageReader/Writer will have to be "informed" of the characteristics (traits) of the memory areas (addresses range) it is reading/writing. As shown above all but the last entry (DataMem) are descriptors of program memory of those PIC. A binary image (IntelHEX) could then contain any of those segments. Not sure the decision of interpreting those traits ("should I return a byte, a 12-bit word, a 14-bit, what else?") could be at this low level.
Hello, I'm having some issues with the Disassembly view as show in attached screen shots. It looks like this view is enforcing word-alignment per processor architecture, and does not accept/expect (pseudo-)instructions with odd addresses. First example is showing a Data EEPROM segment (byte-sized, byte-aligned, odd number of bytes). The last line is spurious.
Second example, probably more obvious) shows the same memory segment but with last byte (at even address) being selected. Disassembly is wrong (repeated lines again)
Third example is showing an other memory segment (the configuration fuses, byte-sized, byte-aligned) for which, depending on the byte being selected in Memory View, the disassembly view is wrong too.
Selecting even-address byte:
Selecting odd-address byte.
I fear the Disassembly view is looking (for once and only) at the processor alignment (Architecture -> InstructionBitSize?) while, on contrary, some loaded segments can represent byte-aligned datum and others segments can represent word-aligned datum.
As discussed before, the Architecture should be amended to address the need for information describing the processor's memory spaces.