nemerle / dcc

This is a heavily updated version of the old DOS executable decompiler DCC
GNU General Public License v2.0

Long signatures #20

Open lab313ru opened 8 years ago

lab313ru commented 8 years ago

I have read readsig.txt and found that signatures are currently 23 bytes long. Is that true? If so, is it possible to create longer signatures? And why 23? How many signatures does DCC miss because of this (and because of collisions)?

uxmal commented 8 years ago

The DCCS file format hard-wires it to 23 bytes. Sprinkled in the DCC code you find:

#define  PATLEN 23

Goodness knows why the original authors chose that size; it's likely to avoid excessive file sizes on a project that may have been developed on an x86 MS-DOS environment.

You could try changing PATLEN to a larger number and then running makedsig on a *.LIB file, but the resulting signature file will be incompatible with older DCCS files.
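To get a feel for how much a longer PATLEN would help, one could count how many functions share the same leading bytes. The following is a minimal sketch, not DCC code: `find_collisions`, the `functions` input and the zero-padding rule for short functions are all assumptions made for illustration, and extracting the code bytes from a LIB file is not shown.

```python
from collections import defaultdict

PATLEN = 23  # the value hard-wired in DCC, per the comment above


def find_collisions(functions, patlen=PATLEN):
    """Group function names by the first `patlen` bytes of their code.

    `functions` maps a name to its raw code bytes (hypothetical input).
    Code shorter than `patlen` is zero-padded; that rule is an assumption.
    """
    groups = defaultdict(list)
    for name, code in functions.items():
        key = code[:patlen].ljust(patlen, b"\x00")
        groups[key].append(name)
    # Only groups with more than one name are collisions.
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Made-up sample: func_a and func_b only differ after byte 23.
common = bytes(range(23))
sample = {
    "func_a": common + b"\x01\xc3",
    "func_b": common + b"\x02\xc3",
    "func_c": b"\x55\x8b\xec\xc3",
}
```

Calling `find_collisions(sample)` and then `find_collisions(sample, patlen=29)` shows whether a larger pattern length separates the two functions.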

lab313ru commented 8 years ago

What I want: to regenerate new libs that include all functions and have almost no collisions. Then I want to apply them in IDA.

uxmal commented 8 years ago

You'll need access to the original LIB files from the various compilers to accomplish this, since only the first 23 bytes are available in the existing DCCS files.

lab313ru commented 8 years ago

I understand.

Another question, about the signature models: dccb3c.sig, dccb3l.sig

"l" and "c": what's the difference?

uxmal commented 8 years ago

The naming convention used here is: dcc<v><n><m>, where <v> is the vendor, <n> is the version and <m> is the memory model.

lab313ru commented 8 years ago

OK. Suppose there is some compiler lib file. Which model will be selected, and which criteria are used when naming it? Is there some memory model stored inside the lib?

uxmal commented 8 years ago

When generating the signatures using makedsig, the user herself has to know what vendor, version and model the LIB file was compiled with.

lab313ru commented 8 years ago

But makedsig only asks for the lib name as a parameter.

uxmal commented 8 years ago

If you look at makedsig.cpp, you'll find the usage:

"This program is to make 'signatures' of known c and tpl library calls for the dcc program.\n"
"It needs as the first arg the name of a library file, and as the second arg, the name "
"of the signature file to be generated.\n"
"Example: makedsig CL.LIB dccb3l.sig\n"
"      or makedsig turbo.tpl dcct4p.sig\n"

So it's the user's responsibility to provide a correct file name for the .sig file.
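Since the file name is the only place where this information lives, a tool consuming .sig files has to take the name apart. A minimal sketch of that, assuming each of the three fields in dcc<v><n><m> is a single character (which is only inferred from the two example names in the usage text); `split_sig_name` is a hypothetical helper, not part of dcc or makedsig:

```python
import re

# dcc + one vendor letter + one version digit + one model letter + .sig
SIG_NAME = re.compile(r"^dcc(?P<v>[a-z])(?P<n>\d)(?P<m>[a-z])\.sig$", re.IGNORECASE)


def split_sig_name(filename):
    """Return the (vendor, version, model) characters of a .sig file name."""
    m = SIG_NAME.match(filename)
    if m is None:
        raise ValueError(f"not a dcc signature file name: {filename!r}")
    return m.group("v").lower(), m.group("n"), m.group("m").lower()
```

What each letter stands for is not decided here; the function only splits the name.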

lab313ru commented 8 years ago

Ah, I see. dcc selects the correct sig file, and I should provide the correct file name.

nemerle commented 8 years ago

Exactly so.

AFAIK there is no identification information contained inside lib/tpl files.

uxmal commented 8 years ago

Reko will probably use a variant of this scheme, but the mapping of signature files may be happening in the configuration file to avoid dependencies on the naming of the signature files themselves.

nemerle commented 8 years ago

Yup, the format of signature files could be made a bit more robust:

{
    "Vendor": "Borland",
    "CompilerName": "TurboC 3.0",
    "Language": "C",
    "Version": "3.0",
    "SignatureBlocks": [{
        "Model": "Large",
        "SigLength": 29,
        "Signatures": []
    }, {
        "Model": "Small",
        "SigLength": 23,
        "Signatures": []
    }]
}

and makedsig could be made to work with this to 'add'/'update' signatures inside these files
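A rough sketch of what such an 'add'/'update' step could look like on the JSON above. The proposal leaves "Signatures" empty, so the entry shape used here (`"Name"` plus a hex `"Pattern"`) is purely an assumption, and `upsert_signature` is a hypothetical helper rather than anything makedsig provides:

```python
import json

DOC = json.loads("""
{
    "Vendor": "Borland",
    "Version": "3.0",
    "SignatureBlocks": [
        {"Model": "Small", "SigLength": 4, "Signatures": []}
    ]
}
""")


def upsert_signature(doc, model, name, pattern_hex):
    """Add or replace one signature in the block for `model`.

    Returns "added" or "updated". A missing block is created with a
    SigLength taken from the first pattern (an assumption).
    """
    for block in doc["SignatureBlocks"]:
        if block["Model"] == model:
            break
    else:
        block = {"Model": model, "SigLength": len(pattern_hex) // 2, "Signatures": []}
        doc["SignatureBlocks"].append(block)
    if len(pattern_hex) != 2 * block["SigLength"]:
        raise ValueError("pattern length does not match SigLength of the block")
    for sig in block["Signatures"]:
        if sig["Name"] == name:
            sig["Pattern"] = pattern_hex  # same symbol seen again: overwrite
            return "updated"
    block["Signatures"].append({"Name": name, "Pattern": pattern_hex})
    return "added"
```

Keeping SigLength per block means a pattern of the wrong length is rejected instead of silently truncated.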

lab313ru commented 8 years ago

Makedsig asks me for "Seed:". What is it?

lab313ru commented 8 years ago

And a second question: how do I merge signatures from different lib files?

uxmal commented 8 years ago

Consider using a schema as well, so a JSON parser can identify what kind of data this is:

{
    "$schema":  "urn:executable:signature",
    "Vendor": ....
} 

Merging signatures from different lib files should be done by the relevant decompilers when they "ingest" the JSON described above. I.e. there should be a function LoadSignaturesFromFiles: list<filename> => internal-signature-representation that collects all relevant metadata and "cooks" it as appropriate.

This work is underway on the Reko project: there are at least three signature file formats that Reko is aware of, and I'm making it so that they all get unified internally. It would be cool if dcc and Reko could interoperate on this level.
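A minimal sketch of the LoadSignaturesFromFiles idea, assuming the JSON layout proposed earlier in this thread plus the "$schema" marker. The `"Name"`/`"Pattern"` entry shape and the choice of (vendor, version, model) as the grouping key are assumptions for illustration; this is not how Reko or dcc actually implement it.

```python
import json
from pathlib import Path

SCHEMA = "urn:executable:signature"


def merge_signature_docs(docs):
    """Merge parsed JSON documents into {(vendor, version, model): {name: pattern}}."""
    merged = {}
    for doc in docs:
        if doc.get("$schema") != SCHEMA:
            raise ValueError("not a signature document")
        for block in doc.get("SignatureBlocks", []):
            key = (doc.get("Vendor"), doc.get("Version"), block.get("Model"))
            bucket = merged.setdefault(key, {})
            for sig in block.get("Signatures", []):
                bucket[sig["Name"]] = sig["Pattern"]  # later files win
    return merged


def load_signatures_from_files(filenames):
    """Read each file as JSON and merge the results."""
    docs = [json.loads(Path(f).read_text(encoding="utf-8")) for f in filenames]
    return merge_signature_docs(docs)
```

Splitting the file reading from the merging keeps the "cooking" step testable without touching the disk.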

uxmal commented 8 years ago

The DCC signature file format creates a perfect hash. The algorithm they are using requires a random number generator (RNG). The Seed: prompt is asking you for a seed for the RNG. I'm not sure why this is provided explicitly; perhaps it was for making sure, during development, that the hash table was being created correctly and reproducibly. Just enter some number < 32637 and you should be OK.
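To illustrate why a seed makes the result reproducible, here is a toy randomized search for a collision-free hash. It is explicitly not the algorithm DCC uses; `hash_key`, `build_perfect_hash`, the weighted byte sum and the sample keys are all made up for this sketch.

```python
import random


def hash_key(key, weights, table_size):
    """Weighted byte sum modulo the table size."""
    return sum(b * w for b, w in zip(key, weights)) % table_size


def build_perfect_hash(keys, table_size, seed, max_tries=10000):
    """Search for per-position weights that give every key its own slot."""
    rng = random.Random(seed)  # same seed, same sequence of candidate weights
    width = max(len(k) for k in keys)
    for _ in range(max_tries):
        weights = [rng.randrange(1, table_size) for _ in range(width)]
        slots = {hash_key(k, weights, table_size) for k in keys}
        if len(slots) == len(keys):  # no two keys share a slot
            return weights
    raise RuntimeError("no collision-free weights found; try another seed")


# Made-up 3-byte keys standing in for signature patterns.
KEYS = [bytes.fromhex(h) for h in ("558bec", "56578b", "b8004c", "c39090")]
```

Running the search twice with the same seed yields the same weights, which is handy when debugging table construction; a different seed may yield a different but equally valid table.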

uxmal commented 8 years ago

.lib files (and .obj files) are OMF files. Sadly, they have no magic number at the beginning, so you have to depend on file extensions to figure out what's inside. This is why I'm suggesting the $schema above -- so that both humans and computers can figure out the contents of the file.

nemerle commented 8 years ago

Common signature format: agreed - will try to flesh it out and post it here.

uxmal commented 8 years ago

Also, consider looking at the Yara format. It's not JSON, but we could consider making a JSON compatible version.

nemerle commented 8 years ago

John, should we consider other pattern schemes?

Once upon a time I had some fun with an Xbox emulator that used pattern matching to identify SDK functions, and rewrote it to use a pre-generated per-SDK trie (strings with wildcards).
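For readers unfamiliar with the idea, a trie over byte patterns with wildcards can be sketched roughly like this. It is not the emulator's code; `PatternTrie` and the use of `None` as the wildcard marker are choices made for this example.

```python
WILDCARD = None  # matches any byte


class PatternTrie:
    """Trie keyed by byte values (0..255) or WILDCARD."""

    def __init__(self):
        self.root = {}

    def add(self, pattern, name):
        """Insert a pattern: a sequence of ints and/or WILDCARD."""
        node = self.root
        for item in pattern:
            node = node.setdefault(item, {})
        node.setdefault("names", []).append(name)

    def match(self, data, start=0):
        """Names of all patterns that match `data` beginning at `start`."""
        found = []
        stack = [(self.root, start)]
        while stack:
            node, pos = stack.pop()
            found.extend(node.get("names", []))
            if pos >= len(data):
                continue
            # Follow both the exact-byte edge and the wildcard edge.
            for key in (data[pos], WILDCARD):
                child = node.get(key)
                if child is not None:
                    stack.append((child, pos + 1))
        return sorted(found)
```

All patterns are checked in one walk over the input bytes, instead of comparing each pattern separately.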

nemerle commented 8 years ago

As for YARA, I think their pattern matching language is not a very good match for our purposes?

What we might consider is pattern disambiguation by symbol names?

Given two patterns with the same signature:

FuncA:  12 43 65 [xx xx xx xx] 44 55 66 ...    where [xx xx xx xx] is reference to symbol FuncX
FuncB:  12 43 65 [xx xx xx xx] 44 55 66 ...    where [xx xx xx xx] is reference to symbol FuncY

we would be unable to correctly locate those patterns in the binary, but if previously we managed to locate either FuncX, or FuncY then we could use those to augment the pattern matcher?
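A small sketch of that disambiguation step. It assumes, purely for illustration, that the wildcarded 4 bytes hold an absolute little-endian 32-bit address (real x86 code would often use relative or far references), and that `known_symbols` maps already-identified addresses to names; `disambiguate` is a hypothetical helper.

```python
import struct


def disambiguate(image, match_addr, candidates, known_symbols):
    """Pick the candidates whose symbol reference resolves to the expected name.

    `candidates` holds (name, ref_offset, ref_symbol) tuples for patterns
    whose fixed bytes all matched at `match_addr`.
    """
    hits = []
    for name, ref_offset, ref_symbol in candidates:
        # Read the 4 bytes that the pattern treats as a wildcard.
        (target,) = struct.unpack_from("<I", image, match_addr + ref_offset)
        if known_symbols.get(target) == ref_symbol:
            hits.append(name)
    return hits


CANDIDATES = [("FuncA", 3, "FuncX"), ("FuncB", 3, "FuncY")]
KNOWN = {0x100: "FuncX", 0x200: "FuncY"}
```

If neither FuncX nor FuncY has been located yet, the result is empty and the match stays ambiguous until a later pass.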

uxmal commented 8 years ago

Reko uses another signature format, provided by @halsten, for identifying packers and unpackers. It is again different:

<SIGNATURES>
  <ENTRY>
    <NAME>Microsoft Visual C++ 7</NAME>
    <COMMENTS />
    <ENTRYPOINT>????4100000000000000630000000000??00??????????00??00??????????????????????????????????00??00??00??????????????????????????????00????20????00??00??????????????00??????????????????????00??00??????00??????????????00??00??00??00??00??00??00??00??00??00??00??00??????00??00??00??00??00??00??00??????00??00??00??00??00??00??00??????00??00??00??00??00??00??00??????00??00??00??00??00??00??00??????00??00??00??00??00??00??00??????????????????????????????00??????????????????????00??00??00??????00??00??00??00??00??00</ENTRYPOINT>
    <ENTIREPE />
  </ENTRY>
  <ENTRY>
    <NAME>Microsoft Visual C++ 8.0</NAME>
    <COMMENTS>
    </COMMENTS>
    <ENTRYPOINT>4883EC28E8????00004883C428E9????FFFFCCCCCCCCCCCCCCCCCCCCCCCCCCCC</ENTRYPOINT>
    <ENTIREPE>
    </ENTIREPE>
  </ENTRY>
</SIGNATURES>

As you can see, it is just yet another variant with its own benefits and flaws. It's easy enough to add parsers for these simple formats. The hard part is building an efficient automaton from the patterns in order to scan the decompiled image fast enough. My recent commits in Reko have introduced a suffix array implementation that lets me locate a pattern in O(log n) time, where n is the size of the binary file. Once that work is complete, I should be able to rip through any signature file format (like the one above, the one you're proposing, or the Amiga index hunks, or the DCC signature files) and in O(p * log n) time find all matching signatures located in the file, where p is the number of patterns. Any way to decrease p (say, by partitioning signature files based on detected compiler manufacturer and version) is of course highly beneficial.
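The suffix array lookup can be sketched as follows. This is not Reko's implementation: the construction here is the naive sort of all suffixes (fine for a demo, far too slow for real binaries), wildcards are not handled, and each comparison costs up to the pattern length m, so one lookup is O(m log n). The function names are made up.

```python
def build_suffix_array(data):
    """Start offsets of all suffixes of `data`, in lexicographic order (naive)."""
    return sorted(range(len(data)), key=lambda i: data[i:])


def find_all(data, sa, pattern):
    """All offsets where the literal `pattern` occurs, via two binary searches."""
    m = len(pattern)
    # Lower bound: first suffix whose m-byte prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if data[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose m-byte prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if data[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


IMAGE = bytes.fromhex("4883EC28E8000000004883C428")
SA = build_suffix_array(IMAGE)
```

The array is built once per image; every pattern afterwards only pays for the binary searches.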

My intent with Reko is to be able to handle as many formats as possible, but drawing the line when it gets too complex and distracts me from actual decompilation :-)

lab313ru commented 8 years ago

IDA understands BCC's libs format. (plb utility).

nemerle commented 8 years ago

@lab313ru I don't believe we can use any of their tooling in our open source projects though?

lab313ru commented 8 years ago

We can't, yes. But the idea behind FLIRT signatures is worth borrowing: https://www.hex-rays.com/products/ida/tech/flirt/in_depth.shtml

And my goal was to use bcc signatures in IDA, so...)

uxmal commented 8 years ago

@lab313ru: are FLIRT signatures stored as text, or as a binary format described somewhere? I have no access to IDA so I can't go check myself.

lab313ru commented 8 years ago

Hmm.. I think the pat format description is only available inside the IDA SDK.

But it is not a problem to rewrite signmake to use the max length for symbol names and for pattern length.

lab313ru commented 8 years ago

Maybe for the moment it would be better to let makedsig read a list of lib files? Then it could add the signatures from all of them to a map and process them as before?

I mean combining symbols from many lib files.

nemerle commented 8 years ago

Started writing the spec for the pattern files:

https://github.com/nemerle/dcc/wiki/Cross-decompiler-signature-specification

uxmal commented 8 years ago

Patterns need to be specified:

nemerle commented 8 years ago

Updated with: EBNF-like definition for PATTERN definition

PATTERN :  ("Offset" Number (MATCH_BYTES | SYM_REF_NAME))+
MATCH_BYTES : (HEX_BYTE | WILDCARD)+
HEX_BYTE : "0x" HEX_DIGIT HEX_DIGIT
WILDCARD : "." | "?"
SYM_REF_NAME :  Ident

nemerle commented 8 years ago

Although a more compact representation of MATCH_BYTES might be in order?

uxmal commented 8 years ago

If it's OK to assume hexadecimal representation and 8-bit bytes, you could get rid of the "0x", which adds nothing but padding in that case. Reko has a couple of megabytes of signature files donated by @halsten which all look like this: AD3351?????AEB1A2?????. This form appears to be widely used in the community, and it would be nice to provide support for it.
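Note that the example pattern contains runs of five `?`, so wildcards apply to half-bytes (nibbles), not only whole bytes. A minimal sketch of compiling and matching such a string, under the assumption that each character stands for one nibble; `compile_pattern` and `matches_at` are hypothetical names, not Reko APIs:

```python
def compile_pattern(text):
    """Turn 'AD33??' style text into a list of (mask, value) pairs, one per byte."""
    if len(text) % 2:
        raise ValueError("pattern must have an even number of characters")
    compiled = []
    for i in range(0, len(text), 2):
        mask = value = 0
        for ch in text[i:i + 2]:
            mask <<= 4
            value <<= 4
            if ch != "?":  # a fixed nibble: compare all four bits
                mask |= 0xF
                value |= int(ch, 16)
        compiled.append((mask, value))
    return compiled


def matches_at(data, pos, compiled):
    """True if the compiled pattern matches `data` starting at `pos`."""
    if pos < 0 or pos + len(compiled) > len(data):
        return False
    return all((data[pos + i] & mask) == value
               for i, (mask, value) in enumerate(compiled))
```

A fully wildcarded byte compiles to `(0x00, 0x00)` and therefore matches anything.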

Here's my take on a pattern file format, generalizing a little because not all emitters of machine code are compilers (think obfuscators and packers):

{
    // The defaults if nothing else has been specified
    "Tags": {
        "Vendor": "Borland",
        "Product": "Turbo C",
        "Version": "2.0",
        "Target_machine": "x86-16",
        "Endianness": "little",
        "SourceLanguage": "C"
    },
    "Patterns": [
        {
            "Tags": {
                "Version": "3.0"
            },
            //  4-byte reference to a symbol
            "Match": [ "AAbbCC??D1e2", { "symref": "foo", "size": 4 }, "Fa" ],

            "Result": { "symbol": "malloc" }
        }
    ]
}

Here is a pattern file that could be used to identify a binary as an MS-DOS EXE or an ELF:

{
    "Patterns": [
        {
            // must be at start of file. Not specifying offset means "anywhere"
            "Offset": 0,
            "Match": ["4D5A"],
            "Result": { "imagefile": "MzExecutable" }
        },
        {
            "Offset": 0,
            "Match": ["7F454C46"],
            "Result": { "imagefile": "ElfExecutable" }
        }
    ]
}
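How a consumer might evaluate these two entries can be sketched as below. This only covers literal hex strings in "Match" and numeric or absent "Offset" values; wildcards, symref entries and symbolic offsets are left out, and `identify` is a hypothetical function name.

```python
PATTERNS = [
    {"Offset": 0, "Match": ["4D5A"], "Result": {"imagefile": "MzExecutable"}},
    {"Offset": 0, "Match": ["7F454C46"], "Result": {"imagefile": "ElfExecutable"}},
]


def identify(data, patterns=PATTERNS):
    """Return the "Result" of every pattern that matches `data`."""
    results = []
    for pat in patterns:
        needle = bytes.fromhex("".join(pat["Match"]))
        offset = pat.get("Offset")
        if offset is None:
            hit = needle in data  # no offset given: match anywhere
        else:
            hit = data.startswith(needle, offset)
        if hit:
            results.append(pat["Result"])
    return results
```

For example, `identify(b"MZ\x90\x00")` reports the MzExecutable result, while a buffer that has "MZ" somewhere other than offset 0 matches nothing.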

It would be cool if "Offset" could be specified to not only be a fixed number of bytes from the start of file, but a special symbol "$EntryPoint" which would be the starting point of the program as defined by the image format (PE, ELF etc)

{
   "Offset": "$EntryPoint",
   "Match":  ["7F3A39....A3B8"],
   "Result": { "Packer": "FileCrusher", "Vendor": "Packers'R'us", "Version": "0.3" }
}

nemerle commented 8 years ago

We might want/need to add Compiler_Flags as a required tag, since patterns for Debug/Release and Small/Medium/Large builds will differ.

nemerle commented 8 years ago

OK, I've extended/updated the EBNF for the PATTERN and DATA parts to incorporate your suggestions:

PATTERN:

PATTERN :  PATTERN_ID? ("Offset" OFFSET_SPEC (MATCH_BYTES | SYM_REF_NAME))+ | "@" PATTERN_REF;
PATTERN_ID : Ident;
OFFSET_SPEC : Number | "$EntryPoint";
PATTERN_REF : Ident;
MATCH_BYTES : "[" (HEX_BYTE | WILDCARD)+ "]";
HEX_BYTE : HEX_DIGIT HEX_DIGIT;
WILDCARD : "." | "?";
SYM_REF_NAME : Ident;

DATA:

DATA:          (SYMBOL_DEF META_DEF?) | META_DEF;
META_DEF:      "Meta" FREEFORM_DATA;
SYMBOL_DEF:    "Symbol" SYMBOL_NAME ("Typedef" C_TYPEDEF)?;
SYMBOL_NAME:   "Name" Ident; // Ident is a raw symbol name - no demangling should be done here
C_TYPEDEF:     QuotedString; // C typedef extended with custom calling convention attributes
FREEFORM_DATA: (Ident "=" QuotedString)+; // comments, links to documentation, etc.

As for FREEFORM_DATA - it could be extended into:

META_ENTRY: PACKER_SPEC | LOADER_SPEC | FREEFORM_DATA;
PACKER_SPEC: "Packer" QuotedString;
LOADER_SPEC: "Loader" QuotedString;
FREEFORM_DATA: (Ident "=" QuotedString)+;