radareorg / radare2


Reversible binary patch format #23379

Open rodarima opened 1 week ago

rodarima commented 1 week ago

I would like to generate some binary patches that can be read in plain text, in the same way I do with diff(1) and patch(1). In particular, I want to be able to do these operations:

The default radiff2(1) format is close to what I want.

% radiff2 v2.bin v3.bin
0x00189ca8 58731b => 004c1d 0x00189ca8

It has the benefit that it can be reversed by a simple awk(1) program:

% cat a.patch
0x00189ca8 58731b => 004c1d 0x00189ca8
% awk '{print $5,$4,$3,$2,$1}' < a.patch
0x00189ca8 004c1d => 58731b 0x00189ca8

However, AFAIK this format doesn't seem to be accepted by any tool.
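A reader for this format is small. The sketch below is only an illustration, not an existing radare2 tool: the names `Hunk`, `parse_line`, `reverse` and `apply` are hypothetical, and it assumes every line has exactly the five whitespace-separated fields shown above, with old and new byte strings of equal length.

```python
from typing import NamedTuple


class Hunk(NamedTuple):
    addr: int    # offset of the first changed byte
    old: bytes   # bytes expected before patching
    new: bytes   # bytes written by the patch


def parse_line(line: str) -> Hunk:
    """Parse one line such as '0x00189ca8 58731b => 004c1d 0x00189ca8'."""
    addr, old, arrow, new, _addr2 = line.split()
    if arrow != "=>":
        raise ValueError(f"unexpected separator: {arrow!r}")
    return Hunk(int(addr, 16), bytes.fromhex(old), bytes.fromhex(new))


def reverse(h: Hunk) -> Hunk:
    """Swap old and new, which is what the awk one-liner does textually."""
    return Hunk(h.addr, h.new, h.old)


def apply(data: bytearray, h: Hunk) -> None:
    """Apply a hunk in place, refusing if the current bytes are not h.old."""
    end = h.addr + len(h.old)
    if bytes(data[h.addr:end]) != h.old:
        raise ValueError(f"hunk at {h.addr:#x} does not match the file")
    data[h.addr:end] = h.new
```

Because the old bytes are checked before writing, applying `reverse(h)` after `apply(data, h)` restores the original contents, and a colliding patch is detected instead of silently overwritten.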

The r2 format outputs radare2(1) commands, but they ignore what was at that address before:

% radiff2 -r v2.bin v3.bin
wx 004c1d @ 0x00189ca8

This is not enough, as I want to know if a given patch collides with another one. It also makes it impossible to revert an applied patch.

There is also the rapatch.md format, but it seems to be different from these two. It also seems to have the same problem: it cannot be reverted.

Maybe this problem can be solved by implementing a reversible operator like wx that swaps bytes instead of overwriting an address.

<swap> 004c1d 58731b @ 0x00189ca8

The problem with this approach is that when a swap command fails, it should output that hunk into a reject file, which is probably not what you want from an r2 session.
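To make the swap idea concrete, here is a minimal sketch under one assumption about its semantics: the operator toggles between the two values, and fails when the bytes at the address match neither. The `swap` and `apply_all` names are hypothetical and nothing here is an r2 command.

```python
def swap(data: bytearray, addr: int, a: bytes, b: bytes) -> bool:
    """Exchange a and b at addr. Returns False when neither side matches."""
    if len(a) != len(b):
        raise ValueError("both sides of a swap must have the same length")
    end = addr + len(a)
    current = bytes(data[addr:end])
    if current == a:
        data[addr:end] = b
    elif current == b:
        data[addr:end] = a
    else:
        return False
    return True


def apply_all(data: bytearray, hunks):
    """Apply (addr, a, b) hunks and collect the ones that fail as rejects."""
    rejects = []
    for addr, a, b in hunks:
        if not swap(data, addr, a, b):
            rejects.append((addr, a, b))
    return rejects
```

Running the same swap twice is a no-op, which is what makes it reversible. The returned reject list plays the role of the reject file mentioned above.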

Maybe it would be a better idea to have another tool just for this workflow (which could also work with multiple files at once). You first perform all the changes you want with r2 w commands, then you save the file and generate a patch that can be further edited and applied/reversed:

% radiff2 a.bin b.bin > patch.txt
% $EDITOR patch.txt
% rapatch < patch.txt
% rapatch -R < patch.txt # Reverse the patch

Here is an example of what that patch may look like, which is very close to what patch(1) expects:

--- a.bin   2024-09-24 09:24:41.475235346 +0200
+++ b.bin   2024-09-24 09:24:41.475235346 +0200

The following hunk changes the value 1799000 to 1920000 (in decimal)
at the address 0x00189ca8. Notice how I use LE to specify little endian,
so I can see the raw values clearly.

@@ -0x00189ca8,4 +0x00189ca8,4 @@
- LE 0x001b7358 # 1799000 comment after the # symbol, could be assembly
+ LE 0x001d4c00 # 1920000
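As a quick check of how the `LE` values in this hunk map to the raw bytes that radiff2 reported, here is a small Python sketch. The `encode_u32` helper is hypothetical and assumes a 4-byte unsigned value, matching the `,4` length in the hunk header.

```python
import struct


def encode_u32(value: int, order: str = "LE") -> bytes:
    """Encode a 32-bit unsigned value in the given byte order (LE or BE)."""
    if order not in ("LE", "BE"):
        raise ValueError(f"unknown byte order: {order!r}")
    return struct.pack("<I" if order == "LE" else ">I", value)


old = encode_u32(0x001b7358)  # 1799000
new = encode_u32(0x001d4c00)  # 1920000
print(old.hex(" "), "=>", new.hex(" "))  # prints "58 73 1b 00 => 00 4c 1d 00"
```

The first three bytes on each side are the `58731b => 004c1d` change shown earlier. The fourth byte is `00` in both versions, which is why radiff2 does not list it.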

The benefit of such a format is that:

This format can also be used to insert or remove bytes, leaving a different sized file. It also prevents the problem of using multiple write commands for the same memory location if the hunk addresses are sorted. It also resembles the patch format closely enough that it gets the syntax colors of normal patches on GitHub.

Patches of patches are also readable:

% diff -u patch1.txt patch2.txt
--- patch1.txt  2024-09-24 09:29:00.823234555 +0200
+++ patch2.txt  2024-09-24 09:29:18.273200860 +0200
@@ -1,10 +1,10 @@
 --- a.bin  2024-09-24 09:24:41.475235346 +0200
 +++ b.bin  2024-09-24 09:24:41.475235346 +0200

-The following hunk changes the value 1799000 to 1920000 (in decimal)
+The following hunk changes the value 1799000 to 1920001 (in decimal)
 at the address 0x00189ca8. Notice how I use LE to specify little endian,
 so I can see the raw values clearly.

 @@ -0x00189ca8,4 +0x00189ca8,4 @@
 - LE 0x001b7358 # 1799000 comment after the # symbol, could be assembly
-+ LE 0x001d4c00 # 1920000
++ LE 0x001d4c01 # 1920001

I think I could adapt radiff2.c to output such a format, and maybe modify patch(1) to accept it.

trufae commented 1 week ago

use the -r flag

rodarima commented 1 week ago

As of 2578ff0ac, using radiff2 -r produces a patch that is not reversible:

% radiff2 -r v2.bin v3.bin
wx 004c1d @ 0x00189ca8

% radiff2 -v
radiff2 5.9.5 32634 @ linux-x86-64
birth: git.5.9.4-231-g2578ff0ac5 2024-09-24__17:40:00
commit: 2578ff0ac57765e0c5908fb6559bbbfd86252c12
options: gpl -O1 cs:5 cl:2 meson

trufae commented 1 week ago

you can also use -1 (output in Generic binary DIFF, 0xd1ffd1ff magic header) as well as -X (show two column hexII diffing). but i agree that all of that should probably be unified into a single flag. what is your proposal? the output of -r is compatible with r2: it's an r2 script, and ideally this script should work with r0 (aka ired) too. but that's just stuff from radare. which other bindiffing tools do you care about?

The proposal to create a decent, standardized and extensible plain-text binary patching file format looks quite interesting to me, and i would love to have support for it in r2. btw there's also support for rapatch, but it's just part of r2, not a standalone tool. for consistency with radiff2 it probably makes sense to have a rapatch2 tool instead of an uppercase r2 flag.

$ r2 -h| grep -i patch
Usage: r2 [-ACdfjLMnNqStuvwzX] [-P patch] [-p prj] [-a arch] [-b bits] [-c cmd]
 -P [file]    apply rapatch file and quit

You can read more about this in doc/rapatch.md

trufae commented 6 days ago

just created a new tool that can't be merged until r2-6.0

https://github.com/radareorg/radare2/pull/23391

for now it's just a dummy thing, but I agree that your proposal is important and should be treated as a first class tool. would you like to improve radiff2 to support this output?

i'm not 100% sure about the LE/BE values, because radiff just spots changes in bytes and doesn't really know whether the underlying data is a word or a qword. the patch format can specify that, or maybe we can make some happy assumptions here. i think we have time during 5.9.x, until we reach 6.0, to break abi and provide such a new tool with a proper manpage and a working patch format for unified binary patching.

rodarima commented 6 days ago

Thanks for taking a look.

I'll need to think about the patch format to come up with a spec that makes sense first.

You can always fall back to plain hex bytes if you don't know the data type, but the patch format should allow you to use a more human-friendly format.

The problem with the flow of saving the binary and then diffing with the original is that you miss information on how the user specified the changes. It may be better to generate a patch from r2 itself when you know how those changes were made. This way you can store in the patch comments such as the instruction being changed or ASCII.

I suspect that for any w command that you do with radare2, you can always find the opposite command that would revert that change, and specify it using the same value format.

For example, running wvf 3.21 over a memory location that contains the float 1.23 could generate this patch:

--- a.bin   2024-09-24 09:24:41.475235346 +0200
+++ b.bin   2024-09-24 09:24:41.475235346 +0200
@@ -0x00100000,4 +0x00100000,4 @@
- wvf 1.23
+ wvf 3.21

This is nice for radare2 users because they will already be familiar with the commands, but it doesn't make a lot of sense for users of other tools. It also has the problem that there is no information about the byte order. Something like this:

--- a.bin   2024-09-24 09:24:41.475235346 +0200
+++ b.bin   2024-09-24 09:24:41.475235346 +0200
@@ -0x00100000,4 +0x00100000,4 @@
- LE (float) 1.23
+ LE (float) 3.21

That may be more understandable, especially if we use well-known types like C's.

Another issue is that in the common case, you will always use the same LE/BE for a patch (although we should support the cases when that is not true), so you don't need to pollute every line with LE/BE. It is convenient to define a byte order at the start:

--- a.bin   2024-09-24 09:24:41.475235346 +0200
+++ b.bin   2024-09-24 09:24:41.475235346 +0200
@@ LE @@
@@ -0x00100000,4 +0x00100000,4 @@
- (float) 1.23
+ (float) 3.21

Now, there is the case in which a user may specify integers in different bases hex/dec/octal. In my above case:

--- a.bin   2024-09-24 09:24:41.475235346 +0200
+++ b.bin   2024-09-24 09:24:41.475235346 +0200
@@ LE @@
@@ -0x00189ca8,4 +0x00189ca8,4 @@
- (uint32_t) 1799000 # Using a type that can map a constant into bytes
+ (uint32_t) 1920000 # Notice how I use decimal here

I think it may be good to use the C format for numbers too: 0123 = octal, 123 = dec, 0x123 = hex.
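Note that not every language parses literals this way out of the box. In Python, for instance, `int("0123", 0)` raises `ValueError`, so a patch tool written there would need its own rule. A minimal sketch, with `parse_c_int` being a hypothetical helper name:

```python
def parse_c_int(text: str) -> int:
    """Parse an integer literal with C rules: 0x.. hex, 0.. octal, else decimal."""
    s = text.strip().lower()
    sign = 1
    if s[:1] in ("+", "-"):
        sign = -1 if s[0] == "-" else 1
        s = s[1:]
    if s.startswith("0x"):
        return sign * int(s[2:], 16)
    if len(s) > 1 and s.startswith("0"):
        return sign * int(s, 8)
    return sign * int(s, 10)
```

With this rule `0123`, `123` and `0x123` parse to 83, 123 and 291 respectively.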

This may also be valid:

--- a.bin   2024-09-24 09:24:41.475235346 +0200
+++ b.bin   2024-09-24 09:24:41.475235346 +0200
@@ LE @@
@@ -0x00189ca8,12 +0x00189ca8,12 @@
- (uint32_t []) { 1799000, 1799001, 1799002 }
+ (uint32_t []) { 1920000,   0x123,      07 }

But it starts to complicate the syntax. Also, if I want to add another number, I would need to modify the 12 in the hunk header. So this may be simpler:

--- a.bin   2024-09-24 09:24:41.475235346 +0200
+++ b.bin   2024-09-24 09:24:41.475235346 +0200
@@ LE @@
@@ -0x00189ca8,uint32_t +0x00189ca8,uint32_t @@
- 1799000 # Using a type that can map a constant into bytes
+ 1920000 # Notice how I use decimal here

--- a.bin   2024-09-24 09:24:41.475235346 +0200
+++ b.bin   2024-09-24 09:24:41.475235346 +0200
@@ LE @@
@@ -0x00189ca8,uint32_t[3] +0x00189ca8,uint32_t[3] @@
- 1799000, 1799001, 1799002
+ 1920000,   0x123,    0777 # Notice 0777 is octal
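A typed hunk header like these can be split into address, type and count with one regular expression, and the values packed accordingly. This is only a sketch of the idea: the `TYPES` table, `parse_side` and `encode_values` are hypothetical names, and the set of supported types is an assumption.

```python
import re
import struct

# Hypothetical mapping from C type names to struct format characters.
TYPES = {"uint8_t": "B", "uint16_t": "H", "uint32_t": "I",
         "uint64_t": "Q", "float": "f"}

SIDE = re.compile(r"[-+](0x[0-9a-fA-F]+),(\w+)(?:\[(\d+)\])?$")


def parse_side(side: str):
    """Split '-0x00189ca8,uint32_t[3]' into (address, type name, count)."""
    m = SIDE.match(side)
    if m is None or m.group(2) not in TYPES:
        raise ValueError(f"bad hunk side: {side!r}")
    return int(m.group(1), 16), m.group(2), int(m.group(3) or 1)


def encode_values(type_name: str, values, order: str = "LE") -> bytes:
    """Pack already-parsed numbers as an array of type_name in the given order."""
    prefix = "<" if order == "LE" else ">"
    return struct.pack(f"{prefix}{len(values)}{TYPES[type_name]}", *values)
```

Since the count comes from the header and the element size from the type, the byte length of the hunk no longer has to be maintained by hand.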

All those cases can be mapped to the basic format, where everything is a simple hex string. I don't like to just specify a hex string that could be confused with a number. Also, I think "\x12\x23\x34\x45" contains a lot of noise.

So maybe we can use something like this:

--- a.bin   2024-09-24 09:24:41.475235346 +0200
+++ b.bin   2024-09-24 09:24:41.475235346 +0200
@@ -0x00189ca8,char[4] +0x00189ca8,char[4] @@
- '58 73 1b 00'
+ '00 4c 1d 00'

Notice there is no byte order, as hex strings don't need one. We can also probably specify that the default is LE, and only write BE when needed. There is also mixed endianness, but I think we can ignore it for now and fall back to hex strings if needed.

This basic hex format can probably be implemented in radiff2 without much effort, as we don't even need to output aligned words, just the bytes that differ.
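To illustrate how little is needed, here is a sketch in Python rather than in radiff2's C code. It assumes both files have the same size (no insertions or deletions), and `diff_hunks` and `format_hunk` are hypothetical names.

```python
def diff_hunks(a: bytes, b: bytes):
    """Yield (offset, old, new) for each run of consecutive differing bytes."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-sized inputs")
    i = 0
    while i < len(a):
        if a[i] == b[i]:
            i += 1
            continue
        start = i
        while i < len(a) and a[i] != b[i]:
            i += 1
        yield start, a[start:i], b[start:i]


def format_hunk(offset: int, old: bytes, new: bytes) -> str:
    """Render one hunk in the proposed char[N] hex-string style."""
    return (f"@@ -0x{offset:08x},char[{len(old)}] "
            f"+0x{offset:08x},char[{len(new)}] @@\n"
            f"- '{old.hex(' ')}'\n"
            f"+ '{new.hex(' ')}'")
```

Each run of differing bytes becomes one hunk, so no knowledge of word size or alignment is required.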

I can try to modify radiff2, but I'm not familiar with the codebase so it may take a while.

On a more advanced implementation, one could determine what type of data is placed on which addresses of a binary file, and then produce the appropriate representation in a patch when changing those bytes. Otherwise fall back to hex strings.

The hex format should allow you to split the lines as you want, so you can write instructions properly:

--- a.bin   2024-09-24 09:24:41.475235346 +0200
+++ b.bin   2024-09-24 09:24:41.475235346 +0200
@@ -0x00189ca8,char[4] +0x00189ca8,char[4] @@
- '01 46'       # mov r1, r0
- '68 46'       # mov r0, sp
+ '4f f2 ba fc' # bl 0x254526
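Reading such a hunk back only requires concatenating the quoted hex strings on each side and ignoring everything after `#`. A sketch, assuming the quoting and comment syntax shown above (the `collect_sides` name is hypothetical):

```python
import re

LINE = re.compile(r"^([-+])\s*'([0-9a-fA-F\s]*)'\s*(?:#.*)?$")


def collect_sides(lines):
    """Concatenate the '-' and '+' hex strings of one hunk, ignoring comments."""
    old, new = bytearray(), bytearray()
    for line in lines:
        m = LINE.match(line.strip())
        if m is None:
            raise ValueError(f"not a hex line: {line!r}")
        target = old if m.group(1) == "-" else new
        target += bytes.fromhex(m.group(2))
    return bytes(old), bytes(new)
```

For the example above this gives four bytes on each side, which can then be checked against the `char[4]` length in the header.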

It would also be nice if this format were a superset of the patch format, so you could also apply normal patches with rapatch2 (or even mix hunks). I think this can easily be done with the "type" specifier of the hunk: -0x00189ca8,char[4] specifies a binary patch of 4 bytes, while -00189308,4 specifies 4 lines starting at line 00189308 (decimal).
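One possible way to tell the two kinds of hunk apart, sketched in Python: treat a header as a text hunk when both length fields are plain line counts, and as a binary hunk otherwise. The `hunk_kind` name is hypothetical, and the sketch does not handle unified-diff headers that omit the count.

```python
import re

HEADER = re.compile(r"^@@ -(\S+?),(\S+) \+(\S+?),(\S+) @@")


def hunk_kind(header: str) -> str:
    """Return 'text' when both length fields are line counts, else 'binary'."""
    m = HEADER.match(header)
    if m is None:
        raise ValueError(f"not a hunk header: {header!r}")
    if m.group(2).isdigit() and m.group(4).isdigit():
        return "text"
    return "binary"
```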

This may be useful if you have a mix of source code and blobs and you want to specify a patch to change both.

rodarima commented 6 days ago

This one-liner more or less implements the hex diff (addresses are decimal and start at 1):

% bindiff() { diff -u0p <(od -An -vtx1 -w1 "$1") <(od -An -vtx1 -w1 "$2") | sed '/^@@/s/,\([0-9]*\)/,char[\1]/g'; }
% bindiff v2.bin v3.bin
--- /proc/self/fd/11    2024-09-26 22:10:25.813980894 +0200
+++ /proc/self/fd/13    2024-09-26 22:10:25.813980894 +0200
@@ -1612969,char[3] +1612969,char[3] @@
- 58
- 73
- 1b
+ 00
+ 4c
+ 1d

trufae commented 6 days ago

Let's discuss it in here https://hackmd.io/@BCdr4EkGSKO51w6pf-JUow/r1h5idQRC/edit