skeeto / bf-x86

x86_64 brainfuck compiler
The Unlicense

Code generated by lost kingdom crashes after first prompt #2

Open bradenbest opened 2 months ago

bradenbest commented 2 months ago

Here is a link to a copy of lost kingdom: https://github.com/rdebath/LostKingdom

Doing a test run with Lost Kingdom (lostkng.bf), a 2MB brainfuck program and possibly the largest brainfuck program ever written, works just fine in interpreter mode. When compiled, however, the executable lists no disassembly under objdump -D, and running it produces a segmentation fault after answering the first prompt. This lack of parity between the interpreted and compiled runs of the same brainfuck code indicates a bug in the implementation, though I couldn't say where it's going wrong.

Alas, without debug symbols, there's not much I can give you.

[screenshot: valgrind output]

The code is remarkably small at 700 lines. I can try to look at it myself and study the relevant x86 opcodes to see if I can help, but given my lack of knowledge of the ELF format and the x86 instruction set, I can't make any promises.

Based on what valgrind is saying (though whether this is actually in scope for valgrind, I'm unsure), I'd speculate that the interpreter dynamically allocates memory while the compiler has a fixed memory region that is too small. I can confirm that Lost Kingdom works with a static memory region of 32,768 cells.

skeeto commented 1 month ago

Thanks for reporting this! It should be fixed by 6becc99b. On a quick skim through code generation I spotted the [-256, +256] range check, which set off alarms. That makes no sense! From the debug listing (-D), I confirmed that LostKingdom has large "MOVE" operands, and they're encoded as though they won't be sign-extended, resulting in the bug. The correct encoding is simpler anyway, per the commit.
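For anyone following along, the sign-extension issue can be illustrated with a short sketch (an illustrative encoder, not bf-x86's actual emitter): the `add r/m64, imm8` form (opcode 0x83) sign-extends its one-byte immediate, so only operands in [-128, 127] fit; anything larger needs the imm32 form (opcode 0x81).

```python
import struct

def encode_add_rsi(n):
    """Encode `add rsi, n`, picking imm8 vs imm32 by signed range.
    Illustrative sketch only, not bf-x86's actual code generator."""
    if -128 <= n <= 127:
        # 48 83 C6 ib: add r/m64, imm8 (imm8 is sign-extended to 64 bits)
        return b"\x48\x83\xc6" + struct.pack("<b", n)
    # 48 81 C6 id: add r/m64, imm32 (imm32 is sign-extended to 64 bits)
    return b"\x48\x81\xc6" + struct.pack("<i", n)

# +128 does not fit in a signed byte; stuffing it into imm8 would wrap to -128.
print(encode_add_rsi(128).hex())   # 4881c680000000
print(encode_add_rsi(-128).hex())  # 4883c680
```

Treating the imm8 range as [-256, +256] means a move of +128 gets emitted as the byte 0x80, which the CPU reads back as -128.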

The code buffer is never extended, and I had assumed 4MiB would be enough. However, with no optimization (-O0), LostKingdom is over 8MiB, so it was crashing when I disabled optimization. With that "fixed" by extending it to 256MiB, the program worked fine at -O0 — it didn't need to encode large MOVE ops. So this was a secondary bug.

I don't know what I was thinking back in 2015 with some of this, but it's fair to say I've gotten a lot better at this stuff in the past 9 years!

bradenbest commented 1 month ago

Very nice work, and fast, too! I also found the optimizations slightly questionable, but I couldn't formulate a program to break them. I tried this, for example:

[add 0x30 "0"]
+++++ +++++ +++++ +
+++++ +++++ +++++ +
+++++ +++++ +++++ +

> +++++ +++++
> +
<<[->>++<<]

[expect "a", as 0x30 x 2 + 1 is 0x61]
>>.
<.

and the program correctly prints "a" followed by a newline instead of breaking or printing the number 0 or something. Maybe I just don't understand what the "add" operator is supposed to do: I had assumed that seeing mutate -1, move 2, mutate 2, move -2 would require a more complex set of instructions and reasoning to work out that the intent here is to take the value of the current cell, multiply it by 2 (mutate 2), and add it to the cell two to the right.
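For what it's worth, the loop's effect can be stated directly: `[->>++<<]` drains the current cell into the cell two to the right at a 2:1 ratio. A little Python simulation of the pattern (a hypothetical helper, just to check the arithmetic):

```python
def run_mul_add(tape, p):
    """Interpret the bf idiom [->>++<<] starting at cell p: repeatedly
    decrement cell p and add 2 to cell p+2 until cell p reaches zero."""
    while tape[p]:
        tape[p] = (tape[p] - 1) % 256
        tape[p + 2] = (tape[p + 2] + 2) % 256
    return tape

tape = [0x30, 10, 1]   # the three cells set up by the test program above
run_mul_add(tape, 0)
print(chr(tape[2]))    # a  (1 + 2*0x30 == 0x61)
```

An optimizer that recognizes the pattern can collapse the whole loop into `tape[p+2] += 2 * tape[p]; tape[p] = 0` with no branching at all.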

On that note, there is one optimization that my bytecode compiler implementation does (or that the included 'bfmin' program does) that I don't see in yours and that is worth considering: dead loops.

Consider the following code: [>+>++<<-][This is a comment and it has bf instructions in it, like these.]. The second loop occurs immediately after the loop before it, with no ,<>-+ instructions in between, but in order for execution to get there, the [ has to branch past the ], meaning the cell is zero, meaning the loop is guaranteed never to run. Basically, there are two times in a brainfuck program when the current cell can be safely assumed to be zero during static analysis: at the beginning of the program, and right after a loop (after ]). As long as a mutating instruction doesn't appear during that state, the value of the cell is zero and any [...] can be safely discarded from the program.
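That rule fits in a few lines of Python (a hypothetical `strip_dead_loops` pass, not taken from either implementation):

```python
def strip_dead_loops(src):
    """Drop loops that begin where the current cell is provably zero:
    at program start, or right after a `]` with no ,<>-+ in between.
    Assumes src has balanced brackets."""
    out = []
    zero = True  # is the current cell known to be zero here?
    i = 0
    while i < len(src):
        c = src[i]
        if c == '[' and zero:
            depth = 1        # skip the whole dead loop, nesting included
            i += 1
            while depth:
                if src[i] == '[':
                    depth += 1
                elif src[i] == ']':
                    depth -= 1
                i += 1
            continue         # cell is still zero afterwards
        if c in '+-,<>':
            zero = False     # cell mutated, or pointer moved to unknown cell
        elif c == ']':
            zero = True      # loops only exit when the cell is zero
        out.append(c)
        i += 1
    return ''.join(out)

print(strip_dead_loops('+[->+<][a comment with , - symbols]'))  # +[->+<]
```

Note that `<` and `>` also have to clear the known-zero state, since moving the pointer lands on a cell whose value we know nothing about.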

Some other patterns worth investigating are [<] and [>], which seek left/right to the first cell that is 0 (for example, you can set up a zero-terminated string in memory, go to the beginning of it, and print the whole string with [.>]). I don't know whether there is a single instruction for advancing a pointer to the first byte with a zero value, but knowing x86, it's probably worth looking into.
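In pointer terms, `[>]` is just a forward scan for the first zero cell, which is what `memchr(p, 0, n)` does in C; on x86 it could plausibly be lowered to a `repne scasb` loop. A sketch (hypothetical helper):

```python
def scan_right(tape, p):
    """bf idiom [>]: if the current cell is nonzero, step right and
    re-test; the pointer lands on the first zero cell at or after p."""
    while tape[p]:
        p += 1
    return p

tape = list(b"hello\0world\0")
print(scan_right(tape, 0))  # 5 -- the pointer lands on the terminator
```

`[<]` is the mirror image, scanning backward, for which there is no memchr analogue in C but the same `scasb` pattern works with the direction flag set.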

skeeto commented 1 month ago

but I couldn't formulate a program to break them

Apologies for the crudeness — my bf is super rusty — but here's a simple example:

$ python -c 'print("+[" + ">"*128 + "+"*65 + "." + "<"*127 + "[]<-]")' >example.bf

This should print one A then halt, but before my fix, the x86-compiled version would print A repeatedly until it crashes.

$ ./bf-x86 -i example.bf >/dev/null  # byte-code: ok
$ ./bf-x86 -e example.bf >/dev/null  # x86 JIT: crash

The problem is that the move right by 128 is erroneously encoded as a move left by 128 ("add -128") because the immediate is signed. This wanders out of bounds, though not far enough to reach unmapped memory. The move left is only 127, which encodes properly, followed by an empty loop to break optimization (i.e. no coalescing of adjacent moves), and then one last move left before decrementing the loop counter and looping. So the x86-compiled program keeps moving left until it crashes.
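For the reference behavior, a throwaway interpreter (written just to double-check; this is not bf-x86's byte-code engine) confirms the program should emit exactly one A:

```python
def bf(src, tape_len=65536):
    """Minimal brainfuck interpreter (no `,` input support)."""
    jump, stack = {}, []
    for i, c in enumerate(src):          # pre-match the brackets
        if c == '[':
            stack.append(i)
        elif c == ']':
            j = stack.pop()
            jump[i], jump[j] = j, i
    tape, p, pc, out = [0] * tape_len, 0, 0, []
    while pc < len(src):
        c = src[pc]
        if c == '+':   tape[p] = (tape[p] + 1) % 256
        elif c == '-': tape[p] = (tape[p] - 1) % 256
        elif c == '>': p += 1
        elif c == '<': p -= 1
        elif c == '.': out.append(chr(tape[p]))
        elif c == '[' and tape[p] == 0: pc = jump[pc]  # skip dead loop
        elif c == ']' and tape[p] != 0: pc = jump[pc]  # loop again
        pc += 1
    return ''.join(out)

src = "+[" + ">" * 128 + "+" * 65 + "." + "<" * 127 + "[]<-]"
print(bf(src))  # A
```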

Unfortunately my system's Binutils 2.40 is broken and does not correctly disassemble programs built by my bf-x86, but LLVM's version can do it:

$ llvm-objdump-16 -D example
0000000000400078 <PT_LOAD#0>:
...
  4000a6: 48 83 c6 80                   addq    $-0x80, %rsi
...
  4000b5: 48 83 ee 7f                   subq    $0x7f, %rsi
  4000b9: 38 1e                         cmpb    %bl, (%rsi)
  4000bb: 0f 84 05 00 00 00             je      0x4000c6 <PT_LOAD#0+0x4e>
  4000c1: e9 f3 ff ff ff                jmp     0x4000b9 <PT_LOAD#0+0x41>
  4000c6: 48 83 ee 01                   subq    $0x1, %rsi
...

The rsi register is the data pointer, and these instructions map onto the ">"*128, "<"*127, and "[]<" parts in the Python metaprogram. Note the addq $-128, %rsi. Oops!

I don't see in yours and is worth considering: dead loops.

Hmm, my initial thought was that dead loops are unusual enough that optimizing them out would have little impact, but this situation has enlightened me. Removing them may enable additional optimizations, because it's a peephole optimizer: loops used as comments, as in your example, disrupt optimization across those comments, so that's a real use case, too.

Thanks for the feedback!

bradenbest commented 1 month ago

Apologies for the crudeness

I didn't detect any negative attitude, you're good (edit: it occurs to me that you were probably referring to the bf code as crude; still, totally understandable). It's expected that you, as the person who implemented it, would be better equipped to come up with programs that demonstrate a bug, just as I came up with programs to rigorously test mine, or to do things that are only possible in mine: for example, a bf program that performs arbitrary code execution (intended behavior), and a bf program that demonstrates a bug via an ACE exploit (bug).

Thanks for the feedback!

👍 Edit: The dead-loop removal is also simple enough to be done at the source level, before the program is translated to instructions. I don't remember if I mentioned that, but I imagine implementing it as a fixed-length pattern replacement would be nontrivial, since anything can appear between the brackets.