m4b / goblin

An impish, cross-platform binary parsing crate, written in Rust
MIT License

goblin >= 0.4.2 fails to parse valid ELF Strtab with "bad input invalid utf8" #304

Closed tux3 closed 2 years ago

tux3 commented 2 years ago

Starting with goblin 0.4.2, Strtab::parse does the following:

let mut i = 0;
while i < result.bytes.len() {
    let string = get_str(i, result.bytes, result.delim)?;
    result.strings.push((i, string));
    i += string.len() + 1;
}

However, it appears that the contents of the strtab in valid ELF files are NOT always valid UTF-8.
Some of the strtab entries in my ELF object look like [F4, 65, 02, 00], or [82, 66, 02, 00]. This causes get_str to fail, and the ELF file cannot be parsed.
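To make the failure concrete, here is a minimal sketch of the same kind of walk using only the standard library. `first_invalid_entry` is a hypothetical helper written for illustration, not goblin's code; it assumes entries are NUL-delimited and reports the offset of the first entry that `std::str::from_utf8` rejects.

```rust
use std::str;

/// Walks a NUL-delimited table and returns the offset of the first entry
/// that is not valid UTF-8, or `None` if every entry decodes cleanly.
/// Hypothetical helper for illustration; not part of goblin.
fn first_invalid_entry(table: &[u8]) -> Option<usize> {
    let mut i = 0;
    while i < table.len() {
        // An entry runs up to the next NUL byte (or to the end of the table).
        let len = table[i..]
            .iter()
            .position(|&b| b == 0)
            .unwrap_or(table.len() - i);
        if str::from_utf8(&table[i..i + len]).is_err() {
            return Some(i);
        }
        // Skip the entry and its delimiter.
        i += len + 1;
    }
    None
}
```

Both byte sequences quoted above are rejected at offset 0: `F4` starts a four-byte sequence but is not followed by a continuation byte, and `82` is a continuation byte with no lead byte before it. A table of ordinary section names such as `b"\0.text\0.data\0"` passes.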

The ELF file in question was created by GNU ld; it's a relocatable object.

Here is the `readelf -h` output for the object. Note the number of section headers.

```
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 03 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - GNU
  ABI Version:                       0
  Type:                              REL (Relocatable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x0
  Start of program headers:          0 (bytes into file)
  Start of section headers:          600695752 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           0 (bytes)
  Number of program headers:         0
  Size of section headers:           64 (bytes)
  Number of section headers:         0 (274069)
  Section header string table index: 65535 (274068)
```

The first strtab entry (F4, 65, 02) that goblin fails to parse is at this offset into the file:

000bdc20: f465 0200 81d4 0300 54f5 0200 db28 0400 .e......T....(..

This offset falls inside section number 65535 in the readelf --sections output:

There are 274069 section headers, starting at offset 0x23cde3c8:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000042e95  0000000000000000          274068     0     0
  [ 1] .group            GROUP            0000000000000000  00000040
       0000000000000010  0000000000000004          197859   266747     4
<... 65k sections snipped ... >
  [65534] .group            GROUP            0000000000000000  000bdc10
       000000000000000c  0000000000000004          197859   258306     4
  [65535] .group            GROUP            0000000000000000  000bdc1c
       0000000000000014  0000000000000004          197859   332560     4
  [65536] .group            GROUP            0000000000000000  000bdc30
       000000000000000c  0000000000000004          197859   315046     4
<...snip...>

readelf -s shows the following:

<... 65k sections snipped ...>
 65534: 0000000000000000     0 SECTION LOCAL  DEFAULT 65534 .group
 65535: 0000000000000000     0 SECTION LOCAL  DEFAULT 65535 .group
 65536: 0000000000000000     0 SECTION LOCAL  DEFAULT 65536 .group
<...snip...>

I don't know this particular corner of the ELF spec very well at all, but I believe something special must be happening when there are more than 2^16 section headers, and the strtab contents may not always be valid UTF-8 strings.

Sometimes this bug fails to reproduce, even though the number of section headers is far above 65k. I believe this is because the 3 bytes of binary data in a strtab entry can accidentally happen to be valid UTF-8. This may be why I didn't run into this bug before today.
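A small sketch of why the failure is intermittent, assuming the entries are little-endian 32-bit values (which is how the hexdump above reads). `word_looks_like_utf8` is a hypothetical helper, not part of goblin: it takes the bytes of one word up to the first NUL and checks whether they would pass a UTF-8 check.

```rust
/// Encodes `word` as little-endian bytes and reports whether the bytes
/// before the first NUL would pass a UTF-8 check.
/// Hypothetical helper for illustration; not part of goblin.
fn word_looks_like_utf8(word: u32) -> bool {
    let bytes = word.to_le_bytes();
    // Treat the first NUL byte as the string delimiter, as a strtab would.
    let len = bytes.iter().position(|&b| b == 0).unwrap_or(bytes.len());
    std::str::from_utf8(&bytes[..len]).is_ok()
}
```

The value `0x000265F4` (bytes `F4 65 02 00`) is rejected, while a nearby value such as `0x00026574` (bytes `74 65 02 00`) is plain ASCII and is accepted. Whether a given file fails therefore depends on the particular values it contains.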

I can upload the object file if that would help (though it is from a large statically linked binary compiled in debug and 590M big).

philipc commented 2 years ago

I think there are two bugs:

  1. the strtab contents may not be valid UTF-8 strings
  2. when e_shstrndx == SHN_XINDEX, the index of the section header string table should be obtained from the sh_link field of the section header at index 0

I suspect that for the file you are parsing, the first bug is only occurring as a result of the second bug. That is, we're using the wrong section index (65535) for the string table, and as a result we're trying to parse the contents of a .group section as UTF-8 strings. So if you fix the second bug, the UTF-8 parsing will no longer be a problem for this file (but it might be a problem for other files).
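As a rough sketch of the second fix, the lookup could be written along these lines. The `SectionHeader` struct and `resolve_shstrndx` function are hypothetical stand-ins for illustration, not goblin's actual types or API; only the `SHN_XINDEX` value (0xffff) and the use of `sh_link` from section header 0 come from the ELF specification.

```rust
/// Value of `e_shstrndx` signalling that the real index is stored in
/// the `sh_link` field of section header 0.
const SHN_XINDEX: u16 = 0xffff;

/// Minimal stand-in for a parsed section header; only the field used here.
/// Hypothetical type, not goblin's `SectionHeader`.
struct SectionHeader {
    sh_link: u32,
}

/// Returns the index of the section header string table, or `None` if the
/// header says SHN_XINDEX but section header 0 is unavailable.
fn resolve_shstrndx(e_shstrndx: u16, section0: Option<&SectionHeader>) -> Option<u32> {
    if e_shstrndx == SHN_XINDEX {
        // Extended numbering: the real index lives in section 0's sh_link.
        section0.map(|s| s.sh_link)
    } else {
        Some(u32::from(e_shstrndx))
    }
}
```

For the file in this issue, `e_shstrndx` is 65535 and section 0 has a Link value of 274068, so this would select section 274068 rather than the `.group` section at index 65535.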

The code that needs fixing: https://github.com/m4b/goblin/blob/3f5f70e0e68243559f6449bd9ad3517be2c206d0/src/elf/mod.rs#L291-L292

There's possibly other code that needs fixing to handle large numbers of sections too (e.g. e_shnum and st_shndx can overflow).
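The `e_shnum` case follows the same pattern: when the count does not fit, the header stores 0 and the real count is in the `sh_size` field of section header 0. A minimal sketch, with the hypothetical names `SectionHeader0` and `resolve_shnum` (again not goblin's API); it does not cover `st_shndx`, which needs a separate lookup.

```rust
/// Minimal stand-in for section header 0; only the field used here.
/// Hypothetical type, not goblin's `SectionHeader`.
struct SectionHeader0 {
    sh_size: u64,
}

/// Returns the number of section headers, or `None` if the count is stored
/// in section header 0 and that header is unavailable.
fn resolve_shnum(e_shnum: u16, e_shoff: u64, section0: Option<&SectionHeader0>) -> Option<u64> {
    if e_shnum != 0 {
        // The common case: the count fits in the file header.
        return Some(u64::from(e_shnum));
    }
    if e_shoff == 0 {
        // No section header table at all.
        return Some(0);
    }
    // Extended numbering: the real count lives in section 0's sh_size.
    section0.map(|s| s.sh_size)
}
```

With the values from the readelf output above (`e_shnum` of 0, section 0 Size of 0x42e95), this yields 274069 section headers.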

Here's an example of better e_shstrndx parsing: https://github.com/gimli-rs/object/blob/c4760714aa9ca6f73cd5e76991463ed1e3497589/src/read/elf/file.rs#L582-L600