oracle / linux-uek

Oracle Linux UEK: Unbreakable Enterprise Kernel
https://blogs.oracle.com/linuxkernel

"Reserved VA" feature fails with "Exec format error" when encountering ELF file with PT_NOTE program header with filesz==0 #24

Open MarkMielke opened 7 months ago

MarkMielke commented 7 months ago

A third party build of Python 3.10 is failing on Oracle UEK R6 and Oracle UEK R7:

-bash-4.2$ uname -r
5.4.17-2136.330.7.1.el7uek.x86_64
-bash-4.2$ ./python3.10
-bash: ./python3.10: cannot execute binary file
bash-4.4$ uname -r
5.15.0-204.147.6.2.el8uek.x86_64

bash-4.4$ ./python3.10
bash: ./python3.10: cannot execute binary file: Exec format error

The same program works fine on Red Hat Linux kernels 3.10 and 4.18, as well as on custom builds of Linux kernel 6.1 and 6.6.

After a deep dive, I found that the issue seems to be due to the "Reserved VA" feature introduced with this commit:

commit a48aa31a29ea85c9c08d88ac9adb1cb07b7ee670
Author: Khalid Aziz <khalid.aziz@oracle.com>
Date:   Thu Aug 8 14:01:10 2019 -0600

    mm: Allow userspace to reserve VA range for use by userspace only
    ...

For whatever reason, the PT_NOTE program header in the python 3.10 binary has p_offset == 0, p_filesz == 0, and p_memsz == 0. The third party ships a python 3.9 binary which is ok, and a python 3.10 binary which is not ok. The python 3.9 binary is "not stripped" while the python 3.10 binary is "stripped". This makes me suspect that their version of binutils had a bug, and that something like "strip" corrupted the PT_NOTE program header. I have not been able to reproduce the corruption yet; I only know that it exists:

bash$ od -A d -t x1 --skip-bytes 400 --read-bytes 56 python3.10
0000400 04 00 00 00 04 00 00 00 00 00 00 00 00 00 00 00
0000416 54 02 40 00 00 00 00 00 54 02 40 00 00 00 00 00
0000432 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000448 04 00 00 00 00 00 00 00
0000456
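
To make the dump easier to read, here is a minimal sketch that decodes those 56 bytes as a 64-bit program header. The struct phdr64, the PT_NOTE_TYPE macro, and decode_phdr are illustrative names (not from any kernel or binutils source), the layout mirrors Elf64_Phdr, and a little-endian host is assumed:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the 64-bit ELF program header (same field layout as
 * Elf64_Phdr); defined here only to keep the sketch self-contained. */
struct phdr64 {
    uint32_t p_type;
    uint32_t p_flags;
    uint64_t p_offset;
    uint64_t p_vaddr;
    uint64_t p_paddr;
    uint64_t p_filesz;
    uint64_t p_memsz;
    uint64_t p_align;
};
static_assert(sizeof(struct phdr64) == 56, "unexpected padding in phdr64");

#define PT_NOTE_TYPE 4u /* numeric value of PT_NOTE */

/* The 56 bytes printed by od for the failing python3.10 binary. */
static const uint8_t broken_phdr_bytes[56] = {
    0x04, 0, 0, 0, 0x04, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0x54, 0x02, 0x40, 0, 0, 0, 0, 0, 0x54, 0x02, 0x40, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0x04, 0, 0, 0, 0, 0, 0, 0
};

/* Copy the raw bytes into the struct. Assumes a little-endian host,
 * which matches the x86_64 systems in this report. */
static struct phdr64 decode_phdr(const uint8_t bytes[56])
{
    struct phdr64 ph;
    memcpy(&ph, bytes, sizeof ph);
    return ph;
}
```

Decoded this way, the header reads as p_type == 4 (PT_NOTE), p_flags == 4, p_vaddr == p_paddr == 0x400254, p_align == 4, with p_offset, p_filesz, and p_memsz all zero.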

Oracle UEK R6 and UEK R7 both parse the PT_NOTE segment and fail with ENOEXEC when they encounter this file. The Red Hat kernel and the upstream kernel do not implement this parsing and do not fail.

I was able to patch the program header in place, setting p_offset to the file offset of the first note, and p_filesz/p_memsz to the size that reaches the end of the second note:

bash$ od -A d -t x1 --skip-bytes 400 --read-bytes 56 python3.10-fixed
0000400 04 00 00 00 04 00 00 00 78 07 00 00 00 00 00 00
0000416 54 02 40 00 00 00 00 00 54 02 40 00 00 00 00 00
0000432 44 00 00 00 00 00 00 00 44 00 00 00 00 00 00 00
0000448 04 00 00 00 00 00 00 00
0000456 
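
The in-place patch amounts to overwriting three fields of one program header. A rough sketch of that step on an in-memory copy of the 56 bytes follows; struct phdr64 and patch_note_phdr are illustrative names, the layout mirrors Elf64_Phdr, a little-endian host is assumed, and the step of reading and writing the bytes at file offset 400 is left out:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the 64-bit ELF program header (same layout as Elf64_Phdr). */
struct phdr64 {
    uint32_t p_type;
    uint32_t p_flags;
    uint64_t p_offset;
    uint64_t p_vaddr;
    uint64_t p_paddr;
    uint64_t p_filesz;
    uint64_t p_memsz;
    uint64_t p_align;
};
static_assert(sizeof(struct phdr64) == 56, "unexpected padding in phdr64");

/* Overwrite p_offset, p_filesz and p_memsz of one program header held in a
 * byte buffer, leaving every other field untouched. Assumes little-endian. */
static void patch_note_phdr(uint8_t bytes[56], uint64_t offset, uint64_t size)
{
    struct phdr64 ph;
    memcpy(&ph, bytes, sizeof ph);
    ph.p_offset = offset;
    ph.p_filesz = size;
    ph.p_memsz  = size;
    memcpy(bytes, &ph, sizeof ph);
}
```

For this particular binary the call would be patch_note_phdr(buf, 0x778, 0x44), which produces the bytes shown in the python3.10-fixed dump.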

With the python3.10-fixed, UEK R6 and UEK R7 both work again:

-bash-4.2$ uname -r
5.4.17-2136.330.7.1.el7uek.x86_64
-bash-4.2$ ./python3.10-fixed
Python 3.10.8 (main, Nov 24 2022, 16:36:59) [GCC 8.2.1 20180905 (Red Hat 8.2.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 
bash-4.4$ uname -r
5.15.0-204.147.6.2.el8uek.x86_64
bash-4.4$ ./python3.10-fixed
Python 3.10.8 (main, Nov 24 2022, 16:36:59) [GCC 8.2.1 20180905 (Red Hat 8.2.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 

I believe the problem code is here (fs/binfmt_elf.c):

static int get_elf_notes(struct linux_binprm *bprm, struct elf_phdr *phdr, char **notes, size_t *notes_sz)
{
        char *data;
        size_t datasz;
        int ret;

        if (!phdr)
                return 0;

        datasz = phdr->p_filesz;
        if ((datasz > MAX_FILE_NOTE_SIZE) || (datasz < sizeof(struct elf_note)))
                return -ENOEXEC;
        …
}

If p_filesz==0, then it will not be > MAX_FILE_NOTE_SIZE, but it will be < sizeof(struct elf_note), therefore it will fail with ENOEXEC.
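
To illustrate the distinction I am asking for, here is a userspace sketch of a size check that treats an empty segment as "nothing to parse" rather than as an error. This is not the UEK code and not a proposed patch: struct note_hdr stands in for struct elf_note, SKETCH_MAX_NOTE_SIZE is a hypothetical limit (the real MAX_FILE_NOTE_SIZE value is not shown above), and check_note_size and the NOTE_SKIP / NOTE_PARSE values are made-up names:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Stand-in for the kernel's struct elf_note: three 32-bit words. */
struct note_hdr {
    uint32_t n_namesz;
    uint32_t n_descsz;
    uint32_t n_type;
};
static_assert(sizeof(struct note_hdr) == 12, "note header should be 12 bytes");

/* Hypothetical upper bound, chosen only for this sketch. */
#define SKETCH_MAX_NOTE_SIZE (4u * 1024u * 1024u)

enum note_check { NOTE_SKIP = 0, NOTE_PARSE = 1 };

/* Returns NOTE_SKIP for an empty segment, NOTE_PARSE for a plausible size,
 * and -ENOEXEC for a size that is non-zero but either too small to hold a
 * note header or larger than the limit. */
static int check_note_size(uint64_t p_filesz)
{
    if (p_filesz == 0)
        return NOTE_SKIP;
    if (p_filesz > SKETCH_MAX_NOTE_SIZE || p_filesz < sizeof(struct note_hdr))
        return -ENOEXEC;
    return NOTE_PARSE;
}
```

With this shape, the broken python3.10 header (p_filesz == 0) would be skipped, the patched header (p_filesz == 0x44) would be parsed, and a truncated segment of, say, 4 bytes would still be rejected.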

I believe it makes sense to fail (perhaps with ENOEXEC, perhaps with something else) if the ELF data structure appears to be corrupt in such a way that the parsing is inconclusive as to what is intended. If the notes segment has a non-zero but unexpectedly short p_filesz, then it is not possible to read the note header fields to make further decisions. However, p_filesz==0 is a special case.

I do think that returning ENOEXEC here is asking for trouble. If non-UEK kernels are happy with these files (whether they are corrupt, non-compliant, or just a situation nobody tested for) but the UEK kernel fails, this does not follow the robustness principle: be liberal in what you accept, and conservative in what you produce. In this case, I don't want to have to switch back to Red Hat for the users of this tool, who are not users of the "Reserved VA" extension. This is essentially a robustness issue.

If I look at binutils, glibc, and other parsers of ELF, it is common to ignore segments that have p_filesz==0. Commit history suggests that attempts have been made to make this stricter in the past, but they were relaxed due to real-life situations where p_filesz==0 is possible.

binutils has code like this:

  /* Read in program headers and parse notes.  */
  for (i = 0; i < i_ehdr.e_phnum; ++i, ++i_phdr)
    {
      ...
      if (i_phdr->p_type == PT_NOTE && i_phdr->p_filesz > 0)
      ...

There could be other parts of the "Reserved VA" code in UEK R6 and UEK R7 that also fail if p_filesz == 0. I found the one that seemed most suspect.

There could be other issues as well, although patching only p_offset / p_filesz / p_memsz in the file resolved the problem for me.

I did notice a second issue, but so far it doesn't seem to have caused a problem:

        off = round_up(sizeof(note.nhdr) + NOTE_NAME_SZ,
                       ELF_GNU_PROPERTY_ALIGN);
        if (off > n)
                return -ENOEXEC;

The parse_elf_properties function in UEK fs/binfmt_elf.c seems to presume a fixed ELF_GNU_PROPERTY_ALIGN based upon the target platform: 4 bytes on a 32-bit system, and 8 bytes on a 64-bit system. However, Linux seems to have a well-established extension whereby it is acceptable to use either 4-byte or 8-byte alignment for program segments and sections. The choice depends upon the section's sh_addralign value: sh_addralign <= 4 uses 4, and sh_addralign == 8 uses 8. These sections then have to be contained within program segments whose p_align matches. In the case of notes, it is common to have two PT_NOTE segments, one with p_align == 4 and one with p_align == 8.

Here is a reference for the last issue, fixed upstream in binutils:

https://sourceware.org/bugzilla/show_bug.cgi?id=22444
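
As a rough illustration of the alignment point, the sketch below computes note offsets with the alignment taken from the segment's p_align instead of a per-platform constant. The helper names (round_up_pow2, note_align, note_desc_offset, note_next_offset) and NOTE_HDR_SIZE are made up for this example; they are not the kernel's or binutils' identifiers, and this only models the "8 if p_align == 8, else 4" rule described above:

```c
#include <assert.h>
#include <stdint.h>

#define NOTE_HDR_SIZE 12u /* n_namesz, n_descsz, n_type: three 32-bit words */

/* Round x up to a multiple of align; align must be a power of two. */
static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (x + align - 1) & ~(align - 1);
}

/* Pick the note alignment from the segment: 8 only when the segment
 * asks for it, otherwise 4. */
static uint64_t note_align(uint64_t p_align)
{
    return p_align == 8 ? 8 : 4;
}

/* Offset of the descriptor, measured from the start of the note. */
static uint64_t note_desc_offset(uint32_t namesz, uint64_t p_align)
{
    return round_up_pow2(NOTE_HDR_SIZE + namesz, note_align(p_align));
}

/* Offset of the following note, measured from the start of this note. */
static uint64_t note_next_offset(uint32_t namesz, uint32_t descsz,
                                 uint64_t p_align)
{
    uint64_t end = note_desc_offset(namesz, p_align) + descsz;
    return round_up_pow2(end, note_align(p_align));
}
```

For a note named "GNU" (namesz == 4) the descriptor offset comes out as 16 under both alignments, which may be why the fixed-alignment check has not misfired for me so far. The difference shows up in where the next note starts: with a 20-byte descriptor it is 36 for p_align == 4 and 40 for p_align == 8.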

tvierling commented 7 months ago

Thank you for the outstanding report and analysis! We're now tracking this bug internally and will close this issue when it's fixed in a known UEK version.