sheredom / utf8.h

📚 single header utf8 string functions for C and C++
The Unlicense

Some minor overflow bugs #70

Closed DavidKorczynski closed 4 years ago

DavidKorczynski commented 4 years ago

Hi maintainers,

I did some minor analysis on your library using the KLEE symbolic execution engine, which at its core tries to explore all the different execution paths in the software under analysis to find bugs. It's an academic tool and you can see it here: https://klee.github.io/. During this process I found several overflows in the code, and I wanted to report them collectively, which is the purpose of this issue.

The small example program I used is the following:

#include <klee/klee.h>
#include "utf8.h"

int
main(int argc, char **argv)
{
        /* Two symbolic 10-byte buffers, each constrained to end in NUL. */
        char arr[10];
        klee_make_symbolic(arr, 10, "arr");
        klee_assume(arr[9] == '\0');

        char arr1[10];
        klee_make_symbolic(arr1, 10, "arr1");
        klee_assume(arr1[9] == '\0');

        void *arr_check = utf8valid(arr);
        void *arr1_check = utf8valid(arr1);
        if (arr_check != 0 && arr1_check != 0)
        {
                if (utf8ncasecmp(arr_check, arr1_check, 9) == 0)
                        return 1;
                return 0;
        }
        return 1;
}

The calls to klee_make_symbolic tell KLEE to treat the contents of the arr and arr1 buffers as unknowns in an equation system. You don't need to understand the details of this to follow the bugs I am reporting. The main point is that each bug fixes arr and arr1 to specific byte values for which executing utf8ncasecmp results in an out-of-bounds memory access. I have also confirmed some of the bugs shown here with AddressSanitizer.

If my code snippet above uses your library in an erroneous manner then please disregard the bugs.

Bugs:

Bug 1

arr value: "\xf0\x00\x00\x01\xf0\x00\x01\x01\xf0\x00"
arr1 value: "\xf0\x00\x00\x01\xf0\x00\x01\x01\xff\x00"
Type: memory-out-of-bound
Stack trace:

#0 in utf8codepoint at ./utf8.h:987
#1 in utf8ncasecmp at ./utf8.h:507

Bug 2

arr value: "\xf0\x00\x00\x01\xf0\x00\x01\x01\xe0\x00"
arr1 value: "\xf0\x00\x00\x01\xf0\x00\x01\x01\xff\x00"
Type: memory-out-of-bound
Stack trace:

#0 in utf8codepoint at ./utf8.h:992
#1 in utf8ncasecmp at ./utf8.h:507

Bug 3

arr value: "\xf0\x00\x00\x01\xe0\x00\x01\xf0\xff\x00"
arr1 value: "\xf0\x00\x00\x01\xf0\x00\x00\x01\xff\x00"
Type: memory-out-of-bound
Stack trace:

#0 in utf8codepoint at ./utf8.h:987
#1 in utf8ncasecmp at ./utf8.h:507

Bug 4

arr value: "\xf0\x00\x00\x01\xf0\x00\x00\x01\xc1\x00"
arr1 value: "\xf0\x00\x00\x01\xf0\x00\x00\x01\xc1\x00"
Type: memory-out-of-bound
Stack trace:

#0 in utf8codepoint at ./utf8.h:984
#1 in utf8ncasecmp at ./utf8.h:507

Bug 5

arr value: "\xf1\x00\x00\x00Y\xf0\x00\x01\x00\x00"
arr1 value: "\xf1\x00\x00\x00\xc19\xf0\x00\x01\x00"
Type: memory-out-of-bound
Stack trace:

#0 in utf8ncasecmp at ./utf8.h:494

Bug 6

arr value: "\xe1\x80\x00\xe0\x01\x1b\xf0\x00\x02\x00"
arr1 value: "\xf0\x01\x00\x00\xf0\x00\x01\x1b\xc2\x00"
Type: memory-out-of-bound
Stack trace:

#0 in utf8ncasecmp at ./utf8.h:494

Bug 7

arr value: "\xf1\x00\x00\x00\xc3 \xc2\x00\x00\x00"
arr1 value: "\xf1\x00\x00\x00\xe0\x03\x00\xe0\x02\x00"
Type: memory-out-of-bound
Stack trace:

#0 in utf8ncasecmp at ./utf8.h:468

Bug 8

arr value: "\xf1\x00\x00\x00\xe0\x03\x17\xe0\x01\x00"
arr1 value: "\xf1\x80\x00\x00\xc3\x17\xc1\x00\xff\x00"
Type: memory-out-of-bound
Stack trace:

000002175 in utf8ncasecmp at ./utf8.h:481

Bug 9

arr value: "\xf1\x00\x00\x00\xc3 \xc2\x00\xc0\x00"
arr1 value: "\xf1\x00\x00\x00\xe0\x03\x00\xe0\x02\x00"
Type: memory-out-of-bound
Stack trace:

#0 in utf8ncasecmp at ./utf8.h:470

Bug 10

arr value: "\xf1\x00\x00\x00\xc3\x00\xe0\x01\x00\x00"
arr1 value: "\xf1\x00\x00\x00\xe0\x03 \xe0\x01\x00"
Type: memory-out-of-bound
Stack trace:

#0 in utf8ncasecmp at ./utf8.h:481

Bug 11

arr value: "\xf1\x00\x00\x00\xc1\x10\xc1\x00\xf0\x00"
arr1 value: "\xf1\x80\x00\x00\xe0\x010\xe0\x01\x00"
Type: memory-out-of-bound
Stack trace:

#0 in utf8ncasecmp at ./utf8.h:496
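
A possible reading of the utf8codepoint traces (Bugs 1 to 3) is that a lead byte near the end of the buffer announces a longer sequence than the buffer has room for. The sketch below is only an illustrative model of a decoder that trusts lead bytes. It is not code taken from utf8.h, and the names implied_len and first_overrun are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Sequence length implied by a UTF-8 lead byte, read from its top bits.
 * Illustrative model only; not copied from utf8.h. */
size_t implied_len(unsigned char lead)
{
    if ((lead & 0xf8) == 0xf0) return 4; /* 11110xxx */
    if ((lead & 0xf0) == 0xe0) return 3; /* 1110xxxx */
    if ((lead & 0xe0) == 0xc0) return 2; /* 110xxxxx */
    return 1;                            /* anything else: one byte */
}

/* Step through buf one implied sequence at a time, without looking at
 * continuation bytes or NUL terminators. Returns the offset of the first
 * sequence that would extend past the buffer, or -1 if none does. */
long first_overrun(const unsigned char *buf, size_t size)
{
    size_t i = 0;

    assert(buf != NULL);
    while (i < size) {
        size_t len = implied_len(buf[i]);
        if (len > size - i)
            return (long)i; /* a read of len bytes here leaves the buffer */
        i += len;
    }
    return -1;
}
```

Under this model, Bug 1's arr value gives offset 8: the 0xf0 there implies a 4-byte sequence, but only 2 bytes remain in the 10-byte buffer. Bug 1's arr1 value, which has 0xff at the same position, does not overrun.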

sheredom commented 4 years ago

So I've done some digging into this, and I think you are using the function utf8valid incorrectly: it returns 0 on success, and a pointer to the offending codepoint on failure. So the != 0 check in your example only passes strings that failed validation on to utf8ncasecmp.

Honestly this is a bit gnarly though - in hindsight I probably would have made the function return true/false, and have an optional 'codepoint that failed' arg.
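
A minimal sketch of the true/false shape described above, assuming a validator with the same return convention (0 on success, pointer to the offending position on failure). Both stub_valid and is_valid are hypothetical names for this example. stub_valid only stands in for utf8valid so the sketch compiles without utf8.h: it accepts 7-bit ASCII only and is not a real UTF-8 validator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for utf8valid. It mirrors only the return
 * convention: NULL when the string is accepted, otherwise a pointer to
 * the first offending byte. It accepts 7-bit ASCII only. */
const char *stub_valid(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if ((unsigned char)*s >= 0x80)
            return s; /* failure: points at the offending byte */
    }
    return NULL; /* success */
}

/* Hypothetical wrapper: a plain true/false result, plus an optional
 * out-parameter that receives the failure position (NULL on success).
 * Pass NULL for bad when the position is not needed. */
bool is_valid(const char *s, const char **bad)
{
    const char *p = stub_valid(s);

    if (bad != NULL)
        *bad = p;
    return p == NULL;
}
```

With a wrapper of this shape, the guard in the example program would read if (is_valid(arr, NULL) && is_valid(arr1, NULL)). Against utf8.h as it stands, the equivalent check is utf8valid(arr) == 0 && utf8valid(arr1) == 0.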