Hi maintainers,

I ran a small analysis of your library with the KLEE symbolic execution engine (https://klee.github.io/), an academic tool that at its core tries to explore all execution paths in the software under analysis in order to find bugs. During this process I found several overflows in the code, and I am reporting them collectively in this issue.

The small example program I used is the following:

#include <klee/klee.h>

#include "utf8.h"

int
main(int argc, char **argv)
{
        /* Two symbolic 10-byte buffers, each constrained to end in NUL. */
        char arr[10];
        klee_make_symbolic(arr, 10, "arr");
        klee_assume(arr[9] == '\0');

        char arr1[10];
        klee_make_symbolic(arr1, 10, "arr1");
        klee_assume(arr1[9] == '\0');

        void *arr_check = utf8valid(arr);
        void *arr1_check = utf8valid(arr1);

        /* Compare only when both utf8valid calls returned a non-null pointer. */
        if (arr_check != 0 && arr1_check != 0)
        {
                if (utf8ncasecmp(arr_check, arr1_check, 9) == 0)
                        return 1;
                return 0;
        }
        return 1;
}

The calls to klee_make_symbolic tell KLEE to treat the values in the arr and arr1 buffers as unknowns in an equation system. It is not necessary to understand the details of this to understand the bugs I am reporting. The main point is that each bug below gives concrete byte values for arr and arr1 for which the execution of utf8ncasecmp results in an out-of-bounds memory access. I have also reproduced some of these bugs with AddressSanitizer to confirm them.

If my code snippet above uses your library incorrectly, please disregard the bugs.

Bugs:

Bug 1

arr value: "\xf0\x00\x00\x01\xf0\x00\x01\x01\xf0\x00"
arr1 value: "\xf0\x00\x00\x01\xf0\x00\x01\x01\xff\x00"
Type: memory-out-of-bound
Stack trace:
    #0 in utf8codepoint at ./utf8.h:987
    #1 in utf8ncasecmp at ./utf8.h:507

Bug 2

arr value: "\xf0\x00\x00\x01\xf0\x00\x01\x01\xe0\x00"
arr1 value: "\xf0\x00\x00\x01\xf0\x00\x01\x01\xff\x00"
Type: memory-out-of-bound
Stack trace:
    #0 in utf8codepoint at ./utf8.h:992
    #1 in utf8ncasecmp at ./utf8.h:507

Bug 3

arr value: "\xf0\x00\x00\x01\xe0\x00\x01\xf0\xff\x00"
arr1 value: "\xf0\x00\x00\x01\xf0\x00\x00\x01\xff\x00"
Type: memory-out-of-bound
Stack trace:
    #0 in utf8codepoint at ./utf8.h:987
    #1 in utf8ncasecmp at ./utf8.h:507

Bug 4

arr value: "\xf0\x00\x00\x01\xf0\x00\x00\x01\xc1\x00"
arr1 value: "\xf0\x00\x00\x01\xf0\x00\x00\x01\xc1\x00"
Type: memory-out-of-bound
Stack trace:
    #0 in utf8codepoint at ./utf8.h:984
    #1 in utf8ncasecmp at ./utf8.h:507

Bug 5

arr value: "\xf1\x00\x00\x00Y\xf0\x00\x01\x00\x00"
arr1 value: "\xf1\x00\x00\x00\xc19\xf0\x00\x01\x00"
Type: memory-out-of-bound
Stack trace:
    #0 in utf8ncasecmp at ./utf8.h:494

Bug 6

arr value: "\xe1\x80\x00\xe0\x01\x1b\xf0\x00\x02\x00"
arr1 value: "\xf0\x01\x00\x00\xf0\x00\x01\x1b\xc2\x00"
Type: memory-out-of-bound
Stack trace:
    #0 in utf8ncasecmp at ./utf8.h:494

Bug 7

arr value: "\xf1\x00\x00\x00\xc3 \xc2\x00\x00\x00"
arr1 value: "\xf1\x00\x00\x00\xe0\x03\x00\xe0\x02\x00"
Type: memory-out-of-bound
Stack trace:
    #0 in utf8ncasecmp at ./utf8.h:468

Bug 8

arr value: "\xf1\x00\x00\x00\xe0\x03\x17\xe0\x01\x00"
arr1 value: "\xf1\x80\x00\x00\xc3\x17\xc1\x00\xff\x00"
Type: memory-out-of-bound
Stack trace:
    #0 in utf8ncasecmp at ./utf8.h:481

Bug 9

arr value: "\xf1\x00\x00\x00\xc3 \xc2\x00\xc0\x00"
arr1 value: "\xf1\x00\x00\x00\xe0\x03\x00\xe0\x02\x00"
Type: memory-out-of-bound
Stack trace:
    #0 in utf8ncasecmp at ./utf8.h:470

Bug 10

arr value: "\xf1\x00\x00\x00\xc3\x00\xe0\x01\x00\x00"
arr1 value: "\xf1\x00\x00\x00\xe0\x03 \xe0\x01\x00"
Type: memory-out-of-bound
Stack trace:
    #0 in utf8ncasecmp at ./utf8.h:481

Bug 11

arr value: "\xf1\x00\x00\x00\xc1\x10\xc1\x00\xf0\x00"
arr1 value: "\xf1\x80\x00\x00\xe0\x010\xe0\x01\x00"
Type: memory-out-of-bound
Stack trace:
    #0 in utf8ncasecmp at ./utf8.h:496
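
One possible reading of the reports above, sketched under assumptions: several of the arr values end with a multi-byte lead byte (for example \xf0 or \xe0) in one of the last two positions of the 10-byte buffer. A decoder that takes the sequence length from the lead byte alone would then need bytes beyond the end of the buffer. The helpers below are hypothetical and are not taken from utf8.h. They only compute how far such a walk would overrun, and they never read outside the buffer themselves. Whether utf8codepoint and utf8ncasecmp behave this way is an assumption I have not checked against the library source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of utf8.h): number of bytes a UTF-8
 * sequence claims to occupy, judged only from its lead byte. */
static size_t claimed_length(unsigned char lead)
{
        if ((lead & 0xf8) == 0xf0)
                return 4; /* 11110xxx */
        if ((lead & 0xf0) == 0xe0)
                return 3; /* 1110xxxx */
        if ((lead & 0xe0) == 0xc0)
                return 2; /* 110xxxxx */
        return 1;         /* ASCII, or any other byte treated as one unit */
}

/* Hypothetical helper: walk buf the way a lead-byte-driven decoder would,
 * stopping at a NUL lead byte, and return how many bytes beyond
 * buf[size - 1] the walk would need. Only buf[pos] with pos < size is read. */
static size_t bytes_past_end(const unsigned char *buf, size_t size)
{
        size_t pos = 0;
        size_t overrun = 0;

        assert(buf != NULL);
        while (pos < size && buf[pos] != '\0') {
                size_t len = claimed_length(buf[pos]);
                if (pos + len > size)
                        overrun = pos + len - size;
                pos += len;
        }
        return overrun;
}
```

Called on the 10 bytes of the Bug 1 arr value, bytes_past_end returns 2. On the Bug 2 arr value it returns 1. Under the assumption stated above, that is the number of bytes past the buffer that would have to be read for each input.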

sheredom / utf8.h

Some minor overflow bugs #70
