openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License
4.09k stars 421 forks source link

avoid UB in bit shifts #630

Closed motiejus closed 1 year ago

motiejus commented 1 year ago

unsigned char* gets promoted to int, which cannot always be shifted by 24 bits.

Justine Tunney blogs about it here: https://justine.lol/endian.html

Example:

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

uint32_t file_deserialize_uint32_ok(unsigned char *buf) {
    return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) | ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
}

uint32_t file_deserialize_uint32(unsigned char *buf) {
    return (buf[0] << 24) | (buf[1] << 16) | (buf[2] << 8) | buf[3];
}

int main() {
    unsigned char arr[4] = {0xaa, 0xaa, 0xaa, 0xaa};

    printf("%d\n", file_deserialize_uint32_ok((unsigned char*)arr));
    printf("%d\n", file_deserialize_uint32((unsigned char*)arr));
}

Output:

$ clang-16 -fsanitize=undefined ./deserialize.c -o deserialize && ./deserialize
-1431655766
deserialize.c:10:20: runtime error: left shift of 170 by 24 places cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior deserialize.c:10:20 in 
-1431655766