openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License
4.04k stars 417 forks source link

Memory leak / double free in address_parse_normalize_string #160

Closed brianmacy closed 7 years ago

brianmacy commented 7 years ago

==26728== 1,077 bytes in 32 blocks are definitely lost in loss record 1,344 of 2,063 ==26728==    at 0x4C2BB4A: realloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so) ==26728==    by 0x4FCCE938: utf8proc_map (utf8proc.c:610) ==26728==    by 0x4FCD35F1: normalize_string_utf8 (normalize.c:35) ==26728==    by 0x4FCD366E: normalize_string_latin (normalize.c:57) ==26728==    by 0x4FCDAF1C: address_parser_normalize_string (address_parser.c:153) ==26728==    by 0x4FCDAF1C: address_parser_parse (address_parser.c:812) ==26728==    by 0x4FCB6D8B: parse_address (libpostal.c:1043)

utf8proc_map uses realloc and parse_address has no way to deal with the pointer changing and the original being freed. If the pointer does change then parse_address has freed the address passed in without the client knowing and the new pointer is never freed.

Below is a potential fix.

diff --git a/src/address_parser.c b/src/address_parser.c index 187bada..253b182 100644 --- a/src/address_parser.c +++ b/src/address_parser.c @@ -809,10 +809,12 @@ address_parser_response_t address_parser_parse(char address, char *language, c return NULL; }

albarrentine commented 7 years ago

Thanks for reporting. This version frees up the memory without making an additional copy: https://github.com/openvenues/libpostal/commit/38c6c26146762f964d06ab15667ab50c2dcbbf90.

The use of realloc in utf8proc_map itself is not a problem per se because the function takes a double-pointer utf8proc_uint8_t **dstptr as an argument, so it can reset the pointer that was passed in if say the caller preallocated some memory and the function needed more. When we call it in libpostal, the initial pointer is NULL (realloc on NULL behaves like malloc), so it's always either going to return NULL or a pointer to a new string.

brianmacy commented 7 years ago

My first patch was similar in logic. Here is the issue I believe you still have... normalize_string_latin returns a non-NULL pointer in either of 2 cases:

In the case of have_utf8proc_options, it does indeed return the realloc'd pointer which is changed or not. So in one case it returns the pointer to the original address owned by the caller of parse_address. In the other case, it freed that pointer and returns a new one.

I believe the code you added will still allow utf8proc_map to free the client's buffer... potentially causing heap corruption. If it doesn't do that then it is still possible that pointer you are deleting is the client's buffer.

Or there is a third option, that I'm missing something but valgrind seems to agree.

brianmacy commented 7 years ago

Ah... nvm. I understand now. So the issue still occurs with this option. The code modifies the passed in string (the client buffer) and then returns a pointer to it. That is the pointer I believe you are deleting.

if (options & NORMALIZE_STRING_REPLACE_HYPHENS) {
    string_replace(str, '-', ' ');
    normalized = str;
}
brianmacy commented 7 years ago

I didn't want to do it this way but it may be the best option...

diff --git a/src/address_parser.c b/src/address_parser.c index 187bada..b2e836a 100644 --- a/src/address_parser.c +++ b/src/address_parser.c @@ -895,10 +895,15 @@ address_parser_response_t address_parser_parse(char address, char *language, c

         token_array_destroy(tokens);
         tokenized_string_destroy(tokenized_str);
albarrentine commented 7 years ago

With the current implementation of string_trim, that would free the caller's buffer if there are leading spaces in the input (thinking more of the expand_address calls, as the options used by the parser are more limited). The string_trim implementation, as well as that of string_replace (also currently modifies the pointer in-place, although it does not change the pointer's location) make copies in the upcoming parser-data release, and as such the functions in normalize.c clean up after all of their potential internal copies.

To reiterate re: utf8proc_map, it does not write to the input buffer. The pointer it's realloc-ing is not str (which is only read from), but uint8_t *utf8proc_normalized which is initialized as NULL and passed in as a double pointer so utf8proc_map can set/reset what it points to (if the user passed in a pre-allocated, non-NULL buffer, which we do not, and as noted before realloc-ing NULL is essentially just malloc).

Also, note that str in normalize_string_utf8 is not a double-pointer, so the effect of setting it is limited to the function's scope. The goal here is to chain several function calls depending on the options passed in, so we set str to the output of the last function for the next function to work on.

In my tests the pushed commits fixed all valgrind errors in the command-line client for a variety of inputs. Is this hypotethical or is there some input string that's causing this error for you?

brianmacy commented 7 years ago

Ok. Great. I believe you do need to add a free before the return in the nested if right above it too.

On Feb 9, 2017, 11:59 PM -0500, Al Barrentine notifications@github.com, wrote:

With the current implementation of string_trim, that would free the caller's buffer if there are leading spaces in the input (thinking more of the expand_address calls, as the options used by the parserhttps://github.com/openvenues/libpostal/blob/ea168279bd5a0040e10fcbe3bf854d9a113690e7/src/address_parser.h#L61 are more limited). The string_trim implementation, as well as that of string_replace (also currently modifies the pointer in-place, although it does not change the pointer's location) make copies in the upcoming parser-datahttps://github.com/openvenues/libpostal/blob/parser-data/src/string_utils.c release, and as such the functions in normalize.chttps://github.com/openvenues/libpostal/blob/parser-data/src/normalize.c clean up after all of their potential internal copies.

To reiterate re: utf8proc_map, it does not write to the input buffer. The pointer it's realloc-ing is not str (which is only read from), but uint8_t *utf8proc_normalized which is initialized as NULL and passed in as a double pointer so utf8proc_map can set/reset what it points to (if the user passed in a pre-allocated, non-NULL buffer, which we do not, and as noted before realloc-ing NULL is essentially just malloc).

Also, note that str in normalize_string_utf8 is not a double-pointer, so the effect of setting it is limited to the function's scope. The goal here is to chain several function calls depending on the options passed in, so we set str to the output of the last function for the next function to work on.

In my tests the pushed commits fixed all valgrind errors in the command-line client for a variety of inputs. Is this hypotethical or is there some input string that's causing this error for you?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/openvenues/libpostal/issues/160#issuecomment-278858108, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AXzs1LStmEx-ieQYehLfoWdT6cVZYQv_ks5ra-6ugaJpZM4L8KjA.

albarrentine commented 7 years ago

Only need to free memory if more than one allocation takes place internally. If the options were simply NORMALIZE_STRING_TRIM, only string_trim gets called and allocates memory, so only one copy would be made in normalize_string_utf8, and that gets returned to be freed by the caller.