openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License
4.04k stars 417 forks source link

Memory leak in javacpp-presets-libpostal libpostal_expand_address method #641

Open orividrisk opened 1 year ago

orividrisk commented 1 year ago

Hi, When using the javacpp-presets-libpostal libpostal 1.1-1.5.8 I experience a memory leak in my java process. My code is iterating on a large number of addresses, and over time the java process memory is increasing until it crashes due to OOM. I have been monitored the jvm heap and non heap memory metrics and there is no issue there. This leads me to the conclusion that the issue is with the native allocation of the returned address response, which is not deallocating the parsed addresses. I created an issue in the javacpp-presets repo (mentioned in the libpostal readme), where they replied that it could be a potential libpostal c library memory leak: https://github.com/bytedeco/javacpp-presets/issues/1407

The following is my code (I am using the default options provided by the library):

try (PointerScope pointerScope = new PointerScope();
             libpostal_address_parser_response_t response = libpostal_parse_address(addressStr, options)) {

            long count = response.num_components();
            Map<String, String> fields = new HashMap<>();

            for (int i = 0; i < count; i++) {
                fields.put(response.labels(i).getString(), response.components(i).getString());
            }

            libpostal_address_parser_response_destroy(response);
            return fields;
 }

image

If any more information is needed I would be happy to provide it. I would very appreciate your prompt assistance with the issue as it is having a big impact on our production env.

Thank you and have a very nice day.

orividrisk commented 1 year ago

@saudet FYI

albarrentine commented 1 year ago

Ran a few parser tests under valgrind with no memory leaks found on this side.

Can you write the above test program in C, compile it, and run under valgrind --leak-check=full to see if any leaks can be reproduced without roping in any of the JNI bindings?

Not clear on how javacpp-presets interacts with the C functions (it does not use the official jpostal/JNI bindings, instead generating its own automatic JNI bindings from the C API), and whether it's using latest or a tag. If it's not using the latest version of libpostal, would recommend trying that as there were some SSE fixes recently, but none of those were related to a memory leak.

Also noting that libpostal is not thread-safe so should only be used from a single thread.

orividrisk commented 1 year ago

@albarrentine thank you for the quick response. My C is quite rusty so I don't think I will be able to provide the C equivalent of my test. I wrote a very simple program using the javacpp-presets presets, which easily reproduces the issue. This program upon exiting, produces a leak reports using the macOS leaks util( as unfortunately valgrind is no longer supported on more advanced versions of macos).

The program can be found in the following github repo , with instruction in the ReadMe file of how to run it.

In the memory leak report I found these two reoccurring stack trace leaks :

  1. Leak: 0x7fcb71281b90 size=16 zone: MallocStackLoggingLiteZone_0x104f34000 malloc in cstring_array_to_strings C libpostal.1.dylib length: 4 "unit" Call stack: 0x11627653a (???) ??? | 0x104e9130c (libjnipostal.dylib) Java_org_bytedeco_libpostal_global_postal_libpostal_1parse_1address__Lorg_bytedeco_javacpp_BytePointer_2Lorg_bytedeco_libpostal_libpostal_1address_1parser_1options_1t_2 | 0x12f57815c (libpostal.1.dylib) libpostal_parse_address | 0x12f59be9b (libpostal.1.dylib) address_parser_parse | 0x12f5904c7 (libpostal.1.dylib) cstring_array_to_strings | 0x7ff816f73a43 (libsystem_c.dylib) strdup | 0x7ff816f19149 (libsystem_malloc.dylib) _malloc_zone_malloc_instrumented_or_legacy

  2. Leak: 0x7fcb71284780 size=32 zone: MallocStackLoggingLiteZone_0x104f34000 malloc in address_parser_parse C libpostal.1.dylib Call stack: 0x7ff8170c6bd3 (libsystem_pthread.dylib) thread_start | 0x7ff8170cb1d3 (libsystem_pthread.dylib) _pthread_start | 0x104e21529 (libjli.dylib) ThreadJavaMain | 0x104e1ec4f (libjli.dylib) JavaMain | 0x1066742d7 (libjvm.dylib) jni_CallStaticVoidMethod | 0x106670e58 (libjvm.dylib) jni_invoke_static(JNIEnv_*, JavaValue*, _jobject*, JNICallType, _jmethodID*, JNI_ArgumentPusher*, JavaThread*) | 0x10660cb26 (libjvm.dylib) JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*) | 0x116269cc9 (???) ??? | 0x10ef080dc (???) ??? | 0x1168444a0 (???) ??? | 0x104e9130c (libjnipostal.dylib) Java_org_bytedeco_libpostal_global_postal_libpostal_1parse_1address__Lorg_bytedeco_javacpp_BytePointer_2Lorg_bytedeco_libpostal_libpostal_1address_1parser_1options_1t_2 | 0x12f57815c (libpostal.1.dylib) libpostal_parse_address | 0x12f59bd25 (libpostal.1.dylib) address_parser_parse | 0x7ff816f19149 (libsystem_malloc.dylib) _malloc_zone_malloc_instrumented_or_legacy

it looks like in both cases it is related to the address_parser_parse method in the libpostal library.

@saudet for clarification could you please shed some light as to how the JNI binding works including the used libostal version, and if the stack trace above could be related to this memory leak issue?

In addition I attached bellow a java NTM report which shows that the memory leak is not relevant to the JVM managed memory space, and also the report provided by the macOS leaks util

java NMT report.txt

leak report.txt

Please let me know if you need any more information.

saudet commented 1 year ago

@orividrisk Your small example is pretty big. If you're serious about debugging this, reduce it down to only a few lines.

albarrentine commented 1 year ago

@orividrisk https://github.com/orividrisk/libpostal-poc/blob/bd583351594b729dfa2bb89c6fac8f4ffb67c37d/src/main/java/com/example/test/Address.java#L66 you’re not calling libpostal_address_parser_response_destroy to free the memory allocated in that line, which explains both 1 and 2.

orividrisk commented 1 year ago

@albarrentine adding libpostal_address_parser_response_destroy did the trick. the memory leak related to libpostal_parse_address was resolved. thank you!

In our production env we are using both address normalization & parsing which both presented with memory leak issue. In order to isolate the memory issue I started with address parsing. now that it was resolved I moved on to normalization which also present with memory leak issue.

The following is a call stack example of a leak for the macOS leaks util:

Leak: 0x60000236a580 size=64 zone: DefaultMallocZone_0x107ab9000 realloc in utf8_case C libpostal.1.dylib length: 51 "1142 bellview dr wisconsin rapids wi 54494-9293 usa" Call stack: 0x7ff8170c6bd3 (libsystem_pthread.dylib) thread_start | 0x7ff8170cb1d3 (libsystem_pthread.dylib) _pthread_start | 0x107adc529 (libjli.dylib) ThreadJavaMain | 0x107ad9c4f (libjli.dylib) JavaMain | 0x1093112d7 (libjvm.dylib) jni_CallStaticVoidMethod | 0x10930de58 (libjvm.dylib) jni_invoke_static(JNIEnv_*, JavaValue*, _jobject*, JNICallType, _jmethodID*, JNI_ArgumentPusher*, JavaThread*) | 0x1092a9b26 (libjvm.dylib) JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*) | 0x118f06cc9 (???) ??? | 0x111ac4ae4 (???) ??? | 0x1194c628f (???) ??? | 0x107b5292c (libjnipostal.dylib) Java_org_bytedeco_libpostal_global_postal_libpostal_1normalize_1string__Lorg_bytedeco_javacpp_BytePointer_2J | 0x132592d59 (libpostal.1.dylib) normalize_string_latin_languages | 0x132592b4e (libpostal.1.dylib) normalize_string_utf8_languages | 0x13258cd30 (libpostal.1.dylib) utf8_case | 0x7ff816efcde6 (libsystem_malloc.dylib) realloc

For the normalized address I didn't see a method to "destroy" the returned normalized string.

I updated the example github repo to now perform both parsing and normalization. @saudet please let me know what you think ( I updated the example to be as concise as possible)

The README instructions in the repo can be used to reproduce the issue, with the detailed leaks instances report printed out to console after the process concludes it's run.

normalized address leak report.txt

Please let me know if you need any more information.

albarrentine commented 1 year ago

@orividrisk there's a corresponding function libpostal_expansion_array_destroy(expansions, num_expansions) which cleans up memory for that response. This is used in the expander cli program which builds with libpostal. I'm not seeing any memory leaks for that input either.