openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License
4.08k stars 421 forks source link

Segmentation Fault #639

Closed wadagso-gertjaap closed 1 year ago

wadagso-gertjaap commented 1 year ago

Hi!

I was checking out libpostal, and saw something that could be improved.


My country is

Netherlands


Here's how I'm using libpostal

Using libpostal through a rust binding to sanitize addresses being ingested from various data sources.


Here's what I did

Ran my import, the data source being ingested while the segmentation fault is happening, is the golden copy of the LEI dataset from gleif.org.


Here's what I got

Segmentation fault happened while running the software using the libpostal built from current master. The segmentation fault happens at random points during the ingestion between 100k and 600k rows into the file. When logging the address being parsed, it's different addresses all the time, no obvious similarity between the addresses being parsed at the time of the segfault.

Ran it using gdb to get the backtrace:

Thread 1 "<binary>" received signal SIGSEGV, Segmentation fault.
__memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:513
513     ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
(gdb) bt
#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:513
#1  0x00007ffff765fa9c in memcpy (__len=35200, __src=0x5555727ac480, __dest=0x55557279e300) at /usr/include/x86_64-linux-gnu/bits/string_fortified.h:29
#2  _aligned_realloc (alignment=32, size=35200, p=0x5555727ac480) at /home/admin/libpostal/src/vector.h:34
#3  double_matrix_resize_aligned (alignment=32, n=400, m=11, self=0x55555c8bece0) at /home/admin/libpostal/src/matrix.h:447
#4  crf_context_set_num_items (self=self@entry=0x555555e6c340, T=T@entry=11) at crf_context.c:140
#5  0x00007ffff765e49a in crf_tagger_score (print_features=false, tokenized=0x5555727821c0, feature_function=0x7ffff765c140 <address_parser_features>, prev_tag_features=0x55555c8d0180, features=0x555559dde520, tagger_context=0x555559ddde60, 
    tagger=0x555559ddd590, self=0x555555e6c9f0) at crf.c:22
#6  crf_tagger_score (self=0x555555e6c9f0, tagger=0x555559ddd590, tagger_context=0x555559ddde60, features=0x555559dde520, prev_tag_features=0x55555c8d0180, feature_function=0x7ffff765c140 <address_parser_features>, tokenized=0x5555727821c0, 
    print_features=false) at crf.c:15
#7  0x00007ffff765e932 in crf_tagger_score_viterbi (self=0x555555e6c9f0, tagger=<optimized out>, tagger_context=<optimized out>, features=<optimized out>, prev_tag_features=<optimized out>, feature_function=<optimized out>, tokenized=0x5555727821c0, 
    score=0x7fffffff4960, print_features=false) at crf.c:101
#8  0x00007ffff765ea11 in crf_tagger_predict (self=0x555555e6c9f0, tagger=tagger@entry=0x555559ddd590, context=context@entry=0x555559ddde60, features=features@entry=0x555559dde520, prev_tag_features=<optimized out>, labels=labels@entry=0x555572642e50, 
    feature_function=0x7ffff765c140 <address_parser_features>, tokenized=0x5555727821c0, print_features=false) at crf.c:120
#9  0x00007ffff765c23a in address_parser_predict (self=self@entry=0x555559ddd590, context=context@entry=0x555559ddde60, token_labels=token_labels@entry=0x555572642e50, feature_function=feature_function@entry=0x7ffff765c140 <address_parser_features>, 
    tokenized_str=tokenized_str@entry=0x5555727821c0) at address_parser.c:1649
#10 0x00007ffff765c46c in address_parser_parse (address=<optimized out>, language=0x0, country=0x0) at address_parser.c:1785
#11 0x00007ffff763618b in libpostal_parse_address (address=<optimized out>, options=...) at libpostal.c:275
#12 0x0000555555711c74 in postal::Context::parse_address ()

The segmentation fault points to a recently changed portion of code:

https://github.com/openvenues/libpostal/commit/7bdcf96c9d9c61811ffd4570ba9fbbac5ffd237f#diff-f1eba6039f610bc1556081d2a021f23b672a7648d402f2409b88d06c999d3cd2R34

When I revert to commit dc794b1b644269adee61402f713a5aea4d6a1584 and rebuild libpostal, this segmentation fault does not happen.


Here's what I was expecting

No segfault


For parsing issues, please answer "yes" or "no" to all that apply.

N/A


Here's what I think could be improved

Seems that the memalign fix introduced some regression - needs a further look.

albarrentine commented 1 year ago

I see it, memcpy needs to be using the old pointer size, otherwise it can read from memory not belonging to the process and cause a segfault (this was technically still possible in the previous implementation but would only occur in cases where realloc did not return aligned memory, which many modern systems that people use may implement, it's just not guaranteed by the standard). Made a slight change to the internals where we just define a new function for resizing aligned memory instead of trying to mimic Windows' _aligned_realloc (which is the same API as realloc). The new function just takes the size of the old memory, which the caller always knows in all cases since these are vectors/matrices of known dimension. Alternatively, it could also be implemented with malloc_size (Mac) and malloc_usable_size (Linux) but those are non-standard and not sure if they work on absolutely every platform people are using out there, so simpler to just pass explicitly and on Windows can just pass through to _aligned_realloc.

Can you try with this branch? https://github.com/openvenues/libpostal/tree/fix_aligned_resize

Tests pass but need to run it against longer sequences with SSE turned on to see if the issue remains.

albarrentine commented 1 year ago

@wadagso-gertjaap reopen if issue persists but that should fix it