wadagso-gertjaap closed 1 year ago
I see it: memcpy needs to use the old allocation's size. Otherwise it can read past the end of the old block into memory not belonging to the process and cause a segfault. (This was technically still possible in the previous implementation, but only in cases where realloc did not return aligned memory. Many modern systems that people use may return aligned memory from realloc; it's just not guaranteed by the standard.) Made a slight change to the internals: we now define a new function for resizing aligned memory instead of trying to mimic Windows' _aligned_realloc (which, like realloc, does not take the old size). The new function takes the size of the old memory, which the caller always knows, since these are vectors/matrices of known dimension. Alternatively, it could be implemented with malloc_size (Mac) and malloc_usable_size (Linux), but those are non-standard and I'm not sure they work on every platform people are using, so it's simpler to pass the size explicitly; on Windows it can just pass through to _aligned_realloc.
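A minimal sketch of the approach described above, assuming a POSIX system where posix_memalign is available. The name aligned_resize_sketch and its signature are hypothetical and are not taken from libpostal's source; the point is only that the copy is bounded by the old size the caller passes in.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not libpostal's actual function): resize an aligned
 * block when the caller knows how large the old block was.
 * Returns NULL on failure and leaves the old block untouched, like realloc. */
static void *aligned_resize_sketch(void *ptr, size_t old_size,
                                   size_t new_size, size_t alignment) {
    void *new_ptr = NULL;

    /* posix_memalign needs a power-of-two alignment that is also a
     * multiple of sizeof(void *). */
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    assert(alignment % sizeof(void *) == 0);

    if (new_size == 0) {
        free(ptr);
        return NULL;
    }

    if (posix_memalign(&new_ptr, alignment, new_size) != 0) {
        return NULL;
    }

    if (ptr != NULL) {
        /* Copy only what the old block actually holds. Copying new_size
         * bytes when growing would read past the end of the old block. */
        size_t copy_size = old_size < new_size ? old_size : new_size;
        memcpy(new_ptr, ptr, copy_size);
        free(ptr);
    }
    return new_ptr;
}
```

A caller growing a vector of doubles from n to m elements would pass n * sizeof(double) as old_size and m * sizeof(double) as new_size. On Windows the same interface could ignore old_size and forward to _aligned_realloc.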
Can you try with this branch? https://github.com/openvenues/libpostal/tree/fix_aligned_resize
Tests pass but need to run it against longer sequences with SSE turned on to see if the issue remains.
@wadagso-gertjaap reopen if issue persists but that should fix it
Hi!
I was checking out libpostal, and saw something that could be improved.
My country is
Netherlands
Here's how I'm using libpostal
Using libpostal through a rust binding to sanitize addresses being ingested from various data sources.
Here's what I did
Ran my import. The data source being ingested when the segmentation fault happens is the Golden Copy of the LEI dataset from gleif.org.
Here's what I got
Segmentation fault happened while running the software using libpostal built from current master. The segmentation fault happens at random points during the ingestion, between 100k and 600k rows into the file. When logging the address being parsed, it's a different address every time, with no obvious similarity between the addresses being parsed at the time of the segfault.
Ran it using gdb to get the backtrace. The segmentation fault points to a recently changed portion of code:
https://github.com/openvenues/libpostal/commit/7bdcf96c9d9c61811ffd4570ba9fbbac5ffd237f#diff-f1eba6039f610bc1556081d2a021f23b672a7648d402f2409b88d06c999d3cd2R34
When I revert to commit dc794b1b644269adee61402f713a5aea4d6a1584 and rebuild libpostal, this segmentation fault does not happen.
Here's what I was expecting
No segfault
For parsing issues, please answer "yes" or "no" to all that apply.
N/A
Here's what I think could be improved
Seems that the memalign fix introduced some regression - needs a further look.