pelias / libpostal-service

Dockerfile for libpostal-service based on the Who's on First implementation
MIT License

Segmentation violations with Ubuntu version #9

Open · pielambr opened this issue 1 year ago

pielambr commented 1 year ago

Describe the bug

Since updating to the latest version, which includes the Ubuntu version bump, we have repeatedly been getting segmentation violations, roughly one every 6 to 10 minutes.

[screenshot omitted]

This goes away when reverting to an earlier version.

Steps to Reproduce

Expected behavior

No segmentation violations

Environment (please complete the following information):

The container is running inside a Kubernetes cluster on Google Cloud.

Pastebin/Screenshots

[screenshot omitted]

Additional context

References

orangejulius commented 1 year ago

Hi @pielambr, we've noticed this as well. I wouldn't be surprised if it's related to https://github.com/pelias/docker-libpostal_baseimage/pull/12. There have been some changes lately in libpostal that cause segfaults. While some may be fixed, it's very likely that some issues still remain.

We can try reverting to an older commit of libpostal again; stay tuned.

missinglink commented 1 year ago

I think we might be able to move back up to HEAD since https://github.com/openvenues/libpostal/pull/632 was merged. I'll try building and releasing a new docker image tomorrow.

missinglink commented 1 year ago

@pielambr do you have an example query that caused the fault, which I can use to confirm the fix?

pielambr commented 1 year ago

@missinglink I'm afraid not; we just observed in production that the pod went down quite often, usually with larger paragraphs of text.

missinglink commented 1 year ago

It seems the latest docker image already includes code from the PR I linked above.

@pielambr can you please tell me which version of the docker image you are running?

docker images
REPOSITORY                 TAG       IMAGE ID       CREATED      SIZE
pelias/libpostal-service   latest    846cd5bdb6db   9 days ago   2.3GB
missinglink commented 1 year ago

Could you please add some instrumentation to capture the query causing the segfault if possible?

From what I'm seeing here, it's difficult to resolve this issue without knowing which versions and which queries are causing it.
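
For example, a minimal client-side wrapper along these lines could capture the offending input (a hypothetical sketch; it assumes the service exposes GET /parse?address=... on port 4400, so adjust to match your deployment):

#!/bin/bash
# Hypothetical wrapper: log each query before sending it, so the last entry
# in queries.log identifies the input that crashed the service.
query="$1"
printf '%s %s\n' "$(date -Is)" "${query}" >> queries.log
curl --silent --get 'http://localhost:4400/parse' --data-urlencode "address=${query}"

The last line written to queries.log before a pod restart is then the prime suspect.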

missinglink commented 1 year ago

After some trial and error I was able to get 846cd5bdb6db to segfault by increasing the input query length. This is the query that finally caused it to fail on my machine:

30 w 26th st, new york, ny,30 w 26th st, new york, ny,30 w 26th st, new york, ny,30 w 26th st, new york, ny
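
For anyone wanting to try this themselves, a rough reproduction sketch (the /parse endpoint is an assumption; port 4400 matches the docker run example further down this thread):

# Reproduction sketch: hammer a local container with the long query above.
query='30 w 26th st, new york, ny,30 w 26th st, new york, ny,30 w 26th st, new york, ny,30 w 26th st, new york, ny'
docker run -d --name libpostal-repro -p 4400:4400 pelias/libpostal-service:latest
sleep 30  # give the service time to load the libpostal data files
for i in $(seq 1 100); do
  curl --silent --get 'http://localhost:4400/parse' --data-urlencode "address=${query}" > /dev/null \
    || { echo "request ${i} failed; the container has likely crashed"; break; }
done
docker logs libpostal-repro 2>&1 | tail -n 5  # look for the segfault message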

missinglink commented 1 year ago

What I'll do is revert to the last known stable version and write up an issue on the libpostal repo to make them aware. It seems to affect HEAD, so maybe a regression was introduced.

missinglink commented 1 year ago

Okay, bad news: I rebuilt this image pinned to an older version of our libpostal baseimage, and I was still able to trigger the segfault by sending 5 to 10 long, ugly queries like the one above.

diff --git a/Dockerfile b/Dockerfile
index c91a18c..5c85161 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -1,5 +1,5 @@
 # build the libpostal-server binary separately
-FROM pelias/libpostal_baseimage as builder
+FROM pelias/libpostal_baseimage:pin-to-version-that-builds-2023-07-04-5f89119a11fbcce5df475eba9a3f337181d2d8ad as builder

 RUN apt-get update && apt-get install -y make pkg-config build-essential
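
To reproduce that experiment locally, something like this should do it (a sketch; the pinned tag is copied straight from the diff above):

git clone https://github.com/pelias/libpostal-service.git
cd libpostal-service
# apply the one-line FROM change from the diff above, then:
docker build -t libpostal-service:pinned-baseimage .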
missinglink commented 1 year ago

It's not clear exactly when the regression was introduced, but I checked out an old version from 2021-11-03 and it isn't affected, so that can provide a bookend for the bisect.

I don't have loads more time to spend on this today, but if someone could provide more information about which versions between master-2021-11-03-aaf0586c78acd54e4586d84e6257c56b9db99f3e and master-2023-07-23-c289dda8d47cb6d21b2a1aa74e68cb5e9d12a872 work or don't work, that would be super useful for getting this resolved 🙏

docker run -d -p 4400:4400 pelias/libpostal-service:master-2021-11-03-aaf0586c78acd54e4586d84e6257c56b9db99f3e
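
A bisect over the published tags could then look something like the sketch below; only the two bookend tags come from this thread, so fill in intermediate tags from Docker Hub:

# Bisect sketch: run each candidate tag, hammer it with the known-bad query,
# and report whether the container survived.
bad='30 w 26th st, new york, ny,30 w 26th st, new york, ny,30 w 26th st, new york, ny,30 w 26th st, new york, ny'
for tag in \
  master-2021-11-03-aaf0586c78acd54e4586d84e6257c56b9db99f3e \
  master-2023-07-23-c289dda8d47cb6d21b2a1aa74e68cb5e9d12a872; do
  docker run -d --rm --name bisect -p 4400:4400 "pelias/libpostal-service:${tag}"
  sleep 30  # allow the libpostal data files to load
  for i in $(seq 1 20); do
    curl --silent --get 'http://localhost:4400/parse' --data-urlencode "address=${bad}" > /dev/null
  done
  # with --rm, a crashed container is auto-removed, so a failed inspect means it crashed
  if docker inspect bisect > /dev/null 2>&1; then
    echo "${tag}: survived" && docker rm -f bisect > /dev/null
  else
    echo "${tag}: crashed"
  fi
done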
missinglink commented 1 year ago

In fact, there have been fairly few releases since 2021 due to limited activity on the upstream repos:

[screenshot: release history since 2021]
pielambr commented 1 year ago

If I have some spare time I'll have a look at which version introduced it for us, but that might be a while. We have currently reverted all the way to version ca4ffcc just to be safe, because this was blocking production.

mreid-exiger commented 1 year ago

Hi folks, I am seeing this issue as well. I've tried the following images:

master-2023-07-23-c289dda8d47cb6d21b2a1aa74e68cb5e9d12a872 <- crash
master-2023-07-16-d6483672db70596a2ee0d97782567b12917c6ae6 <- crash
master-2023-07-04-b02f6f14cfe2dbf2dfee9e458a372f0aca13caa4 <- no crash
master-2021-11-03-aaf0586c78acd54e4586d84e6257c56b9db99f3e <- no crash

I haven't done a huge amount of testing, but the crash is pretty easy to reproduce, occurring after roughly 500 requests. The 2023-07-04 image appears to be the latest one that holds up for thousands of requests in my environment (Kubernetes with a 4Gi memory limit).

missinglink commented 1 year ago

Thanks for the continued reports; they are helpful for discovering which versions are affected.

These memory issues are being discussed over on the main libpostal issue tracker and we hope to adopt the patches as soon as they are available.

We would be happy to accept code in this repo that could reliably crash the CI (and therefore prevent docker images from being created), so that no new releases are generated until this is fixed upstream.
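
As a sketch of what that could look like (untested; the endpoint, port, and the 500-request threshold taken from the report above are assumptions, not an existing script in this repo):

#!/bin/bash
# Hypothetical CI smoke test: exits non-zero (failing the build and blocking
# the image release) if the service dies under the known-bad long query.
set -u
bad='30 w 26th st, new york, ny,30 w 26th st, new york, ny,30 w 26th st, new york, ny,30 w 26th st, new york, ny'
for i in $(seq 1 500); do
  curl --silent --fail --get 'http://localhost:4400/parse' --data-urlencode "address=${bad}" > /dev/null \
    || { echo "service died after ${i} requests" >&2; exit 1; }
done
echo "service survived 500 requests"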

missinglink commented 1 year ago

@mreid-exiger are you sure that https://hub.docker.com/layers/pelias/libpostal-service/master-2021-11-03-aaf0586c78acd54e4586d84e6257c56b9db99f3e/images/sha256-ca4ffcc5e2f8d415b6d72cce6c32ebfcf7142648b337e8c68a13f2827708e62a?context=repo is crashing?

mreid-exiger commented 1 year ago

@missinglink perhaps my message was formatted a little confusingly. The crashing images that I've tested are:

master-2023-07-23-c289dda8d47cb6d21b2a1aa74e68cb5e9d12a872
master-2023-07-16-d6483672db70596a2ee0d97782567b12917c6ae6

The images I've tested that appear stable are:

master-2023-07-04-b02f6f14cfe2dbf2dfee9e458a372f0aca13caa4
master-2021-11-03-aaf0586c78acd54e4586d84e6257c56b9db99f3e

missinglink commented 1 year ago

Got it thanks 👍