gis-ops / docker-valhalla

This is our flexible Docker repository for the Valhalla routing engine
MIT License

Valhalla memory leak #58

Closed: elliveny closed this issue 1 year ago

elliveny commented 2 years ago

I'm seeing what appears to be a memory leak whilst running docker-valhalla under the AWS Elastic Container Service, serving regular map match (trace_attributes) requests. Upping the available memory only delays the eventual teardown of the container task once memory is exhausted (I tried 2GB and 4GB). I'm assuming others have observed the same, so I'm wondering if anyone can help me overcome this issue, please.

[image: memory utilisation trace for the ECS task]

TimMcCauley commented 2 years ago

Hi @elliveny - thanks for reporting this. I assume this is a Valhalla issue directly and not necessarily something to do with running it within docker. What do you think @nilsnolde ?

elliveny commented 2 years ago

Thanks and hello @TimMcCauley. I'd be keen to understand if this is something others have observed. I'm actively making config changes to valhalla.json in an attempt to overcome it, but making little progress just yet. I'm happy to post an issue against valhalla itself but given the container is defining the runtime environment I figured it was better to raise it here first.

nilsnolde commented 2 years ago

What’s the area you’re loading? 2/4 GB is pretty low. I doubt there’s memory leaking, it’s just Valhalla’s architecture to keep loading new tiles into memory as they’re needed. Please have a look at the upstream repo’s config generator (it’s in ./scripts/valhalla_build_config.py) for options to control memory usage.
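
As a rough sketch, something like this to cap the tile cache (the flag names are auto-generated by the script, so double-check them with --help before relying on them):

# cap the in-memory tile cache (bytes) and point at an existing tile setup;
# flag names assumed, run the script with --help to confirm
python3 ./scripts/valhalla_build_config.py \
  --mjolnir-tile-dir /data/valhalla_tiles \
  --mjolnir-max-cache-size 1000000000 \
  > valhalla.json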

elliveny commented 2 years ago

Thanks @nilsnolde. Sorry, I should have said, I'm using data from https://download.geofabrik.de/europe/britain-and-ireland.html. I'll study the config generator to see if it sheds some light. Just considering your comment and trying to benefit from your experience, what amount of memory would you consider for the data I'm using?

nilsnolde commented 2 years ago

Sorry, almost no actual experience, we’re not hosting. For performance, at least the size of the tile tar as memory. Thinking about it further: pretty sure the root problem is this managed AWS service. I’ve seen ridiculous behavior before, e.g. Azure killing a container when it needs more than 30 mins to be ready. Likely there’s a memory setting in place where it kills the container above a certain threshold? Valhalla itself shouldn’t run down your machine to this point. It has a tile queue which will be flushed or at least trimmed if necessary. There are lots of options in the config generator, look at the help strings, and then at all the issues and discussions around the tile queue. Good luck!

TimMcCauley commented 2 years ago

Just one thought I want to leave here is to try the setup without docker on the same box and measure the metrics side by side. At least this way we could rule out some bits then...

Edit: Sorry, slightly blind...I mean test on an EC2 instance.

elliveny commented 2 years ago

I'll do that and will update with my findings in a few days. Thanks for your help so far.

elliveny commented 2 years ago

Further to my previous messages on this matter, over the past month or so we have:

Regardless of any of the above, we still see memory increasing continually until it is eventually exhausted. Here's the trend over the past few days:

[image: memory utilisation trend over the past few days]

We are now running Valhalla with a regular memory check. At 73%+ utilisation we trigger a manual restart of the Valhalla task, which brings up a new container instance, ensures it is healthy, then stops the old task. It works, but it isn't nice.
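
For anyone wanting to do something similar, it boils down to a check along these lines (illustrative only, with placeholder names, rather than our exact setup):

#!/bin/bash
# illustrative sketch: redeploy the ECS service once average memory
# utilisation goes above a threshold; cluster/service names are placeholders
THRESHOLD=73
CLUSTER=my-valhalla-cluster
SERVICE=my-valhalla-service

# average MemoryUtilization (%) over the last five minutes from CloudWatch
UTIL=$(aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS --metric-name MemoryUtilization \
  --dimensions Name=ClusterName,Value=$CLUSTER Name=ServiceName,Value=$SERVICE \
  --start-time "$(date -u -d '5 minutes ago' +%FT%TZ)" \
  --end-time "$(date -u +%FT%TZ)" \
  --period 300 --statistics Average \
  --query 'Datapoints[0].Average' --output text)

if (( $(echo "$UTIL > $THRESHOLD" | bc -l) )); then
  # with the default rolling deployment, ECS brings up a fresh task, waits for
  # it to be healthy, then drains the old one
  aws ecs update-service --cluster $CLUSTER --service $SERVICE --force-new-deployment
fi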

We've concluded there is definitely a memory leak happening within the Valhalla service process. I'm assuming I should move this issue over to https://github.com/valhalla/valhalla, but given I raised it here to begin with I thought I'd update it here first and ask whether you have any further suggestions which might help.

Thanks!

nilsnolde commented 2 years ago

right, I kinda forgot about this one already.. actually we're just having a bigger conversation around this, see https://github.com/gis-ops/docker-valhalla/issues/63 and the linked discussion in the valhalla upstream repo. it seems it's somehow our image here that prevents memory from being released when it really should be. apparently the upstream valhalla docker image (on which our image is based) doesn't have that problem. my conclusion then is that it's somehow rooted in how we run the docker-entrypoint script. I'm really frustrated by that tbh, it won't be easy to debug and it'd be much easier if it was a valhalla problem;) I also won't have the time or energy to dig in too much over the next weeks. we don't use this image ourselves for hosting, neither do any of our clients, so my interest in resolving this is more out of technical curiosity. I'd appreciate community input for sure.

elliveny commented 2 years ago

Thanks Nils. Given we've run Valhalla on EC2 and seen the same issue, I personally doubt the issue is tied to the docker entrypoint script. I have experimented, and we've customised the script somewhat as part of our investigations. I'll catch up fully with the issue you referenced, but it's notable that we no longer have the '#exec "$@"' line present in our configuration.

nilsnolde commented 2 years ago

ah, then I misread: you installed valhalla from source on that machine? can you provide some more details on your setup:

what looks strange from the memory graphs you posted: the steep memory increase just before it shuts down.. do you have an explanation for that? is that just an artifact of AWS or is it actually happening?

I didn't try it myself (yet) but it seemed the issue is not present for the OP in source valhalla and/or the upstream docker image. dave even ran valgrind memcheck on the matrix endpoint and (as far as I understand) it seemed fine.

elliveny commented 2 years ago

sudo -E valhalla_service ${CONFIG_FILE} ${server_threads} | grep "ERROR\|WARN" --line-buffered --colour=never

The sharp rise in memory at the end is very likely the new container task being started, so we're seeing the total of both tasks while one starts and then the other stops. I haven't proved that yet, but I believe that's what we're seeing (I'll do some study and report back).

nilsnolde commented 2 years ago

thanks for the details! nothing really stands out to me in terms of red flags. if you cover the entire region (over time) on both threads I'd expect max 4.5 GB of tile cache in total (in case you increased the defaults in the config). everything else should be "garbage-collected", at the latest with a new request on the same thread. I honestly have 0 idea what the problem might be..

if you collect some measurable stats from the source valhalla usage, that'd be most useful to know in the upstream repo, e.g. you could script something doing lots of requests in a short-ish time while monitoring the memory. important would be to control the requests (not completely random), e.g. take 100 random, longer routes and use only those routes/locations, but a few hundred/thousand times. RAM should increase until all necessary tiles are loaded and then flatten out. that'd be my expectation.
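
something like this would already do (a rough sketch; the requests file name, endpoint and process lookup are placeholders for whatever your setup uses):

#!/bin/bash
# replay a file of newline-separated JSON request bodies against a local
# valhalla_service while logging its resident memory over time
REQUESTS=requests.jsonl
ENDPOINT=http://localhost:8002/trace_attributes
PID=$(pgrep -f valhalla_service | head -n1)

while true; do
  while IFS= read -r body; do
    curl -s -S -o /dev/null -X POST "$ENDPOINT" \
      -H 'Content-Type: application/json' -d "$body"
  done < "$REQUESTS"
  # timestamp + resident set size (kB) of the service process after each pass
  echo "$(date -Is) rss_kb=$(ps -o rss= -p "$PID")" >> memory.log
done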

elliveny commented 2 years ago

Thank you for that. I'm thinking I'll endeavour to make time to set up and share an Ubuntu 20.04 EC2 image with Valhalla installed, along with a script which issues requests from the data I'm using (a long list of curl requests, I should think); that way I can properly demonstrate the issue. I don't have much experience of debugging/monitoring on Unix machines, but hopefully that might help someone who has? I welcome any suggestions regarding what might help most before I make a start on that.

nilsnolde commented 2 years ago

yeah that'd help tremendously! as for me, you could skip creating an instance. just having a text file with a long list of request parameters (ideally JSON encoded and new-line separated) would help a lot.

elliveny commented 2 years ago

Great ok. I'll see what I can do

elliveny commented 2 years ago

I'm working on a script in this form:

#!/bin/bash
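# $1 is the base URL of the Valhalla server, e.g. http://localhost:8002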
curl -X POST $1/trace_attributes -s -S -o /dev/null -H 'Content-Type: application/json' -d '{"shape":[{"lat":55.9218081,"lon":-3.3078672},{"lat":55.9219719,"lon":-3.3082254},{"lat":55.9221152,"lon":-3.3085263},{"lat":55.9223046,"lon":-3.3089716},{"lat":55.9224468,"lon":-3.3093197},{"lat":55.922539,"lon":-3.3095831},{"lat":55.9226142,"lon":-3.309803},{"lat":55.9226962,"lon":-3.31006},{"lat":55.9227795,"lon":-3.310377},{"lat":55.9228757,"lon":-3.3107633},{"lat":55.9229517,"lon":-3.3111168},{"lat":55.9230058,"lon":-3.3114333},{"lat":55.9230419,"lon":-3.3117176},{"lat":55.923084,"lon":-3.3121092},{"lat":55.923102,"lon":-3.3124311},{"lat":55.9231131,"lon":-3.3127106},{"lat":55.9231131,"lon":-3.3130539},{"lat":55.9231071,"lon":-3.3134348},{"lat":55.923081,"lon":-3.3138151},{"lat":55.923053,"lon":-3.3141429},{"lat":55.9230019,"lon":-3.3144916},{"lat":55.9229277,"lon":-3.3148933},{"lat":55.9228495,"lon":-3.315242},{"lat":55.9227622,"lon":-3.31558},{"lat":55.9226199,"lon":-3.3160218},{"lat":55.9225102,"lon":-3.3163625},{"lat":55.922396,"lon":-3.3166658},{"lat":55.922244,"lon":-3.3170697},{"lat":55.9210503,"lon":-3.3198608}], "filters": {"attributes": ["edge.way_id", "edge.speed", "edge.length", "matched.point", "matched.type", "matched.edge_index", "edge.id", "matched.distance_along_edge", "matched.point", "edge.begin_shape_index", "edge.end_shape_index", "shape", "matched.distance_from_trace_point"], "action": "include"}, "costing": "auto", "shape_match": "map_snap"}'
curl -X POST $1/trace_attributes -s -S -o /dev/null -H 'Content-Type: application/json' -d '{"shape":[{"lat":51.7555381,"lon":-0.2451234},{"lat":51.7559362,"lon":-0.2449958},{"lat":51.7563438,"lon":-0.2448184},{"lat":51.7567359,"lon":-0.2446096},{"lat":51.7571998,"lon":-0.2443428},{"lat":51.7578005,"lon":-0.2439743},{"lat":51.7582035,"lon":-0.2436988},{"lat":51.7584195,"lon":-0.2435278},{"lat":51.7585804,"lon":-0.2434004},{"lat":51.758878,"lon":-0.2431274},{"lat":51.759112,"lon":-0.2429068},{"lat":51.7592625,"lon":-0.2426901}], "filters": {"attributes": ["edge.way_id", "edge.speed", "edge.length", "matched.point", "matched.type", "matched.edge_index", "edge.id", "matched.distance_along_edge", "matched.point", "edge.begin_shape_index", "edge.end_shape_index", "shape", "matched.distance_from_trace_point"], "action": "include"}, "costing": "auto", "shape_match": "map_snap"}'
...

I'm prepping about 10000 unique requests like that. Does that look suitable?

TimMcCauley commented 2 years ago

@elliveny when you are running Valhalla inside docker are you limiting its memory? If not, what happens if you do?

elliveny commented 2 years ago

@TimMcCauley The container definition sets the available memory; I've not limited it in any other sense. Is that what you meant?

{
    "taskDefinitionArn": "arn:aws:ecs:eu-west-1:...",
    "containerDefinitions": [
        {
            "name": "web",
            "image": "1234567654321.dkr.ecr.eu-west-1.amazonaws.com/...",
            "cpu": 0,
            "links": [],
            "portMappings": [
                {
                    "containerPort": 8002,
                    "hostPort": 8002,
                    "protocol": "tcp"
                }
            ],
            ...
        }
    ],
    "family": "ValhallaECSValhallaAPITaskDefinition",
    ...
    "cpu": "1024",
    "memory": "8192",
    "tags": []
}

Originally I ran this with less memory (I increased it following the comment above) and the only difference I observed was that memory became exhausted more quickly. I haven't attempted that again. The original memory utilisation trace (see the first comment in this issue) shows less memory; in that case we allowed the valhalla_service process to crash, something which the memory check (mentioned above) now avoids.

elliveny commented 2 years ago

I'm still working on this and thought I'd provide an update; I've finished creating a test set of curl requests, eventually producing about 260000 of them. I wasn't able to use the same data as we use in production - due to data ownership and privacy concerns - so instead I wrote code which uses the Overpass data/service to render unique trace_attributes requests from motorway, primary, secondary and tertiary highway OSM ways.

I've set up a new EC2 Ubuntu 20.04 instance with 2 CPUs and 8GB memory, and have built Valhalla from source. I established a default config using valhalla_build_config, then built tiles from the latest britain-and-ireland-latest.osm.pbf. I've started the Valhalla service using both CPUs and am now running the test script on the same instance against the localhost endpoint. Memory utilisation is currently at 25% and climbing.
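
For reference, the build and run steps boil down to roughly this (the config flag name is approximate; check valhalla_build_config --help):

# generate a default config pointing at a tile directory
valhalla_build_config --mjolnir-tile-dir ${PWD}/valhalla_tiles > valhalla.json
# cut tiles from the Geofabrik extract
valhalla_build_tiles -c valhalla.json britain-and-ireland-latest.osm.pbf
# serve using both CPUs (the second argument is the thread count)
valhalla_service valhalla.json 2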

I'm expecting the Valhalla service to crash once memory becomes exhausted. I'll report back if/when that happens!

nilsnolde commented 2 years ago

great, thanks @elliveny. it'd be so strange if valhalla upstream had problems like that.

elliveny commented 2 years ago

Ok @nilsnolde and @TimMcCauley, I have a conclusion to my test. The Valhalla service on my EC2 instance has now crashed (at least, it has stopped and I didn't stop it). I'm not experienced with Linux tools which might help diagnose the issue, so my approach was simply to issue regular ps -aux commands. I caught this one a couple of minutes before the process stopped:

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
ubuntu      1089 16.3 99.9 15247492 13345116 pts/1 Sl 12:45  50:01 valhalla_service valhalla.json 2

Note the %MEM figure. This appears to show the process exhausted available memory.

I had set up the AWS CloudWatch agent to provide a trace of memory over time and it shows a curious picture, one which doesn't match my expectation given the 99.9% detailed above. I could contact AWS support about it, if that's best, but here's the trace up to the point of the crash:

[image: CloudWatch memory trace up to the point of the crash]

Given this conclusion I'm keen to share my test script and other details. The script is 268MB, so thoughts on the best means of sharing it would be appreciated. Perhaps I should set up a GitHub repo and upload the files there?

nilsnolde commented 2 years ago

thanks a ton @elliveny. that's a strange picture indeed.. the constant increase in RAM could still make sense if new tiles have to be loaded and you're using 2 threads, so potentially you get twice the whole dataset in RAM.

we'll have to try it at some point. you could attach a JSON here with some requests (you'd probably need to zip them up), that would surely help and you already have it. 20 requests or so is enough actually. if it's a potential memory leak it will show up when running those thousands of times over. really busy with other things these days, we'll get to it at some point though:)

elliveny commented 2 years ago

Thanks Nils - I'm sharing this against one of the upstream issues you referenced (https://github.com/valhalla/valhalla/issues/3556) to make the people there aware of these conclusions.

I take your point and will start a test today using a repeating set of requests; it'll surely be easier to share if it's as few as 20 requests!

ktnr commented 1 year ago

I encountered the same behavior as described here, in https://github.com/valhalla/valhalla/issues/3556#issuecomment-1058244884 and in #63.

I run gisops/valhalla:latest in WSL 2 on my local machine. I query the matrix endpoint

curl http://localhost:8002/sources_to_targets --data 

with 150 locations in Germany: germany_150x150.zip.

The first request takes about a minute to complete and the docker container reports 4.6 GB of RAM usage. I repeat the exact same request; again it takes about a minute to complete, but RAM usage increases to 9 GB. Third time: 13 GB...
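
In case it helps to reproduce, this is roughly how I watch it (the container name and payload file name are placeholders for my local setup):

# send the identical matrix request three times and print the container's
# memory after each one; container name and payload file are placeholders
for i in 1 2 3; do
  curl -s -o /dev/null http://localhost:8002/sources_to_targets \
    -H 'Content-Type: application/json' --data @germany_150x150.json
  docker stats --no-stream --format '{{.Name}}: {{.MemUsage}}' valhalla
done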

Let me know if I can provide more data to help.

nilsnolde commented 1 year ago

Argh.. This keeps being a topic.. Thanks for the details. I'll try to reproduce it one day, not much time right now. Can you share your full valhalla config json? That'll definitely help. Also you could try the native valhalla docker image: https://hub.docker.com/r/valhalla/valhalla/tags. It's more of a hassle, there's not much magic in there, but it would be interesting to see if it behaves the same for you with the same valhalla config. And if not, I don't know what the problem could possibly be. But again, I've seen that sort of stuff before with vastly different software (a PostGIS container).
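
roughly, running the native image against an existing tile build looks like this (the tag and the mount path are assumptions on my side, check the tags link and the paths inside your valhalla.json):

# mount the folder holding valhalla.json + tiles and start the service yourself
docker run -d --name valhalla_native -p 8002:8002 \
  -v $PWD/custom_files:/custom_files \
  valhalla/valhalla:run-latest \
  valhalla_service /custom_files/valhalla.json 2

mounting at the same /custom_files path this image uses should mean the absolute paths inside an existing valhalla.json keep resolving.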

ktnr commented 1 year ago

Sure: valhalla-config.zip. I had to increase some max values in the config to have it return a result.

I'll put trying the native valhalla image on my list. Although, based on the past reports here, I expect it to work fine with the native image. Really quite odd that it only appears to leak while using docker.

Maybe it's not even a leak but the data is just loaded into memory multiple times? I am no docker expert though, so I couldn't possibly speculate...

nilsnolde commented 1 year ago

> I expect it to work fine with the native image.

Yeah that's what I heard so far as well.

> Really quite odd that it only appears to leak while using docker.

Well, the native image is also being run in docker, right.. So it doesn't even seem to be a docker problem per se. There's a lot of memory mapping happening, especially if your RAM is less than the tileset's size, but for the same requests it shouldn't map other/more regions. Then there's the "volatile" stuff, where we keep big chunks of memory reserved for the algorithm to work properly, but that definitely shouldn't increase over requests when they are the same.

Also, we're not using any docker deployment in any prod environment, so it's more curiosity for me, and that doesn't go a very long way when there's a lot of paid work to do, unfortunately.

ktnr commented 1 year ago

Tested on the native image and it works without problems... I even used the same files, i.e., the preprocessed files that were generated from running the gis-ops image.

EDIT: both images exhibit the same behavior when server_threads is the same. I mistakenly thought the memory increase was not present in the native image because the compose file I used set server_threads to 1 and I just didn't realise that was the case (the server_threads option is specified in a different place for gis-ops/valhalla and valhalla/valhalla).
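
To illustrate where the thread count ends up in each case (env var name as used by this repo's entrypoint, tags/paths as in my setup above):

# gis-ops image: threads come from an environment variable
docker run -d -p 8002:8002 -v $PWD/custom_files:/custom_files -e server_threads=2 gisops/valhalla:latest
# native image: threads are the positional argument to valhalla_service
docker run -d -p 8002:8002 -v $PWD/custom_files:/custom_files valhalla/valhalla:run-latest valhalla_service /custom_files/valhalla.json 2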

nilsnolde commented 1 year ago

WTF :joy: Thanks for confirming! Then I honestly can't do much about it I fear.. Can only hope some serious docker expert can chime in at some point.

nilsnolde commented 1 year ago

See discussion in https://github.com/gis-ops/docker-valhalla/issues/81