
Potential memory leak in Matrix API #3556

Closed - nmalasevic closed this issue 1 year ago

nmalasevic commented 2 years ago

Hello,

When building larger matrices (>250) I am seeing memory leaks all over the place, and it doesn't seem to be a caching issue, since I am replaying the same matrices and memory usage increases linearly. These larger matrices also use a huge amount of memory, for example a 572x572 matrix uses 12+ GB of RAM (and doesn't release it once done) - is this normal? If any data/logs are needed I can provide them!

Kind regards!

ImreSamu commented 2 years ago

Also, when doing these bigger matrices there is a huge memory usage, for example 572x572 is using 12+GB of RAM is this normal?

What is the BBOX of the 572x572 matrix? (a rectangular polygon from the minimum and maximum coordinate values of your matrix)

If any data/log is needed I can provide!

Please document at least the minimal info.

nmalasevic commented 2 years ago

Hi,

I am using the latest gis-ops Docker container, so the Valhalla version is 3.1.4. It's a city-size matrix for the city of Lisboa. I am attaching the actual request (POST data) and the Valhalla config to this message (had to zip them, since GitHub doesn't accept *.json).

Also, nothing strange in the logs, just a bunch of: [WARN] Local index 10 exceeds max value of 7, returning heading of 0

Thanks!

valhalla_matrix.zip

ImreSamu commented 2 years ago

I am using latest gis-ops docker container, so Valhalla version is 3.1.4

Thanks,

please show your docker images digest :-)

$ docker images --digests  gisops/valhalla
REPOSITORY        TAG       DIGEST                                                                    IMAGE ID       CREATED        SIZE
gisops/valhalla   latest    sha256:e9e27e41abac53815bce765a1798434f759d54167bab9bb260f88710ed5727c0   9144a47ac815   3 days ago     414MB
gisops/valhalla   3.1.4     sha256:7ff85a4951dd9e1c9bf00edb53bbfb27b4b11127a05b26610b666e133900b603   ec8f928a2c26   4 months ago   388MB

And Valhalla git hash:

$ docker run -it --rm --entrypoint ""   gisops/valhalla:latest  cat /usr/local/src/valhalla_version
https://github.com/valhalla/valhalla/tree/2e5db62fa7d2ae9775d5209b10d02eac1541ee02

It's city-size matrix for city of Lisboa.

please give some info about your setup parameters

ImreSamu commented 2 years ago

It's city-size matrix for city of Lisboa.

imho - it is a little bigger .. (please verify me)

https://gist.github.com/ImreSamu/908e79c8f4faa1404623e4f7b0916fe6

$ cat valhalla_matrix_request.json  | grep lon | cut -d'"' -f4 | sort -n | (head -n1 && tail -n1)
-9.44414
-7.8621
$ cat valhalla_matrix_request.json  | grep lat | cut -d'"' -f4 | sort -n | (head -n1 && tail -n1)
37.92608
41.69507
nmalasevic commented 2 years ago

please show your docker images digest :-)

$ docker images --digests gisops/valhalla
REPOSITORY        TAG       DIGEST                                                                    IMAGE ID       CREATED      SIZE
gisops/valhalla   latest    sha256:e9e27e41abac53815bce765a1798434f759d54167bab9bb260f88710ed5727c0   9144a47ac815   3 days ago   414MB

And Valhalla git hash

$ docker run -it --rm --entrypoint ""   gisops/valhalla:latest  cat /usr/local/src/valhalla_version
https://github.com/valhalla/valhalla/tree/2e5db62fa7d2ae9775d5209b10d02eac1541ee02

please give some info about your setup parameters

It's not only Portugal - the server has OSM data for Portugal, Germany, and Serbia.

nmalasevic commented 2 years ago

imho - it is a little bigger .. ( please verify me )

You are correct! Some of my drivers (VRPTW) are outside of Lisboa, or the data is not properly geocoded - I will check this!

But I get the same behaviour even with smaller datasets. I am attaching a smaller Porto dataset (220x220) with only the city of Porto included, and I get the same memory leak (although overall memory usage is smaller). The first run uses ~2.6 GB of memory and each subsequent run adds about ~1.3 GB without freeing previously used memory.

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {},
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [
              -8.70765,
              40.9699554
            ],
            [
              -8.4952423,
              40.9699554
            ],
            [
              -8.4952423,
              41.25775
            ],
            [
              -8.70765,
              41.25775
            ],
            [
              -8.70765,
              40.9699554
            ]
          ]
        ]
      }
    }
  ]
} 

porto_matrix_request.zip

kevinkreiser commented 2 years ago

i mean if it's a leak, even a single request should show the issue under valgrind, so we should try to narrow it down to the smallest interaction with the api that exposes the issue. i don't have time to check it at the moment but it should be pretty straightforward to just run matrix under valgrind to see if it detects leaks that are in our own code. the only other question i have is which matrix algorithm we are using, as they both have very different memory characteristics. i guess i'll poke around at it a bit tonight
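
For reference, a minimal way to run that check - a sketch only, with placeholder paths for the config and request file:

$ valgrind --leak-check=full ./valhalla_service /path/to/valhalla.json 1
$ # in a second shell, send a single matrix request, then stop the service with Ctrl+C to get the leak report
$ curl -X POST -H "Content-Type: application/json" -d @matrix_request.json http://localhost:8002/sources_to_targets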

dnesbitt61 commented 2 years ago

CostMatrix uses a lot of memory - it has edge status, edge labels, and adjacency lists for each location (actually two per location: one when it is used as a source and one when it is used as a target). One possibility is that the Clear method is only called at the start of each call (to clear the memory from the last call) - I don't think it is cleared at the end of each call. Clearing the temporary memory can be expensive. I think the initial thought was to somehow clear the memory after serialization occurs (to minimize latency), but that was likely never completed.
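
To illustrate the ordering being described (clear the temporaries only after the response has been serialized), here is a self-contained toy sketch - ToyCostMatrix and HandleMatrixRequest are made-up names, not Valhalla's actual classes:

  #include <cstddef>
  #include <string>
  #include <vector>

  // Toy stand-in for the per-location temporaries listed above (edge labels,
  // adjacency lists, edge status); two sets per location in the real code.
  struct ToyCostMatrix {
    std::vector<std::vector<int>> edge_labels_;
    void Compute(std::size_t locations) {
      edge_labels_.assign(2 * locations, std::vector<int>(1000, 0));  // grows with matrix size
    }
    void Clear() {
      edge_labels_.clear();
      edge_labels_.shrink_to_fit();  // actually return the memory
    }
  };

  // "Clear after serialization": build the response first, then release the
  // temporaries, so the next request does not pay the clearing cost up front.
  std::string HandleMatrixRequest(ToyCostMatrix& matrix, std::size_t locations) {
    matrix.Compute(locations);
    std::string response = "{\"matrix_size\":" + std::to_string(locations) + "}";
    matrix.Clear();
    return response;
  }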

kevinkreiser commented 2 years ago

@dnesbitt61 the way the service code is structured should allow for what you described (clearing after serialization but before answering the next request); it's how all the other methods work. or maybe i'm misunderstanding what you're saying here

dnesbitt61 commented 2 years ago

I could be wrong. It has been a long time since I've looked at matrix code. It looks like the CostMatrix object goes out of scope (in thor/matrix_action) which hopefully will clear memory.

nmalasevic commented 2 years ago

Something is definitely going on here. I am attaching the big request (572x572) with fixed geocoding (the one I previously shared, now it's city-wide). After every request the memory doesn't get released, so after a couple of consecutive requests the server simply crashes. lisboa_matrix_request.zip

nilsnolde commented 2 years ago

Can you try to build Valhalla from source on your server (or locally)? I'd like to rule out some docker weirdo stuff. I've definitely seen that happen before with other images (specifically a Postgres image): a constant increase in RAM even though there's no obvious reason software-side. If you're that far and it's still happening, then trying valgrind's memcheck to track memory leaks is easy: https://valgrind.org/docs/manual/quick-start.html

nilsnolde commented 2 years ago

are you using some external monitoring tools like datadog? that was the case for me back then, but I never got to really check and debug.. not sure if that could even be related, but I also never had issues ever before with using that postgres image on other machines & projects.

nmalasevic commented 2 years ago

I will try to build it from source as soon as I get some time. I am not using any special monitoring tools, just watching free -m in the Linux console.

kevinkreiser commented 2 years ago

so after a couple of consecutive requests the server simply crashes

that does sound like a problem to me haha. i didn't get a chance to verify last night but i'll add it to the list!

nmalasevic commented 2 years ago

Ok, I had some time to play with this. I built it from source and here are my findings:

Running natively:

Running in a Docker container:

Based on this, two things:

  1. Is it ok that not all memory gets released (even when running natively)? I guess there is some buffering/caching going on?
  2. I will try to build a vanilla Valhalla container and see what the behaviour is while running inside it, compared to the gis-ops one.

nilsnolde commented 2 years ago

interesting, thanks for the analysis! really wonder what the heck is going on with a docker deployment.. do you see the same for regular routing? if you make some long requests it should show up fairly quickly as well.

valhalla caches lots of tiles, but 4 GB sounds like too much (and likely exceeds your configured cache settings, which default to 1 GB I think). but I can't say much, let's see what kevin says.

as to 2., you don't have to build it yourself: https://hub.docker.com/r/valhalla/valhalla/tags. choose one of the run- tags, our gis-ops image is based on that too. no "magic" here though, you'll have to mount the volume and exec into the container to start the service.
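
For anyone following along, a rough sketch of what that could look like - the run-latest tag, the mount path, and the config filename below are placeholders, so check the image docs for the actual conventions:

$ docker run -dt --name valhalla -p 8002:8002 -v $PWD/valhalla_tiles:/data --entrypoint bash valhalla/valhalla:run-latest
$ docker exec -it valhalla valhalla_service /data/valhalla.json 2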

nmalasevic commented 2 years ago

choose one of the run- tags, our gis-ops image is based on that too. no "magic" here though, you'll have to mount the volume and exec into the container to start the service.

Hah, this is exactly what I just did :) And the results I got match the "native" behaviour - ~4 GB is still cached, but subsequent requests only increase and decrease the memory by ~1 GB, so at least my server is not crashing anymore. I guess something is happening with the gis-ops container... I am able to reproduce this behaviour on multiple instances (both local and remote servers).

nilsnolde commented 2 years ago

ouch!! some of that "magic" biting us I guess.. thanks for letting us know, we'll see what we can do about it.

just to confirm: the requests showing the residual 4 GB of memory were the matrix requests only covering Porto, a single city?

nmalasevic commented 2 years ago

Nope, it's when running the Lisboa dataset (the bigger one, 572x572, which can be found here: https://github.com/valhalla/valhalla/issues/3556#issuecomment-1056566140), but it's city-wide - I fixed the locations and now it's 100% city-wide.

Also, the problem with the container is not the 4 GB of memory that's not released; it's the fact that on subsequent requests none of that cached memory is reused at all, so after two requests you have 8 GB of cached memory, and it keeps increasing until it crashes.

nilsnolde commented 2 years ago

yeah sure, I wasn't asking about the gis-ops image. 4 GB of RAM just sounds way too much to me in terms of (tile?) cache for a single city (regardless of whether it's Porto or Lisbon). the gis-ops image has other problems apparently.. not looking forward to that one..

dnesbitt61 commented 2 years ago

I doubt most of the memory use here is for tile cache/data lookup. With a 512x512 matrix there would be 1024 EdgeLabel lists, 1024 adjacency lists (double bucket queue), and 1024 EdgeStatus lookups - I suspect the majority of the memory use comes from those temporary structures. The way CostMatrix is called within matrix_action.cc, it seems that all temporary memory would be cleaned up when the CostMatrix object goes out of scope (code below):

  auto costmatrix = [&]() {
    thor::CostMatrix matrix;
    return matrix.SourceToTarget(options.sources(), options.targets(), *reader, mode_costing, mode,
                                 max_matrix_distance.find(costing)->second);
  };

One part I am unsure of is that the adjacency lists are: std::vector<std::shared_ptr<baldr::DoubleBucketQueue<sif::BDEdgeLabel>>> target_adjacency_; - I am not sure how the shared_ptr allocation(s) behave when the object goes out of scope.

kevinkreiser commented 2 years ago

does the vector go out of scope as well? if so then they would be deallocated, unless something else has a copy of that shared pointer. they basically use reference counting to determine whether they will deallocate or not
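
For illustration, a small self-contained example of that reference-counting behaviour using plain standard-library types (not Valhalla's):

  #include <iostream>
  #include <memory>
  #include <vector>

  int main() {
    std::shared_ptr<std::vector<int>> extra_copy;
    {
      // a member-style vector of shared_ptrs, analogous to target_adjacency_
      std::vector<std::shared_ptr<std::vector<int>>> adjacency;
      adjacency.push_back(std::make_shared<std::vector<int>>(1000, 0));
      adjacency.push_back(std::make_shared<std::vector<int>>(1000, 0));
      extra_copy = adjacency[0];                      // something else keeps a copy
      std::cout << adjacency[0].use_count() << "\n";  // prints 2
    }  // vector goes out of scope: the second element's count drops to 0 and it is
       // freed; the first survives only because extra_copy still references it
    std::cout << extra_copy.use_count() << "\n";      // prints 1
    return 0;
  }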

dnesbitt61 commented 2 years ago

Running valgrind with valhalla_service and a planet data set (no tar file) with 1 thread: valgrind ./valhalla_service ../../conf/new_planet.json 1

I use the attached porto_matrix_request.json request: curl -X POST -H "Content-Type: application/json" -d @./porto_matrix_request.json http://localhost:8002/sources_to_targets

Note: I see memory allocation of ~1.5 - 1.7 GB with this request (likely mostly from the temporary objects)

==246004== Process terminating with default action of signal 2 (SIGINT)
==246004==    at 0x4F09CD7: __pthread_clockjoin_ex (pthread_join_common.c:145)
==246004==    by 0x4C89046: std::thread::join() (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==246004==    by 0x17EDB9: main (in /data/sandbox/valhalla/build/valhalla_service)
==246004== 
==246004== HEAP SUMMARY:
==246004==     in use at exit: 125,278,114 bytes in 215,192 blocks
==246004==   total heap usage: 30,485,863 allocs, 30,270,671 frees, 4,567,767,493 bytes allocated
==246004== 
==246004== LEAK SUMMARY:
==246004==    definitely lost: 0 bytes in 0 blocks
==246004==    indirectly lost: 0 bytes in 0 blocks
==246004==      possibly lost: 298,257 bytes in 3,771 blocks
==246004==    still reachable: 124,979,857 bytes in 211,421 blocks
==246004==                       of which reachable via heuristic:
==246004==                         multipleinheritance: 31,296 bytes in 24 blocks
==246004==         suppressed: 0 bytes in 0 blocks
==246004== Rerun with --leak-check=full to see details of leaked memory

dnesbitt61 commented 2 years ago

I changed my local code to not use shared_ptr within the adjacency list vectors. To @kevinkreiser's question - all the vector objects are class members, and I assume that when the local CostMatrix object goes out of scope these are deallocated.

kevinkreiser commented 2 years ago

yep they would, sorry i'm not looking at the code at the moment

nmalasevic commented 2 years ago

I use the attached porto_matrix_request.json request:

What happens when you run it with a bigger dataset? One example is attached here: https://github.com/valhalla/valhalla/issues/3556#issuecomment-1056566140

In my case, the bigger the dataset, the more memory stays unreleased...

dnesbitt61 commented 2 years ago

I am using a planet dataset. Running with the larger request and 1 thread: ./valhalla_service ../../conf/new_planet.json 1 - I see memory grow to 9.5 GB after the first request (which takes 94 seconds on my laptop). After the 2nd request the memory shows (using top) as 9.8 GB. It has not gone above 9.8 GB over several more runs since. Are you running with multiple threads? If so, are you using the global_synchronized_cache? I think that would share the tile cache between threads - though I suspect the memory cached for tiles is pretty small, so the impact there isn't likely to be too large. This is NOT using Docker, by the way.

nmalasevic commented 2 years ago

I am running a smaller dataset (only three countries). Now debugging with valgrind natively (NOT docker) and I get similar behaviour to you (the leak is happening in the gis-ops container; the vanilla Valhalla one, as well as native, are fine). I am running it with a single thread, and my question was: is it ok that memory stays at those high values (even though it's not growing past a certain value)? After the first request, only part of the memory gets released (per top), but a big part of it still seems to be allocated. For subsequent requests it grows roughly only by the previously released value (a similar chunk gets released after every request finishes).

Also, thanks for being so active and supportive! I really appreciate it!

==32499== Process terminating with default action of signal 2 (SIGINT)
==32499==    at 0x4DC0019: __futex_abstimed_wait_common64 (futex-internal.c:57)
==32499==    by 0x4DC0019: __futex_abstimed_wait_common (futex-internal.c:87)
==32499==    by 0x4DC0019: __futex_abstimed_wait_cancelable64 (futex-internal.c:139)
==32499==    by 0x4DC5483: __pthread_clockjoin_ex (pthread_join_common.c:105)
==32499==    by 0x4AF28F6: std::thread::join() (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.29)
==32499==    by 0x18710E: main (in /usr/local/bin/valhalla_service)
==32499== 
==32499== HEAP SUMMARY:
==32499==     in use at exit: 46,561,439 bytes in 35,712 blocks
==32499==   total heap usage: 31,743,229 allocs, 31,707,517 frees, 8,793,435,524 bytes allocated
==32499== 
==32499== LEAK SUMMARY:
==32499==    definitely lost: 96 bytes in 1 blocks
==32499==    indirectly lost: 0 bytes in 0 blocks
==32499==      possibly lost: 313,212 bytes in 3,946 blocks
==32499==    still reachable: 46,248,131 bytes in 31,765 blocks
==32499==                       of which reachable via heuristic:
==32499==                         multipleinheritance: 36,288 bytes in 24 blocks
==32499==         suppressed: 0 bytes in 0 blocks
==32499== Rerun with --leak-check=full to see details of leaked memory

dnesbitt61 commented 2 years ago

I think it is fine that the listed memory is still high. I think the OS recovers memory in a lazy fashion - so you won't immediately see all allocated memory go away. At least that is my understanding - others may know more. Large matrices in Valhalla consume a lot of memory and take a while to complete - that is why the default server limits are pretty low. I have the large curl request in a loop and have hit my server ~12 times and I do not see any real memory growth (top has shown 9.8GB as the highest RES memory use that I have seen).

elliveny commented 2 years ago

I observed an apparent memory leak issue whilst working with docker-valhalla and it appears to be related to the issue discussed here. I've been able to demonstrate the same issue on an EC2 Ubuntu 20.04 8GB, 2 CPU instance in a 'native' (no docker) and built-from-source valhalla setup, producing installation notes and a 'crash-the-valhalla-service' test script of CURL requests. I've been sharing and recording the details in https://github.com/gis-ops/docker-valhalla/issues/58.

My test script is 268MB in size - it contains all-unique trace_attributes requests across 5+ node OSM ways taken from the britain-and-ireland data. @nilsnolde has suggested that a repeating set of a small number of requests (20 perhaps) ought to demonstrate the same issue, so I'm working to determine that today. Given I'm now working without docker, I'm thinking I'll update this issue in preference to https://github.com/gis-ops/docker-valhalla/issues/58, so I'll report back with my findings later today.

elliveny commented 2 years ago

Running the same tests with 20 repeating trace_attributes requests is showing a very different memory utilisation profile to my previous test:

[memory utilisation graph]

I've observed small drops in memory utilisation (you can't see these in the graph). I also have a ps -aux output from the beginning of the test:

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
ubuntu       938 38.3  6.3 11057292 518692 pts/0 Sl   10:37  10:34 valhalla_service valhalla.json 2

and a current one:

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
ubuntu       938 51.5  6.8 11057292 558028 pts/0 Sl   10:37  75:24 valhalla_service valhalla.json 2

Very little change is evident since the start of the test.

The difference between my full test script and the 'flat' profile which results from these repeating requests suggests to me that the memory leak isn't in the request handling/cache lookup code, but rather that it is related to loading new data into memory. My full test script uses ways (highways) across the whole of the britain-and-ireland data, about 260,000 unique requests, and it appears that the need for new data causes the memory utilisation to increase quickly.

My test run continues, but I'll stop it soon if it continues to show the same memory utilisation flatline.

elliveny commented 2 years ago

I switched to my full test script on the same already running valhalla_service and immediately saw the memory utilisation increase. It got to 96% then I stopped the client processes to see what happened if the service was left idle for a while.

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
ubuntu       938 41.8 96.0 12499084 7821644 pts/0 Sl  10:37  89:08 valhalla_service valhalla.json 2

After a few minutes nothing had changed. I kind of hoped that leaving it idle might allow some housekeeping/garbage collection to take effect, but nothing was evident.

After restarting the client processes using the full test script, memory utilisation stopped at about 96.2% for quite a while, suggesting most of the requests were being served from data already in memory, perhaps?

kevinkreiser commented 2 years ago

if you are using a memory mapped tar and it is larger than the size of your machine's physical ram, then the os will allocate just up to the brink of running out of ram and try to keep a little headroom so that processes don't get OOM killed. this is by design really. if you have other processes running on your computer/container that may take decent amounts of ram, they will have to fight with the OS for the ram it's trying to use for the router. if you send the same 20 requests over and over, the OS will only ever page into ram the data those 20 requests intersect, so it is easy to see why that wouldn't cause ram to increase. however if you send tons of different requests that touch all parts of your data extract, then you should expect to see the OS page stuff in until it has to start swapping between ram and FS cache.

we could have done something more complicated with memory mapping in valhalla to alleviate this (mapping only parts of the file, ensuring it would never be more than the system had) but this is not necessary. In my experience the OS does a good job of keeping a couple gigs of headroom for other things to happen and not letting one process take all the ram away.

the matrix api is different though: it not only uses the memory map but it also has to allocate a lot of stuff onto the heap dynamically as it runs. the larger the matrix, the more of this it does (it has to track all those paths). if you run lots of matrices all over your dataset so that the mmap uses tons of ram, but you also do very large matrices, those things will compete. i'm not sure exactly how to control mmap's greediness, but it would be nice to tell it to give up if a heap allocation would otherwise fail.. the problem there is that most of those allocations need to be contiguous and that could cause a problem potentially. i have to research this a bit to find out how it should be working and if we can change it effectively.

another option is to not use memory mapping but rather use the tile_dir mode. you can also enable a hard limit on the amount of ram it can keep allocated, so that the cache will be cleared (in an LRU fashion) if that limit is exceeded. at the end of the day though, we don't have a way to make sure you will never run out of ram. it's always possible to schedule a workload that will need more ram than you have.
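
For reference, a sketch of where those knobs live in the config under mjolnir - the key names are from memory of the valhalla_build_config defaults and the values are illustrative, so please verify against your own generated valhalla.json:

  {
    "mjolnir": {
      "tile_dir": "/data/valhalla_tiles",
      "tile_extract": "",
      "max_cache_size": 1073741824,
      "use_lru_mem_cache": true
    }
  }

Leaving tile_extract empty makes the service read tiles from tile_dir instead of a memory-mapped tar, and max_cache_size (in bytes, 1 GB here) caps the in-memory tile cache.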

elliveny commented 2 years ago

Thanks Kevin. I'm fairly new to Valhalla and so some of what you've said has given me pointers for more study and investigation. At this point I think it might be worth me providing a quick summary of what I'm trying to achieve and the path that led me to this issue and the investigation I'm pursuing.

Right now I'm working with a stream of GPS data generated from UK-based vehicles and I'm attempting to introduce Valhalla map matching to the data processing pipeline. To this end I set up the service using docker-valhalla and began testing the processing. Almost immediately I observed issues with the service running out of memory and crashing, which led me to begin asking questions over at https://github.com/gis-ops/docker-valhalla/issues/58, where @nilsnolde and @TimMcCauley have kindly been helping me. In my attempts to deal with the issue, I have changed the Valhalla configuration as much as seemed relevant and have increased the available memory and CPU in an attempt to stabilise the service, but I've had no success.

Given my lack of success I was keen to find a way to share and demonstrate the issue, but immediately I hit an obstacle regarding the ownership and privacy of the data I'm using. To overcome this I prepared a test script from OSM way (highway) data, generating a large script of CURL requests. This script allowed me to successfully reproduce the issue I'm seeing with customer data, allowing me to share what I'm seeing here.

I'm running my test using tiles generated from britain-and-ireland-latest.osm.pbf; these produce a 2.32 GB tar file. My Ubuntu instance has 8 GB of memory, 2 CPUs and 30 GB of disk. The Valhalla configuration is the default one generated by valhalla_build_config, as described in the README.md.

My goal is simply to set up a stable Valhalla service which can handle my data pipeline, and I'd appreciate any pointers which help me achieve that. Adding my woes to this issue seemed relevant and appropriate, but I'm quite happy to move my concern elsewhere if you think otherwise.

With regard to my test environment I can comment on a few specifics from what you've said:

if you are using a memory mapped tar and it is larger than the size of your machines physical ram

I'm not doing that, at least I don't think so. I can increase the memory in my instance if you'd recommend I do so.

if you send tons of different requests that touch all parts of your data extract then you should expect to see the OS page stuff in until it has to start swapping between ram and FS cache.

That makes sense. Would it be right to assume that this concern would be avoided if physical RAM is properly sized to allow the entire memory mapped tar to be loaded into working storage?

the matrix api is different though

This is where, perhaps, I've walked into an issue which isn't relevant to my particular concern. I apologise if so.

kevinkreiser commented 2 years ago

@elliveny did you want to open up a more pointed issue to discuss your particular case? though it does seem somewhat related to this issue, i think you'd be best served by another discussion in another issue particular to your problem with map matching

karlbeecken commented 1 year ago

Hi, using a native, self-compiled Valhalla with 16 workers (= number of threads), after a few days the service fills up the memory so much that the server is barely usable. While I can totally understand that caching for large requests may take up a larger amount of memory, I think the software should not keep using memory until the server crashes - it would be better to throw an out-of-memory error or something, from my point of view. I will happily provide more details if they're needed for a possible fix.

[memory usage graph]

nilsnolde commented 1 year ago

are you running with a tar file or a plain tiles dir?

if you're mem mapping the tar file, it really shouldn't happen that anything crashes because of OOM (at least not because of the graph). but if you're running the planet on 16 GB RAM, yeah, your RAM will be decently full, and if you run concurrent big-ish CostMatrix requests on top of that, I can imagine how it gets very bogged down. there's not much valhalla can do; anyone operating it should have a decent understanding of what it's doing and what infrastructure is needed (it took me a while too).

karlbeecken commented 1 year ago

I ran valhalla_build_extract -c valhalla.json -v during installation, so I suppose it tar'ed the files. Also it is "just" a Europe import, running on 64 GB of RAM (there's nothing else on the server).

karlbeecken commented 1 year ago

Also, what I don't really get is why the memory isn't released after the calculation. I can understand it being needed during the calculation, but it just doesn't go down after the calculation is finished.

kevinkreiser commented 1 year ago

@karlbeecken the general strategy in a long running server process is to allocate some heap and keep it around for the lifetime of the server so that you don't have to reallocate for every request. the reallocation is expensive and takes time, so we don't deallocate everything on every request. what we do, at least for the /route endpoints, is deallocate heap that goes beyond what we set our soft limit at. so we say "try" to use no more than e.g. 1GB for this labelset, and the code will allow a given request to grow beyond that to fulfill the request, and then it will shrink it back down to that soft limit after the request finishes.
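
A minimal sketch of that soft-limit idea in generic C++ (hypothetical names, not the actual Valhalla label-set code): grow as much as a request needs, then trim back down once it finishes:

  #include <cstddef>
  #include <vector>

  // Hypothetical reusable label container with a soft memory limit.
  struct LabelSet {
    static constexpr std::size_t kSoftLimit = 1000000;  // "try" to keep no more than this many
    std::vector<int> labels_;

    void GrowFor(std::size_t needed) {
      if (labels_.size() < needed)
        labels_.resize(needed);  // a big request is allowed to grow past the soft limit
    }

    void TrimAfterRequest() {
      if (labels_.capacity() > kSoftLimit) {
        labels_.clear();
        labels_.shrink_to_fit();      // hand the excess back to the allocator...
        labels_.reserve(kSoftLimit);  // ...but keep the soft-limit amount for reuse
      }
      // under the limit: keep the allocation so the next request avoids reallocating
    }
  };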

i would have expected matrix to do the same, but clearly it's not, or we've introduced some new memory leak. if it's the latter, we can easily check this with valgrind; if it's the former, it's just an oversight. in either case, once we have a very well established request set that elicits the behavior, we can work on whichever, if either, turns out to be the culprit. i personally haven't had the time to look into it but i'll re-open this so we can report on it once we do

karlbeecken commented 1 year ago

So I have this set of requests which manages to hang up my 16-thread, 64 GB server. I increased the service limits to allow these large requests though, so a default Valhalla instance won't accept them. query_string.txt

kevinkreiser commented 1 year ago

are you running with valhalla_service? do any of your requests return results? i see these are the traveling salesman style requests - have you tried just the sources_to_targets endpoint to see if it's actually the matrix or the simulated annealing afterwards that is causing it?

karlbeecken commented 1 year ago

I run it using valhalla_service valhalla.json <threads> (the server has 16 threads; I tried with that and also some lower numbers, which did not have an impact on the memory problem). valhalla.json is attached (you'll have to rename it, as GitHub does not allow JSON file uploads for some reason). valhalla.json.txt

And yes, the requests return results, up until the memory runs full and the CPU load is at 100%. Then the only thing happening is a request timeout from my client (not even the nginx on the server is responsive anymore at this stage).

I will try with sources_to_targets next week.

KlemenSpruk commented 1 year ago

Hello! @karlbeecken did sources_to_targets provide any different results? With the matrix API I can confirm the same memory issue you described.

nilsnolde commented 1 year ago

Actually there are no memory issues, and we should close this issue.

It all depends on how you run Valhalla, how many threads there are, and how big a request you handle. A single bidirectional matrix request can easily take up dozens of GB of RAM; it's just inherent to the matrix algorithm, and there is nothing we can reasonably do about it. This is unfortunately not the right routing engine to handle those at scale. See here for a bit more context: https://github.com/gis-ops/docker-valhalla/issues/81#issuecomment-1359435350