maptiler / tileserver-gl

Vector and raster maps with GL styles. Server side rendering by MapLibre GL Native. Map tile server for MapLibre GL JS, Android, iOS, Leaflet, OpenLayers, GIS via WMTS, etc.
https://tileserver.readthedocs.io/en/latest/

Zombie processes after recent update #1236

Open boldtrn opened 6 months ago

boldtrn commented 6 months ago

We recently updated from 4.5.1 to 4.10.3. Since the update we have seen significant performance issues with our tile server. One thing that stands out to me is that we are getting zombie processes. We are using the Docker image.

The zombie processes appear to be node commands, so maybe an issue was introduced along the way?

ps aux | grep 'Z'
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
systemd+    1551  0.0  0.0      0     0 ?        Z    09:46   0:00 [node] <defunct>
systemd+   66028  0.0  0.0      0     0 ?        Z    16:30   0:00 [node] <defunct>
systemd+   93659  0.0  0.0      0     0 ?        Z    19:26   0:00 [node] <defunct>
systemd+   94458  0.0  0.0      0     0 ?        Z    19:33   0:00 [node] <defunct>
systemd+   96641  0.0  0.0      0     0 ?        Z    19:46   0:00 [node] <defunct>
systemd+  101128  0.0  0.0      0     0 ?        Z    20:15   0:00 [node] <defunct>
systemd+  101942  0.0  0.0      0     0 ?        Z    20:21   0:00 [node] <defunct>
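
A side note on the command above: `grep 'Z'` matches a capital Z anywhere in the line (user name, command line), not only in the STAT column, and `ps aux` does not show the parent PID. A minimal sketch of a stricter filter, assuming the procps version of `ps` is available on the host; `list_zombies` is a hypothetical helper name, not something tileserver-gl ships:

```shell
# Filter `ps -eo pid,ppid,stat,comm` output down to zombie processes.
# Only the third column (STAT) is checked, so a 'Z' elsewhere in the line
# does not produce a false match. The header line is kept for readability.
list_zombies() {
  awk 'NR == 1 || $3 ~ /^Z/'
}

# Usage (the PPID column shows which parent has not reaped the child yet):
#   ps -eo pid,ppid,stat,comm | list_zombies
```

The parent PID is usually the more useful piece of information here, because a zombie only disappears once its parent collects its exit status or the parent itself exits.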

acalcutt commented 6 months ago

I am not aware of any change that would cause that, but can you check whether the just-released 4.11.0 helps?

boldtrn commented 6 months ago

We had to revert to 4.5.1 for now. I will give this another try soon :)

acalcutt commented 6 months ago

Where are you running the command above: inside the Docker container, or on the host that Docker is running on? Does it take time for the zombies to build up like that?

boldtrn commented 6 months ago

I ran this on the host. I believe these might have been Docker processes that were killed or something similar, as the user is systemd+ and Docker is managed by systemd, but this is only an idea at this point.
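
One way to test that idea from the host is to look up each zombie's parent and cgroup under /proc. Keep in mind that `ps` abbreviates long user names with a trailing `+`, and the host resolves a container's numeric UID against its own user database, so the user column alone may not identify the owner. The sketch below is Linux-only, and `ppid_of` and `describe_process` are hypothetical helper names; whether the cgroup path mentions Docker depends on the cgroup driver in use.

```shell
# Print the parent PID of a process, read from /proc (Linux only).
ppid_of() {
  awk '/^PPid:/ { print $2 }' "/proc/$1/status"
}

# Print the parent PID, the parent's command name, and the cgroup
# membership of one PID. For a container process the cgroup path
# typically contains the container ID.
describe_process() {
  parent=$(ppid_of "$1")
  printf 'pid=%s ppid=%s parent_comm=%s\n' \
    "$1" "$parent" "$(cat "/proc/$parent/comm" 2>/dev/null)"
  cat "/proc/$1/cgroup"
}

# Usage, with a PID taken from the ps listing:
#   describe_process 1551
```

If the parent turns out to be the container's main process, the zombies are children that this process has not reaped; if it is something else, the Docker theory needs another look.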

boldtrn commented 6 months ago

I can verify that the error still persists with the latest release. I can't see anything obvious in the logs, but I have to admit it's a production system, so there are a lot of logs. If you have a hint about what to search for in the logs, I can give this a try. Obvious terms like ERROR or FATAL did not turn up anything interesting.

acalcutt commented 6 months ago

Unfortunately I don't have any good answers on what to look for. If I had to guess, it would be a rendering issue, since rendering starts its own threads. I find that when maplibre-native has an issue, it doesn't always give back an error.

When I am troubleshooting things like that, I try to find a URL that isn't loading as expected. I then test that URL in a more controllable instance. Usually in testing I uncomment https://github.com/maptiler/tileserver-gl/blob/master/src/serve_rendered.js#L874 to get an idea of what is being loaded when maplibre-native fails.

Have you seen anything that is failing to load with the new version? You were using static images, right?
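
The "find a URL that isn't loading as expected" step could be sketched roughly as below, assuming `curl` is installed and a list of candidate URLs has been collected somewhere. `fetch_status`, `failing_urls`, and `urls.txt` are hypothetical names used for illustration only, not part of tileserver-gl.

```shell
# Return the HTTP status code for one URL ("000" if the request itself
# fails, for example on a timeout or a refused connection).
fetch_status() {
  curl -s -o /dev/null --max-time 30 -w '%{http_code}' "$1"
}

# Read URLs from stdin, one per line, and print "<status> <url>" for
# every URL that does not answer with HTTP 200. Empty lines are skipped.
failing_urls() {
  while IFS= read -r url; do
    [ -n "$url" ] || continue
    code=$(fetch_status "$url")
    [ "$code" = "200" ] || printf '%s %s\n' "$code" "$url"
  done
}

# Usage (urls.txt is a hypothetical file with one tile or static-image
# URL per line):
#   failing_urls < urls.txt
```

Any URL this reports can then be replayed against a test instance. A 200 response does not prove the render was correct, so this only catches requests that fail outright or time out.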

boldtrn commented 6 months ago

Have you seen anything that is failing to load with the new version? You were using static images, right?

We are using raster and vector tiles as well as static images. I haven't seen anything failing, but we are serving several million tile requests per day, so it's hard to track down isolated issues. We had some performance issues, but I doubt these are related to the version. We are currently running different versions of tileserver-gl, and CPU and other resource usage looks roughly similar (the latest version actually seems to use about 5% fewer resources).

acalcutt commented 6 months ago

Just an FYI, I did find an issue in the Docker build caused by the change to use "is-ci" when dev utils were not included. I switched that back to the old method in https://github.com/maptiler/tileserver-gl/pull/1250. That should be fixed in 4.11.1.

I'm not sure it has anything to do with your issue, but I thought it could be a possibility.

boldtrn commented 6 months ago

I will give this a try, thanks 👍

boldtrn commented 5 months ago

OK, I think the latest update, 4.11.1, did indeed fix the zombie processes; I haven't seen them since. Thanks for looking into this @acalcutt 👍

boldtrn commented 5 months ago

Unfortunately, I have to reopen this issue. Zombie processes reappeared yesterday on one of our servers. The container even went down and we had to restart it. Again, the logs did not show anything new.

[Attached image: Screenshot 2024-06-15 at 09 04 03]