tidwall / tile38

Real-time Geospatial and Geofencing
https://tile38.com
MIT License

Intermittent spikes in memory usage causing crashes #528

Open autom8ter opened 4 years ago

autom8ter commented 4 years ago

Describe the bug
We are observing intermittent spikes in memory usage that cause Tile38 to crash and restart in our Kubernetes cluster. Memory normally hovers around 45 MB but occasionally spikes to over 1 GB. Each crash wipes all of Tile38's in-memory channels, so our geofencing stops working entirely until we re-add every channel.

[memory usage graph showing the spikes]

To Reproduce
Can't reproduce reliably: the spikes are intermittent and show no noticeable correlation with request load.

Expected behavior
Memory usage stays stable and Tile38 does not crash or lose its channels.

Logs
These are the only logs from the crashed instance:

_____ _______
  |       |       |
  |____   |   _   |   Tile38 1.19.3 (d48dd22) 64 bit (amd64/linux)
  |       |       |   Port: 9851, PID: 1
  |____   |   _   | 
  |       |       |   tile38.com
  |_______|_______| 
2020/01/22 22:13:09 [INFO] Server started, Tile38 version 1.19.3, git d48dd22
2020/01/22 22:13:09 [INFO] AOF loaded 0 commands: 0.00s, 0/s, 0 bytes/s
2020/01/22 22:13:09 [INFO] Ready to accept connections at [::]:9851
2020/01/22 22:13:12 [INFO] live 10.16.8.144:44570
2020/01/22 22:13:13 [INFO] live 10.16.10.79:60106
2020/01/22 22:13:14 [INFO] live 10.16.6.11:48644
2020/01/22 22:14:12 [INFO] not live 10.16.8.144:44570
2020/01/22 22:14:12 [INFO] live 10.16.8.144:45338
2020/01/22 22:14:13 [INFO] not live 10.16.10.79:60106
2020/01/22 22:14:13 [INFO] live 10.16.10.79:60464
2020/01/22 22:14:14 [INFO] not live 10.16.6.11:48644
2020/01/22 22:14:14 [ERRO] read tcp 10.16.6.207:9851->10.16.6.11:48644: read: connection reset by peer
2020/01/22 22:14:14 [INFO] live 10.16.6.11:49388
2020/01/22 22:15:12 [INFO] not live 10.16.8.144:45338
2020/01/22 22:15:12 [INFO] live 10.16.8.144:46100
2020/01/22 22:15:13 [INFO] not live 10.16.10.79:60464
2020/01/22 22:15:13 [INFO] live 10.16.10.79:32880
2020/01/22 22:15:14 [INFO] not live 10.16.6.11:49388
2020/01/22 22:15:14 [INFO] live 10.16.6.11:50124
2020/01/22 22:16:12 [INFO] not live 10.16.8.144:46100
2020/01/22 22:16:12 [INFO] live 10.16.8.144:46822
2020/01/22 22:16:13 [INFO] not live 10.16.10.79:32880
2020/01/22 22:16:13 [INFO] live 10.16.10.79:33510
2020/01/22 22:16:14 [INFO] not live 10.16.6.11:50124
2020/01/22 22:16:14 [INFO] live 10.16.6.11:50850
2020/01/22 22:17:12 [INFO] not live 10.16.8.144:46822
2020/01/22 22:17:12 [INFO] live 10.16.8.144:47558
2020/01/22 22:17:13 [INFO] not live 10.16.10.79:33510
2020/01/22 22:17:13 [INFO] live 10.16.10.79:33866
2020/01/22 22:17:14 [INFO] not live 10.16.6.11:50850
2020/01/22 22:17:14 [ERRO] read tcp 10.16.6.207:9851->10.16.6.11:50850: read: connection reset by peer
2020/01/22 22:17:14 [INFO] live 10.16.6.11:51588
2020/01/22 22:17:23 [INFO] not live 10.16.8.144:47558
2020/01/22 22:17:23 [INFO] not live 10.16.10.79:33866
2020/01/22 22:17:24 [INFO] not live 10.16.6.11:51588
2020/01/22 22:17:25 [INFO] live 10.16.7.82:35490
2020/01/22 22:17:32 [INFO] live 10.16.10.80:44218
2020/01/22 22:17:34 [INFO] live 10.16.6.12:37972
2020/01/22 22:18:17 [INFO] not live 10.16.6.12:37972
2020/01/22 22:18:17 [ERRO] read tcp 10.16.6.207:9851->10.16.6.12:37972: read: connection reset by peer
2020/01/22 22:18:18 [INFO] not live 10.16.7.82:35490
2020/01/22 22:18:18 [ERRO] read tcp 10.16.6.207:9851->10.16.7.82:35490: read: connection reset by peer
2020/01/22 22:18:18 [INFO] not live 10.16.10.80:44218
2020/01/22 22:18:21 [INFO] live 10.16.8.145:59276
2020/01/22 22:18:22 [INFO] live 10.16.10.81:58080
2020/01/22 22:18:24 [INFO] live 10.16.6.13:42196
2020/01/22 22:19:54 [INFO] not live 10.16.6.13:42196
2020/01/22 22:19:57 [INFO] live 10.16.6.13:43516

Operating System (please complete the following information): Linux (amd64), running in a Kubernetes cluster.

Additional context
We have about 1,000 geofenced sites and receive roughly 100 site events per minute over a Redis channel, consumed with the standard go-redis client (https://godoc.org/github.com/go-redis/redis). There is no dramatic increase in requests made to Tile38 during the spikes.
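For reference, a minimal sketch of one way such a channel can be consumed, assuming the go-redis client subscribes directly to a Tile38 geofence channel over the Redis protocol (Tile38 speaks RESP). The address and channel name below are placeholders, not the actual configuration described above:

    package main

    import (
        "fmt"

        "github.com/go-redis/redis"
    )

    func main() {
        // Tile38 speaks the Redis protocol, so a go-redis client can be
        // pointed at the Tile38 port and SUBSCRIBE to a geofence channel
        // created earlier with SETCHAN.
        client := redis.NewClient(&redis.Options{
            Addr: "tile38:9851", // placeholder service address
        })

        // "site_events" is a placeholder channel name.
        pubsub := client.Subscribe("site_events")
        defer pubsub.Close()

        for msg := range pubsub.Channel() {
            // Each payload is a JSON geofence notification (enter/exit/etc.).
            fmt.Println(msg.Payload)
        }
    }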

tidwall commented 4 years ago

I wonder what is causing those quick spikes.

I would like to reproduce this. The log and graph help, as do the details on the number of geofences and events/min.

Could you also provide an example of what a typical geofence and SET object look like in your system?
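For readers unfamiliar with Tile38, a hedged illustration of what such commands can look like when issued through go-redis; every key, ID, coordinate, and channel name below is made up for the sketch and is not the reporter's data:

    package main

    import "github.com/go-redis/redis"

    func main() {
        // Placeholder Tile38 address; Tile38 accepts RESP commands like Redis.
        c := redis.NewClient(&redis.Options{Addr: "tile38:9851"})

        // A moving object is typically written with SET ... POINT.
        c.Do("SET", "fleet", "truck1", "POINT", 33.5123, -112.2693)

        // A geofence channel is registered with SETCHAN; here an enter/exit
        // fence against a (made-up) site polygon.
        c.Do("SETCHAN", "site_123_chan", "INTERSECTS", "fleet",
            "FENCE", "DETECT", "enter,exit",
            "OBJECT", `{"type":"Polygon","coordinates":[[[-112.27,33.51],[-112.26,33.51],[-112.26,33.52],[-112.27,33.52],[-112.27,33.51]]]}`)
    }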

Also, are you using a swap file? Not that a swap file would fix the underlying problem, but in the meantime it might alleviate the memory pressure and keep the Linux out-of-memory killer from firing.

tidwall commented 4 years ago

@autom8ter Were you able to resolve this issue on your side?