torrust / torrust-demo

CI and configuration for the Torrust live demo.
https://index.torrust-demo.com

Server is crashing due to memory swapping #1

Closed josecelano closed 5 months ago

josecelano commented 6 months ago

The droplet has started crashing, probably after the last update.

[screenshots]

Just after rebooting:

[screenshot]

After a while:

[screenshot]

It looks like the Index is using a lot of CPU.

I don't know the reason yet. I think this recent PR could cause it:

https://github.com/torrust/torrust-index/pull/530

Maybe the execution interval (100 milliseconds) is too short for this machine.
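For context, PR 530 adds a periodic job that imports torrent statistics from the tracker into the Index. As a rough illustration of why the interval matters (this is not the actual torrust-index code; `run_importer` and `import_batch` are invented names), an interval-driven tokio task looks roughly like this:

```rust
// Illustrative sketch of an interval-driven background job (not the actual
// torrust-index implementation; function names are invented).
use std::time::Duration;
use tokio::time::{interval, MissedTickBehavior};

async fn run_importer(every: Duration) {
    let mut ticker = interval(every); // e.g. Duration::from_millis(100) today
    // Skip missed ticks instead of bursting to catch up; with a 100 ms interval
    // there is very little idle time between runs on a small droplet.
    ticker.set_missed_tick_behavior(MissedTickBehavior::Skip);

    loop {
        ticker.tick().await;
        import_batch().await; // query the tracker API and update the Index DB
    }
}

async fn import_batch() {
    // placeholder for the real import logic
}
```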

cc @da2ce7

josecelano commented 6 months ago

On my PC, with the current interval (100 milliseconds):

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1342564 josecel+  20   0 2355216  38912  29440 S  20.2   0.1   1:26.22 torrust-index

With 1000 milliseconds:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1346972 josecel+  20   0 2355216  38656  29696 S   2.3   0.1   0:03.38 torrust-index

With 2000 milliseconds:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1348942 josecel+  20   0 2353168  37888  29952 S   1.3   0.1   0:00.59 torrust-index

With 3000 milliseconds:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1350156 josecel+  20   0 2353288  37632  29696 S   1.0   0.1   0:00.61 torrust-index

With 4000 milliseconds:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1351449 josecel+  20   0 2355216  38400  29440 S   0.7   0.1   0:01.23 torrust-index

With 5000 milliseconds:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1353683 josecel+  20   0 2355224  37632  29440 S   1.0   0.1   0:02.17 torrust-index

| Interval (secs) | Torrents imported per hour |
|---|---|
| 1 | 50 * 3600 = 180000 |
| 2 | 50 * (3600/2) = 90000 |
| 3 | 50 * (3600/3) = 60000 |
| 4 | 50 * (3600/4) = 45000 |
| 5 | 50 * (3600/5) = 36000 |
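
For reference, the arithmetic behind the table as a tiny sketch; the 50-torrents-per-run batch size comes from the table above, and the function name is mine:

```rust
// Torrents imported per hour for a given execution interval, assuming a batch
// of 50 torrents per run as in the table above.
fn imports_per_hour(interval_secs: u64, batch_size: u64) -> u64 {
    batch_size * (3600 / interval_secs)
}

fn main() {
    for secs in 1..=5 {
        println!("{secs} sec -> {} torrents/hour", imports_per_hour(secs, 50));
    }
}
```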
josecelano commented 6 months ago

I deployed a new version with the increased interval 2 hours ago, and it hasn't crashed:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   3721 torrust   20   0  750016 499496      0 S  20.5  50.9  23:08.13 torrust-tracker
   3867 torrust   20   0  577576  47760   6164 S   6.3   4.9   6:18.35 torrust-index

[screenshots]

josecelano commented 6 months ago

The problem is not solved yet :-/:

[screenshots]

josecelano commented 6 months ago

It looks like the problem is memory consumption. The memory consumption increases until the server starts swapping. I think the problem could be:

  1. The new tracker statistics importer (even with the new 2-second interval).
  2. The new torrent repository implementation with the SkipMap. I don't think so because, in the previous implementation with BTreeMap, we were not limiting the memory consumption either.
  3. The tracker is receiving too many requests, so we need to limit the number of requests or the memory consumption.

The good thing is that the service restarts with the docker healthcheck after 2 hours. Looking at the docker containers, only the tracker was restarted, so I guess the problem is with the tracker (option 2 or 3):

docker ps
CONTAINER ID   IMAGE                       COMMAND                  CREATED        STATUS                    PORTS                                                                                                                                       NAMES
ee67d0951541   nginx:mainline-alpine       "/docker-entrypoint.…"   14 hours ago   Up 14 hours               0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp                                                                    proxy
31c65c66be26   torrust/index-gui:develop   "/usr/local/bin/entr…"   14 hours ago   Up 14 hours (healthy)     0.0.0.0:3000->3000/tcp, :::3000->3000/tcp                                                                                                   index-gui
a8b2e8977c5c   torrust/index:develop       "/usr/local/bin/entr…"   14 hours ago   Up 14 hours (healthy)     0.0.0.0:3001->3001/tcp, :::3001->3001/tcp                                                                                                   index
0cc1eefec6b6   torrust/tracker:develop     "/usr/local/bin/entr…"   14 hours ago   Up 10 minutes (healthy)   0.0.0.0:1212->1212/tcp, :::1212->1212/tcp, 0.0.0.0:7070->7070/tcp, :::7070->7070/tcp, 1313/tcp, 0.0.0.0:6969->6969/udp, :::6969->6969/udp   tracker

First, I'm going to try switching back to the previous repository implementation with the BTreeMap (there could be some deadlock with the SkipMap). If we still have the problem, we need to limit requests and/or memory consumption.

On the other hand, I think we should improve the container healthchecks. We should not wait for 2 hours to restart the container.

josecelano commented 6 months ago

I've opened another issue to discover why there are so many zombie processes. I think there is an easy way to know whether the healthchecks cause the crash on the server or not: I can run a private server with the same resources but without requests.

Another thing I could check is the number of requests the tracker is handling. Maybe the number of requests per second has increased a lot, and the problem is that the server can't handle those requests. In that case, the only solutions would be:

  1. Limit the number of requests (reject requests when the server is busy); see the sketch below.
  2. Limit the memory consumption (we try to handle requests but remove torrents/peers when the torrent repository is full).
  3. Scale up the server when CPU or memory usage goes over 85%.
  4. All of the above.
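
A minimal sketch of what option 1 could look like with a tokio semaphore; none of these names come from the tracker codebase:

```rust
// Sketch of rejecting requests when the server is busy (option 1). The cap and
// function names are invented for illustration.
use std::sync::Arc;
use tokio::sync::Semaphore;

const MAX_IN_FLIGHT: usize = 1_000; // tune to the droplet's capacity

fn new_limiter() -> Arc<Semaphore> {
    Arc::new(Semaphore::new(MAX_IN_FLIGHT))
}

async fn handle_with_limit(limiter: Arc<Semaphore>) -> Result<(), &'static str> {
    // try_acquire() fails immediately instead of queueing, so the server sheds
    // load rather than accumulating work (and memory) when it is saturated.
    match limiter.try_acquire() {
        Ok(_permit) => {
            handle_announce().await; // the permit is released when it is dropped
            Ok(())
        }
        Err(_) => Err("server busy, request rejected"),
    }
}

async fn handle_announce() {
    // placeholder for the real announce-handling logic
}
```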
josecelano commented 6 months ago

5 minutes after restarting the tracker container, I checked the stats:

udp4_connections_handled: 44661,
udp4_announces_handled: 25175,
udp4_scrapes_handled: 652,

The tracker is handling 234.96 req/sec. I think, in this case, the server could simply be busy, and we need to implement one of the solutions above (1 to 3).
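
The req/sec figures in this thread are just the handled-request counters divided by the elapsed time; a quick sketch of that calculation (function names are mine):

```rust
// Requests per second from the tracker counters and the elapsed uptime.
fn requests_per_second(connections: u64, announces: u64, scrapes: u64, uptime_secs: u64) -> f64 {
    (connections + announces + scrapes) as f64 / uptime_secs as f64
}

fn main() {
    // The counters above, 5 minutes (300 s) after restarting the container.
    let rps = requests_per_second(44_661, 25_175, 652, 300);
    println!("{rps:.2} req/sec"); // 234.96
}
```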

cc @da2ce7

UPDATE:

Those are only the tracker client requests. We also have tracker API, Index, stats importer, ...

josecelano commented 6 months ago

It looks like the period without restarting is longer during the night. Maybe it's because there are fewer tracker requests.

[screenshot]

Now there are about 350 req/sec.

In 2 hours:

udp4_connections_handled: 824422,
udp4_announces_handled: 1646227,
udp4_scrapes_handled: 55240,

2525889 total requests in 7200 sec -> 350.817916667 req/sec
josecelano commented 6 months ago

> It looks like the problem is memory consumption. The memory consumption increases until the server starts swapping. I think the problem could be:
>
> 1. The new tracker statistics importer (even with the new 2-second interval).
> 2. The new torrent repository implementation with the SkipMap. I don't think so because, in the previous implementation with BTreeMap, we were not limiting the memory consumption either.
> 3. The tracker is receiving too many requests, so we need to limit the number of requests or the memory consumption.
>
> The good thing is that the service restarts with the docker healthcheck after 2 hours. Looking at the docker containers, only the tracker was restarted, so I guess the problem is with the tracker (option 2 or 3).

I've been thinking again about why this is happening now. Regarding the SkipMap, maybe it is consuming more memory than the BTreeMap, and that is making the container restart at the current load level.

Regarding the load level, it would be nice to have historical statistics. I would like to know whether we have more requests than one or two weeks ago. That would help us understand whether the problem is simply that we have more requests and we need to resize the server.

I'm going to resize the droplet from 1GB of memory ($6) to 2GB ($12) to see what happens:

[screenshot]

josecelano commented 6 months ago

Just after resizing the droplet:

[screenshot]

josecelano commented 6 months ago

The server was restarted at 15:45 approx (now 17:43):

[screenshot]

After restarting the server, the tracker has handled:

udp4_connections_handled: 1022134,
udp4_announces_handled: 1864182,
udp4_scrapes_handled: 63645,

2949961 total requests in 2 hours -> 409.716805556 req/sec

It used to crash at 25 req/sec with the previous instance size.

Stats after two hours running the tracker:

{
  "torrents": 196455,
  "seeders": 120032,
  "completed": 1241,
  "leechers": 200406,
  "tcp4_connections_handled": 0,
  "tcp4_announces_handled": 0,
  "tcp4_scrapes_handled": 0,
  "tcp6_connections_handled": 0,
  "tcp6_announces_handled": 0,
  "tcp6_scrapes_handled": 0,
  "udp4_connections_handled": 1022134,
  "udp4_announces_handled": 1864182,
  "udp4_scrapes_handled": 63645,
  "udp6_connections_handled": 0,
  "udp6_announces_handled": 0,
  "udp6_scrapes_handled": 0
}
josecelano commented 6 months ago

22:26

[screenshot]

{
  "torrents": 344237,
  "seeders": 256095,
  "completed": 4079,
  "leechers": 420604,
  "tcp4_connections_handled": 0,
  "tcp4_announces_handled": 0,
  "tcp4_scrapes_handled": 0,
  "tcp6_connections_handled": 0,
  "tcp6_announces_handled": 0,
  "tcp6_scrapes_handled": 0,
  "udp4_connections_handled": 3313661,
  "udp4_announces_handled": 7467544,
  "udp4_scrapes_handled": 236375,
  "udp6_connections_handled": 0,
  "udp6_announces_handled": 0,
  "udp6_scrapes_handled": 0
}
From 17:43 to 22:28 -> 17100 sec

"udp4_connections_handled": 3313661,
"udp4_announces_handled": 7467544,
"udp4_scrapes_handled": 236375,

11017580 total requests in 17100 sec -> 644.302923977 req/sec
josecelano commented 6 months ago

08:08.

The tracker has been restarted again.

24h graph:

[screenshot]

Last-hour graph:

[screenshot]

I wish I could see how many requests it was handling before restarting. I want to know how many req/sec this instance can handle. It would be nice to collect statistics every 5 minutes. Maybe I can write a simple script to import them and update a CSV file until we implement something like this. I can run it on the server.
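
A rough sketch of that script, polling the tracker API stats endpoint on port 1212 every 5 minutes and appending a timestamped line to a file. The endpoint path and token value are assumptions and would need to match the real deployment:

```rust
// Sketch of a stats-collection loop. Endpoint path and token are assumed;
// uses the `reqwest` (blocking) and `chrono` crates.
use std::{fs::OpenOptions, io::Write, thread, time::Duration};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let url = "http://127.0.0.1:1212/api/v1/stats?token=MyAccessToken"; // assumed
    loop {
        let body = reqwest::blocking::get(url)?.text()?;
        let mut file = OpenOptions::new().create(true).append(true).open("tracker-stats.tsv")?;
        // One line per sample: timestamp, then the raw JSON. A later step can
        // parse out the udp4_* counters into proper CSV columns.
        writeln!(file, "{}\t{}", chrono::Utc::now().to_rfc3339(), body.replace('\n', " "))?;
        thread::sleep(Duration::from_secs(300)); // every 5 minutes
    }
}
```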

josecelano commented 6 months ago

The tracker was restarted after running from 8:08 to 1:25.

Stats at 22:30:

Req/sec: 505.0198659
Seeders + leechers = 1,390,088
{
  "torrents": 521222,
  "seeders": 497614,
  "completed": 8410,
  "leechers": 892474,
  "tcp4_connections_handled": 0,
  "tcp4_announces_handled": 0,
  "tcp4_scrapes_handled": 0,
  "tcp6_connections_handled": 0,
  "tcp6_announces_handled": 0,
  "tcp6_scrapes_handled": 0,
  "udp4_connections_handled": 7924576,
  "udp4_announces_handled": 17939819,
  "udp4_scrapes_handled": 497642,
  "udp6_connections_handled": 0,
  "udp6_announces_handled": 0,
  "udp6_scrapes_handled": 0
}
top - 21:25:22 up 1 day,  6:42,  1 user,  load average: 2.73, 2.37, 2.19
Tasks: 141 total,   2 running, 128 sleeping,   0 stopped,  11 zombie
%Cpu(s): 24.2 us, 28.8 sy,  0.0 ni,  1.0 id, 37.4 wa,  0.0 hi,  7.0 si,  1.7 st
MiB Mem :   1963.9 total,     93.5 free,   1758.5 used,    111.9 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.     69.3 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 683491 torrust   20   0 1733056   1.4g   3300 S  17.6  72.3 177:40.05 torrust-tracker
 683460 root      20   0  719640   3784      0 S  15.3   0.2 135:15.98 containerd-shim
    724 root      20   0 1537756  46328   2360 S  14.3   2.3 249:15.92 dockerd
   1296 torrust   20   0  582848  48888   5772 S   4.7   2.4  72:18.78 torrust-index

[screenshot]

I think I can confirm the problem: the number of peers and torrents increases, consuming more memory until the server starts swapping too much and the container is restarted due to the healthcheck.

I will open a discussion on the Tracker repo. We could limit resource consumption by limiting the requests or the memory, but that would lead to worse responses. In a production environment, I guess we should just monitor the resources and scale up. Maybe we could try a mixed solution:

  1. Limit resource consumption to avoid restarting the service (see the sketch below).
  2. But monitor it, so we can scale up before the service is degraded.

To monitor this better, it would be nice to have more statistics and show some graphs.
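
As a sketch of what point 1 could mean in practice, here is one way to bound the torrent repository's memory by evicting the least-recently-announced torrents once a cap is exceeded. All the names are invented for illustration; the real tracker repository (a SkipMap keyed by info-hash, holding peer lists) is more involved:

```rust
// Sketch of a capacity-bounded torrent repository (not the tracker's actual
// data structure). When the cap is exceeded, the stalest torrents are evicted
// so memory stays bounded instead of growing until the host starts swapping.
use std::collections::{BTreeSet, HashMap};

type InfoHash = [u8; 20];

struct BoundedTorrentRepository {
    max_torrents: usize,
    last_announce: HashMap<InfoHash, u64>, // info-hash -> last announce (unix secs)
    by_age: BTreeSet<(u64, InfoHash)>,     // ordered oldest-first for eviction
}

impl BoundedTorrentRepository {
    fn new(max_torrents: usize) -> Self {
        Self { max_torrents, last_announce: HashMap::new(), by_age: BTreeSet::new() }
    }

    fn upsert(&mut self, info_hash: InfoHash, now: u64) {
        if let Some(prev) = self.last_announce.insert(info_hash, now) {
            self.by_age.remove(&(prev, info_hash));
        }
        self.by_age.insert((now, info_hash));

        // Evict the least-recently-announced torrents once over the cap.
        while self.last_announce.len() > self.max_torrents {
            match self.by_age.iter().next().copied() {
                Some((ts, victim)) => {
                    self.by_age.remove(&(ts, victim));
                    self.last_announce.remove(&victim);
                }
                None => break,
            }
        }
    }
}
```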

josecelano commented 6 months ago

Server resized to:

[screenshot]

josecelano commented 6 months ago

On the 20th of April, the server was resized.

[screenshot]

{
  "torrents": 1236779,
  "seeders": 1460674,
  "completed": 26980,
  "leechers": 2691284,
  "tcp4_connections_handled": 0,
  "tcp4_announces_handled": 0,
  "tcp4_scrapes_handled": 0,
  "tcp6_connections_handled": 0,
  "tcp6_announces_handled": 0,
  "tcp6_scrapes_handled": 0,
  "udp4_connections_handled": 24167291,
  "udp4_announces_handled": 58160995,
  "udp4_scrapes_handled": 1485013,
  "udp6_connections_handled": 0,
  "udp6_announces_handled": 0,
  "udp6_scrapes_handled": 0
}

501.75 req/sec

josecelano commented 6 months ago

The new server was running (without restarting the tracker) from the 29th of April at 8:10 am to the 2nd of May at 5:30 am.

[screenshot]