On my PC, with the current interval (100 milliseconds):
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1342564 josecel+ 20 0 2355216 38912 29440 S 20.2 0.1 1:26.22 torrust-index
With 1000 milliseconds:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1346972 josecel+ 20 0 2355216 38656 29696 S 2.3 0.1 0:03.38 torrust-index
With 2000 milliseconds:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1348942 josecel+ 20 0 2353168 37888 29952 S 1.3 0.1 0:00.59 torrust-index
With 3000 milliseconds:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1350156 josecel+ 20 0 2353288 37632 29696 S 1.0 0.1 0:00.61 torrust-index
With 4000 milliseconds:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1351449 josecel+ 20 0 2355216 38400 29440 S 0.7 0.1 0:01.23 torrust-index
With 5000 milliseconds:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1353683 josecel+ 20 0 2355224 37632 29440 S 1.0 0.1 0:02.17 torrust-index
| Interval (secs) | Number of torrents imported per hour |
|---|---|
| 1 sec | 50 * 3600 = 180000 |
| 2 sec | 50 * (3600/2) = 90000 |
| 3 sec | 50 * (3600/3) = 60000 |
| 4 sec | 50 * (3600/4) = 45000 |
| 5 sec | 50 * (3600/5) = 36000 |
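For reference, a minimal Python sketch that reproduces the arithmetic in the table above, assuming the importer processes a batch of 50 torrents per execution (the factor already used in the table):

```python
# Reproduce the table above: torrents imported per hour for a given interval,
# assuming a batch of 50 torrents is processed per execution.
TORRENTS_PER_EXECUTION = 50  # assumption taken from the table above

def torrents_imported_per_hour(interval_secs: int) -> int:
    executions_per_hour = 3600 // interval_secs
    return TORRENTS_PER_EXECUTION * executions_per_hour

for interval in (1, 2, 3, 4, 5):
    print(f"{interval} sec -> {torrents_imported_per_hour(interval)} torrents/hour")
```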
I deployed a new version with the increased interval 2 hours ago, and it hasn't crashed:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3721 torrust 20 0 750016 499496 0 S 20.5 50.9 23:08.13 torrust-tracker
3867 torrust 20 0 577576 47760 6164 S 6.3 4.9 6:18.35 torrust-index
The problem is not solved yet :-/:
It looks like the problem is memory consumption: memory usage keeps increasing until the server starts swapping. I think the cause could be:

1. The new tracker statistics importer (even with the new 2-second interval).
2. The new torrent repository implementation with the SkipMap. I don't think so because, in the previous implementation with the BTreeMap, we were not limiting the memory consumption either.
3. The tracker is receiving too many requests, so we need to limit the number of requests or the memory consumption.

The good thing is that the docker healthcheck restarts the service after 2 hours. Looking at the docker containers, only the tracker was restarted, so I guess the problem is with the tracker (option 2 or 3):
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ee67d0951541 nginx:mainline-alpine "/docker-entrypoint.…" 14 hours ago Up 14 hours 0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp proxy
31c65c66be26 torrust/index-gui:develop "/usr/local/bin/entr…" 14 hours ago Up 14 hours (healthy) 0.0.0.0:3000->3000/tcp, :::3000->3000/tcp index-gui
a8b2e8977c5c torrust/index:develop "/usr/local/bin/entr…" 14 hours ago Up 14 hours (healthy) 0.0.0.0:3001->3001/tcp, :::3001->3001/tcp index
0cc1eefec6b6 torrust/tracker:develop "/usr/local/bin/entr…" 14 hours ago Up 10 minutes (healthy) 0.0.0.0:1212->1212/tcp, :::1212->1212/tcp, 0.0.0.0:7070->7070/tcp, :::7070->7070/tcp, 1313/tcp, 0.0.0.0:6969->6969/udp, :::6969->6969/udp tracker
First, I'm going to try switching back to the previous repository implementation with the BTreeMap (there could be some deadlock with the SkipMap). If we still have the problem, we need to limit requests and/or memory consumption.
On the other hand, I think we should improve the container healthchecks. We should not wait for 2 hours to restart the container.
I've opened another issue to find out why there are so many zombie processes. I think there is an easy way to know whether the healthchecks cause the crash on the server or not: I can run a private server with the same resources but without any requests.
Another thing I could check is the number of requests the tracker is handling. Maybe the number of requests per second has increased a lot, and the problem is the server can't handle those requests. In that case, the only solution would be:
5 minutes after restarting the tracker container, I checked the stats:
udp4_connections_handled: 44661,
udp4_announces_handled: 25175,
udp4_scrapes_handled: 652,
The tracker is handling 234.96 req/sec. I think, in this case, the server could be simply busy, and we need to implement one of the solutions above (1 to 3).
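For clarity, this is how that figure is derived: the three UDP counters are added together and divided by the elapsed time. A small Python sketch of the same arithmetic:

```python
# Requests per second: connect + announce + scrape requests handled,
# divided by the elapsed time in seconds.
def requests_per_sec(connections: int, announces: int, scrapes: int, elapsed_secs: int) -> float:
    return (connections + announces + scrapes) / elapsed_secs

# Stats collected 5 minutes after restarting the tracker container:
print(requests_per_sec(44661, 25175, 652, 5 * 60))  # -> 234.96
```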
cc @da2ce7
UPDATE:
Those are only the tracker client requests. We also have tracker API, Index, stats importer, ...
It looks like the period without restarting is longer during the night. Maybe it's because there are fewer tracker requests.
Now there are 350 req/sec. In 2 hours:
udp4_connections_handled: 824422,
udp4_announces_handled: 1646227,
udp4_scrapes_handled: 55240,
2525889 total requests in 2 hours -> 350.817916667 req/sec
I've been thinking again about why this is happening now. Regarding the SkipMap, maybe it is consuming more memory than the BTreeMap, and that's making the container restart at the current load level.
Regarding the load level, it would be nice to have historical statistics. I would like to know if we have more requests than one or two weeks ago. That would help us understand whether the problem is simply that we have more requests and we need to resize the server.
I'm going to resize the droplet from 1 GB of memory ($6) to 2 GB ($12) to see what happens:
Just after resizing the droplet:
The server was restarted at approximately 15:45 (it is now 17:43):
After restarting the server, the tracker has handled:
udp4_connections_handled: 1022134,
udp4_announces_handled: 1864182,
udp4_scrapes_handled: 63645,
2949961 total requests in 2 hours -> 409.716805556 req/sec
It used to crash at 25 req/sec with the previous instance size.
Stats after two hours running the tracker:
{
"torrents": 196455,
"seeders": 120032,
"completed": 1241,
"leechers": 200406,
"tcp4_connections_handled": 0,
"tcp4_announces_handled": 0,
"tcp4_scrapes_handled": 0,
"tcp6_connections_handled": 0,
"tcp6_announces_handled": 0,
"tcp6_scrapes_handled": 0,
"udp4_connections_handled": 1022134,
"udp4_announces_handled": 1864182,
"udp4_scrapes_handled": 63645,
"udp6_connections_handled": 0,
"udp6_announces_handled": 0,
"udp6_scrapes_handled": 0
}
22:26
{
"torrents": 344237,
"seeders": 256095,
"completed": 4079,
"leechers": 420604,
"tcp4_connections_handled": 0,
"tcp4_announces_handled": 0,
"tcp4_scrapes_handled": 0,
"tcp6_connections_handled": 0,
"tcp6_announces_handled": 0,
"tcp6_scrapes_handled": 0,
"udp4_connections_handled": 3313661,
"udp4_announces_handled": 7467544,
"udp4_scrapes_handled": 236375,
"udp6_connections_handled": 0,
"udp6_announces_handled": 0,
"udp6_scrapes_handled": 0
}
From 17:43 to 22:28 -> 17100 sec
"udp4_connections_handled": 3313661,
"udp4_announces_handled": 7467544,
"udp4_scrapes_handled": 236375,
11017580 total requests in 17100 sec -> 644.302923977 req/sec
08:08.
The tracker has been restarted again.
24h graph:
Last-hour graph:
I would like to see how many requests it was handling right before it restarted. I want to know how many req/sec this instance can handle. It would be nice to collect statistics every 5 minutes. Maybe I can write a simple script to import them and update a CSV file until we implement something like this. I can run it on the server.
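Something like the following could work as that quick script. This is only a rough sketch: the API path `/api/v1/stats` and the `TRACKER_API_TOKEN` environment variable are assumptions to be adapted to the actual tracker API setup; the port 1212 is the one exposed by the tracker container above.

```python
# Rough sketch: poll the tracker stats endpoint every 5 minutes and append
# a row to a CSV file. URL path and token handling are assumptions.
import csv
import json
import os
import time
from datetime import datetime, timezone
from urllib.request import urlopen

STATS_URL = f"http://127.0.0.1:1212/api/v1/stats?token={os.environ['TRACKER_API_TOKEN']}"
CSV_FILE = "tracker-stats.csv"
FIELDS = [
    "timestamp", "torrents", "seeders", "leechers",
    "udp4_connections_handled", "udp4_announces_handled", "udp4_scrapes_handled",
]

def collect_once() -> None:
    # Fetch the current stats and append them as one CSV row.
    with urlopen(STATS_URL) as response:
        stats = json.load(response)
    row = {"timestamp": datetime.now(timezone.utc).isoformat()}
    row.update({field: stats[field] for field in FIELDS[1:]})
    write_header = not os.path.exists(CSV_FILE)
    with open(CSV_FILE, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)

if __name__ == "__main__":
    while True:
        collect_once()
        time.sleep(5 * 60)  # collect stats every 5 minutes
```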
The tracker was restarted after running from 8:08 to 1:25.
Stats at 22:30:
Req/sec: 505.0198659
Seeders + leechers = 1,390,088
{
"torrents": 521222,
"seeders": 497614,
"completed": 8410,
"leechers": 892474,
"tcp4_connections_handled": 0,
"tcp4_announces_handled": 0,
"tcp4_scrapes_handled": 0,
"tcp6_connections_handled": 0,
"tcp6_announces_handled": 0,
"tcp6_scrapes_handled": 0,
"udp4_connections_handled": 7924576,
"udp4_announces_handled": 17939819,
"udp4_scrapes_handled": 497642,
"udp6_connections_handled": 0,
"udp6_announces_handled": 0,
"udp6_scrapes_handled": 0
}
top - 21:25:22 up 1 day, 6:42, 1 user, load average: 2.73, 2.37, 2.19
Tasks: 141 total, 2 running, 128 sleeping, 0 stopped, 11 zombie
%Cpu(s): 24.2 us, 28.8 sy, 0.0 ni, 1.0 id, 37.4 wa, 0.0 hi, 7.0 si, 1.7 st
MiB Mem : 1963.9 total, 93.5 free, 1758.5 used, 111.9 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 69.3 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
683491 torrust 20 0 1733056 1.4g 3300 S 17.6 72.3 177:40.05 torrust-tracker
683460 root 20 0 719640 3784 0 S 15.3 0.2 135:15.98 containerd-shim
724 root 20 0 1537756 46328 2360 S 14.3 2.3 249:15.92 dockerd
1296 torrust 20 0 582848 48888 5772 S 4.7 2.4 72:18.78 torrust-index
I think I can confirm the problem: the number of peers and torrents keeps increasing, consuming more memory until the server starts swapping too much and the container is restarted due to the healthcheck.
I will open a discussion on the Tracker repo. We could limit resource consumption by limiting requests or memory, but that would lead to worse responses. In a production environment, I guess we should just monitor the resources and scale up. Maybe we could try a mixed solution:
To monitor this better, it would be nice to have more statistics and to show some graphs.
Server resized to:
On the 20th of April, the server was resized.
{
"torrents": 1236779,
"seeders": 1460674,
"completed": 26980,
"leechers": 2691284,
"tcp4_connections_handled": 0,
"tcp4_announces_handled": 0,
"tcp4_scrapes_handled": 0,
"tcp6_connections_handled": 0,
"tcp6_announces_handled": 0,
"tcp6_scrapes_handled": 0,
"udp4_connections_handled": 24167291,
"udp4_announces_handled": 58160995,
"udp4_scrapes_handled": 1485013,
"udp6_connections_handled": 0,
"udp6_announces_handled": 0,
"udp6_scrapes_handled": 0
}
501.75 req/sec
The new server was running (without restarting the tracker) from the 29th of April at 8:10 am to the 2nd of May at 5:30 am.
The droplet has started crashing, probably after the last update.
Just after rebooting:
After a while:
It looks like the Index is using a lot of CPU.
I don't know the reason yet. I think this recent PR could cause it:
https://github.com/torrust/torrust-index/pull/530
Maybe the execution interval (100 milliseconds) is too short for this machine.
cc @da2ce7