tarantool / cartridge

Out-of-the-box cluster manager for Tarantool with a modern web UI
https://www.tarantool.io/en/cartridge/

Memory Leak? #1840

Closed. jadawil closed this issue 1 year ago.

jadawil commented 2 years ago

Following the example at https://www.tarantool.io/en/doc/latest/getting_started/getting_started_cartridge/ I set up two replicasets of 3 nodes, one for storage and one for routing. Over a period of 24 hours I watched the memory of every process grow from 20 MB to over 100 MB, and it looks like it will keep growing.

In our production environment we see some nodes grow in memory until they eventually crash and restart at around 1.5 GB.

Seen on macOS and Alpine, with Tarantool 2.8 and 2.10 and Cartridge 2.7.3 and 2.7.4.
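For reference, the setup follows the getting-started template, so the entry point is roughly the init.lua sketched below. This is only a sketch: the custom role name app.roles.custom is the template default, and the actual topology (which instances form the storage and router replicasets) is configured afterwards through the web UI or cartridge-cli.

```lua
-- Minimal sketch of a tutorial-style init.lua for the test cluster.
-- Only the role list matters here; everything else is template boilerplate.
local cartridge = require('cartridge')

local ok, err = cartridge.cfg({
    roles = {
        'cartridge.roles.vshard-storage',  -- enabled on the storage replicaset
        'cartridge.roles.vshard-router',   -- enabled on the router replicaset
        'app.roles.custom',                -- the template's placeholder role
    },
})

assert(ok, tostring(err))
```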

filonenko-mikhail commented 2 years ago

On the first point (the local reproduction):

Env:

After 24 hours of uptime all nodes are around 120 MB RSS, so I think I have to wait another 24 hours to see whether it is still increasing.

On the second point (the production crashes):

> In our production environment we see some nodes grow in memory until they eventually crash and restart at around 1.5 GB.

Could you please provide some artifacts about the crashes:

Please also check your Lua code: it is easy to accidentally trigger a full scan with a construction like box.space...:select(nil).
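For illustration, a minimal sketch of the difference (the space name customers and the keys are made up):

```lua
-- Hypothetical space; substitute your own schema.
local customers = box.space.customers

-- Full scan: no key is given, so every tuple in the space is visited.
-- On a large space this is slow and allocates a lot of memory at once.
local everything = customers:select(nil)

-- Bounded alternatives: look up by key, or cap the result explicitly.
local one  = customers:select({1})                -- primary-key lookup
local page = customers:select(nil, {limit = 100}) -- at most 100 tuples

-- pairs() without a key iterates the whole space as well.
for _, tuple in customers:pairs() do
    -- process tuple
end
```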

filonenko-mikhail commented 2 years ago

After several days of uptime, memory usage is around 200 MB for every instance.

Running collectgarbage() twice in a row has no effect.
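For what it's worth, here is a sketch of how to check whether the growth lives in the Lua heap at all (it does not in this case, since forcing the GC changes nothing):

```lua
-- Lua heap size (in KB) before and after a forced full GC cycle.
local before_kb = collectgarbage('count')

-- Run a full collection twice: the first pass runs finalizers,
-- the second reclaims whatever they released.
collectgarbage('collect')
collectgarbage('collect')

local after_kb = collectgarbage('count')
print(('Lua heap: %.1f KB -> %.1f KB'):format(before_kb, after_kb))

-- If the process RSS keeps growing while the Lua heap stays flat,
-- the leak is outside the Lua GC (fiber regions, the tuple arena, etc.).
```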

Continue watching.

jadawil commented 2 years ago

Hey,

I'll try to get you some more information on this later in the week. What I can tell you is that we don't appear to be having this issue on all nodes in our production environment; we are seeing the worst cases on our router nodes, with some increasing by 150 MB in 24 hours. We have three different roles for our router nodes, and the role we are seeing the worst leaks on just provides a basic API that calls the storage nodes via vshard.router.call. The other router roles also use vshard.router.call, but at a much lower rate. Maybe there is something in that?
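For context, the suspect role's handlers look roughly like the sketch below. The function names, the bucket key, and the remote storage_get_customer function are invented for illustration; only the vshard.router.call* pattern matches what we actually do.

```lua
local vshard = require('vshard')

-- One API handler on the router: map the request key to a bucket and
-- forward the call to the storage replicaset that owns that bucket.
local function get_customer(customer_id)
    local bucket_id = vshard.router.bucket_id_strcrc32(tostring(customer_id))
    local res, err = vshard.router.callro(bucket_id, 'storage_get_customer',
        {customer_id}, {timeout = 1})
    if err ~= nil then
        return nil, err
    end
    return res
end

-- Every incoming request on the leaking role goes through a call like this,
-- so this router issues vshard.router.call* requests at a high rate.
return {
    get_customer = get_customer,
}
```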

filonenko-mikhail commented 2 years ago

Yes, it can be in some role code. Could you please share some of the API code, if possible?

Also, please check for uses of box.space...:select() or box.space...:pairs() without a key (or with a nil key), because such a call is a full scan.

jadawil commented 2 years ago

Here are the RSS graphs for two sets of router nodes; they have different roles but show the same gradual increase in memory.

[Screenshot, 2022-06-22: RSS graphs for the two sets of router nodes]

I can't share any code or logs, but I'll try to put together another test example soon. There are no full space scans happening; I was just suggesting that there could be an issue with the vshard router calls.

jadawil commented 2 years ago

Looking at fiber.info() for one of the router nodes above, these applier fibers seem to be occupying a large amount of memory, especially considering these are just routers. I'll keep an eye on them to see if they are increasing.

133: csw: 1192416 memory: total: 129048952 used: 124010976 time: 0 name: applier/admin@{URL} fid: 133

134: csw: 1192409 memory: total: 129044856 used: 124009312 time: 0 name: applier/admin@{URL} fid: 134
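For reference, the figures above come from fiber.info(); a quick sketch to list the fibers holding the most region memory on an instance:

```lua
local fiber = require('fiber')

-- Sort fibers by the amount of region memory they currently hold,
-- so the heaviest ones (here, the appliers) come out on top.
local fibers = {}
for fid, info in pairs(fiber.info()) do
    table.insert(fibers, {fid = fid, name = info.name, used = info.memory.used})
end
table.sort(fibers, function(a, b) return a.used > b.used end)

for i = 1, math.min(5, #fibers) do
    local f = fibers[i]
    print(('fid %d  %-30s  %d bytes used'):format(f.fid, f.name, f.used))
end
```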

jadawil commented 2 years ago

After 3 days

133: csw: 1464339 memory: total: 158359928 used: 152290968 time: 0 name: applier/admin@{URL} fid: 133

134: csw: 1464325 memory: total: 158355832 used: 152288576 time: 0 name: applier/admin@{URL} fid: 134

After 6 days

133: csw: 1709744 memory: total: 184811896 used: 177813088 time: 0 name: applier/admin@{URL} fid: 133

134: csw: 1709721 memory: total: 184807800 used: 177809760 time: 0 name: applier/admin@{URL} fid: 134

filonenko-mikhail commented 2 years ago

So it seems the applier is the likely place of the memory leak. Is there business traffic in this case, or is it just a freshly started Cartridge cluster without any load?

jadawil commented 2 years ago

In these cases there is business traffic; this is from our prod environment. I have not retested Cartridge with no load to check these fibers. Do you still have yours running?

filonenko-mikhail commented 2 years ago

I still do: every node is near 450 MB RSS.

jadawil commented 2 years ago

OK, but in your case the nodes are doing nothing, so 450 MB is a lot.

filonenko-mikhail commented 2 years ago

And it keeps growing, now 580 MB. I found that the applier fiber uses about 250 MB.

filonenko-mikhail commented 2 years ago

The cause is found. If there are no replication events, only pings, they are collected in the fiber's gc region and the fiber gc is never triggered. This is fixed in 2.10. Could you please update your cluster?

jadawil commented 2 years ago

We already have 2.10 deployed to our dev environment and have observed the same memory increase there. Also, my local testing for this issue was with both 2.10 and 2.8.

It looks like 2.10 was released on 2022-05-22; did the fix you're referring to definitely make it in?

filonenko-mikhail commented 2 years ago

This commit fixes it: https://github.com/tarantool/tarantool/commit/dacbf708f4a4cf0a130a4733982b69955c611718

It seems that 2.10.0-rc1 should already be healthy.
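An easy way to double-check which build each instance actually runs is to look at box.info.version on every node; the console command and URI below are just examples.

```lua
-- In an instance console (e.g. `cartridge enter router-1`):
box.info.version  -- should report 2.10.0 or newer for the applier fix

-- Or remotely (the URI is an example; the user needs execute access):
local netbox = require('net.box')
local conn = netbox.connect('localhost:3301')
print(conn:eval('return box.info.version'))
conn:close()
```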

jadawil commented 2 years ago

OK, we will be deploying 2.10 in the next few weeks, but my original testing for this issue was with Tarantool 2.10 and Cartridge 2.7.4, so that fix does not explain our problem.

filonenko-mikhail commented 2 years ago

Any info?

jadawil commented 2 years ago

Yes, as I said, the original test was two replicasets of 3 nodes, one for storage and one for routing, on Tarantool 2.10, and it still shows the issue.

Are you unable to reproduce it on 2.10?

jadawil commented 2 years ago

Is there anything else we can provide to help diagnose this?

filonenko-mikhail commented 2 years ago

Still nothing, sorry. I will try to reproduce it in the next few days.

AlexandrLitkevich commented 1 year ago

Hi guys, I tried to reproduce the leak described above, but it didn't work out. From the creation of the cluster and over the following 3 days I have not seen any change in memory consumption. CentOS 7, Tarantool 2.10.2-0-gb924f0b, Cartridge 2.12.2.

filonenko-mikhail commented 1 year ago

The ticket's TTL has expired, so I'm closing it as gently as possible. Feel free to reopen.