golemfactory / yagna

An open platform and marketplace for distributed computations
GNU General Public License v3.0

Memory leak during either pinging/finding or receiving proposals. #2955

Open cryptobench opened 8 months ago

cryptobench commented 8 months ago

Overview

During the setup of initial data collection for the reputation system, we noticed that on the first day our EC2 instances, which were collecting the data, completely froze up.

Investigation

I started investigating and initially assumed the cause was the while loops that kept our collection scripts running continuously (which was the goal). So I modified the script to run only once and had systemd run it every 30 seconds, using a timer that waits for the previous run to finish before starting another one.

However, this did not solve the issue. Watching htop, I can see that the memory used by yagna increases over time. When yagna was launched, the EC2 instance was using 300 MB of memory in total. Now, after almost 2 hours, it is using close to 2 GB, and usage is still climbing slowly.
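
To put numbers on this instead of eyeballing htop, the resident memory of the yagna process could be sampled periodically. The following is only a minimal sketch, assuming a Linux host where `/proc/<pid>/status` exposes a `VmRSS` line; the function names are hypothetical and not part of yagna or the collection scripts.

```python
import re
from typing import List, Optional, Tuple


def parse_vm_rss_kib(status_text: str) -> Optional[int]:
    """Extract VmRSS (resident memory, in kiB) from /proc/<pid>/status text."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid: int) -> Optional[int]:
    """Read the current resident memory of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vm_rss_kib(status_file.read())


def growth_mib_per_hour(samples: List[Tuple[float, int]]) -> float:
    """Average growth between the first and last sample.

    samples: (unix_seconds, rss_kib) pairs, oldest first.
    """
    (t0, rss0), (t1, rss1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0  # not enough elapsed time to compute a rate
    return (rss1 - rss0) / 1024 / ((t1 - t0) / 3600)
```

Calling `read_rss_kib` with the yagna PID every minute or so and feeding the collected pairs to `growth_mib_per_hour` would show whether the growth is steady or tied to specific script runs.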

My suspicion is that it's the yagna net ping or yagna net find command that is causing this issue.

Scripts in Use

Uptime checker

https://github.com/golemfactory/reputation-auditor/tree/main/uptime

Here we collect offers from the network to check if a node is offline/online. If we previously received an offer from a node and it didn't send one in the past 30 seconds, then we use yagna net find to confirm if the node is offline or online.
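
The decision rule described above can be sketched roughly as follows. This is an illustration rather than the code from the linked repository: the dictionary of last-seen timestamps and the function name are assumptions made for the example.

```python
from typing import Dict, List

# Offer silence, in seconds, after which a node is re-checked (from the description above).
OFFER_TIMEOUT_SECONDS = 30.0


def nodes_to_recheck(
    last_offer_seen: Dict[str, float],
    now: float,
    timeout: float = OFFER_TIMEOUT_SECONDS,
) -> List[str]:
    """Return node IDs that sent an offer before but have been silent for longer than `timeout`.

    last_offer_seen maps node ID -> unix timestamp of the most recent offer.
    Only nodes in the returned list would need a follow-up `yagna net find`.
    """
    return sorted(
        node_id
        for node_id, seen_at in last_offer_seen.items()
        if now - seen_at > timeout
    )
```

Only the nodes returned here trigger a lookup, so the number of `yagna net find` calls per run depends on how many known nodes went quiet, not on the size of the whole network.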

Ping checker

https://github.com/golemfactory/reputation-auditor/tree/main/ping-checker

Here we simply acquire a list of online nodes from the stats page and use yagna net ping to check the latency between the nodes and us.

Setting Up Data Collection

Each script includes a README, and the systemd config that's used is at the bottom of it. It assumes that you're running the script on an Ubuntu EC2 instance.

nieznanysprawiciel commented 8 months ago

Yagna doesn't have a mechanism to limit the number of connections. Unused connections are closed after some period of time, but here you are trying to establish connections with the whole network in a short period of time.

The first thing I would try is to close the connections made during pinging after each chunk is processed: yagna net disconnect {node_id}
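
A rough sketch of that pattern in Python, in case it helps. The chunk size, the helper names, and the `probe` callback (whatever the script already does to ping a node) are assumptions for illustration; the only CLI call used is the disconnect command mentioned above.

```python
import subprocess
from typing import Callable, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def disconnect_node(node_id: str) -> None:
    """Close the connection to one node via `yagna net disconnect {node_id}`."""
    # check=False: a failed disconnect should not abort the whole run.
    subprocess.run(["yagna", "net", "disconnect", node_id], check=False)


def process_in_chunks(
    node_ids: List[str],
    probe: Callable[[str], None],
    disconnect: Callable[[str], None] = disconnect_node,
    chunk_size: int = 20,  # hypothetical value, tune as needed
) -> None:
    """Probe nodes chunk by chunk, disconnecting each chunk before starting the next."""
    for chunk in chunked(node_ids, chunk_size):
        try:
            for node_id in chunk:
                probe(node_id)
        finally:
            # Runs even if a probe raises, so connections do not pile up.
            for node_id in chunk:
                disconnect(node_id)
```

With this structure, the number of connections opened by the script at any moment is bounded by `chunk_size` rather than by the size of the network.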