Overview

While setting up the initial data collection for the reputation system, we noticed that the EC2 instances collecting the data froze completely on the first day.

Investigation

I initially assumed the cause was the while loops that kept our collection scripts running continuously (which was the goal). So I modified each script to run once and exit, and had systemd run it every 30 seconds using a timer that waits for the previous run to finish before starting the next one.

However, this did not solve the issue. Looking further with htop, I can see that the memory used by yagna increases over time. When yagna was launched, the EC2 instance was using a total of 300 MB of memory. Now, after almost 2 hours, it is using close to 2 GB, and usage keeps climbing slowly.
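To put numbers on the growth instead of reading it off htop, something like the following could be run alongside the collector. It is only a sketch: it assumes a Linux host where `/proc/<pid>/status` exposes a `VmRSS` line, and the helper names (`parse_vm_rss_kb`, `read_rss_kb`, `growth_mb_per_hour`) are made up for illustration and are not part of yagna or the reputation-auditor scripts.

```python
import re
from pathlib import Path
from typing import List, Tuple


def parse_vm_rss_kb(status_text: str) -> int:
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS line found in status text")
    return int(match.group(1))


def read_rss_kb(pid: int) -> int:
    """Read the current resident set size of a process (Linux only)."""
    return parse_vm_rss_kb(Path(f"/proc/{pid}/status").read_text())


def growth_mb_per_hour(samples: List[Tuple[float, int]]) -> float:
    """Average growth between the first and last (unix_seconds, rss_kb) sample."""
    (t0, rss0), (t1, rss1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (rss1 - rss0) / 1024 / ((t1 - t0) / 3600)
```

Calling `read_rss_kb(pid)` with the yagna daemon's PID every minute or so and appending `(time.time(), rss_kb)` to a list would give a per-process curve. As a rough cross-check against the whole-instance figures above, going from about 300 MB to about 2000 MB over two hours works out to roughly 850 MB per hour.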

My suspicion is that either the yagna net ping or the yagna net find command is causing this.
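One way to narrow this down would be to run only one of the suspected commands in a loop and compare the daemon's memory before and after. The sketch below is an assumption about how such a test could look, not something taken from the scripts: `memory_growth_kb` and `command_action` are hypothetical helpers, and the exact command line has to be supplied by whoever runs it.

```python
import subprocess
from typing import Callable, List


def command_action(argv: List[str]) -> Callable[[], None]:
    """Wrap a command line so it can be invoked repeatedly; output is discarded."""
    def run() -> None:
        subprocess.run(argv, check=False, capture_output=True)
    return run


def memory_growth_kb(
    action: Callable[[], None],
    sample_rss_kb: Callable[[], int],
    iterations: int,
) -> int:
    """Run `action` `iterations` times and return the change in sampled memory (kB)."""
    before = sample_rss_kb()
    for _ in range(iterations):
        action()
    return sample_rss_kb() - before
```

Here `sample_rss_kb` would be any zero-argument function returning the yagna daemon's resident memory in kB, and `action` would be `command_action([...])` built from the same ping or find invocation the scripts use. If one command produces steady growth and the other does not, that points at the culprit; if neither does, receiving proposals becomes the more likely source.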

Scripts in Use

Uptime checker

https://github.com/golemfactory/reputation-auditor/tree/main/uptime

Here we collect offers from the network to check whether a node is online or offline. If we previously received an offer from a node and it has not sent one in the past 30 seconds, we use yagna net find to confirm whether the node is actually offline.
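The decision described above can be sketched roughly as follows. This is an illustration of the logic only, not the code in the repository: `nodes_to_verify` is a hypothetical name, and the sketch assumes the script keeps a mapping from node ID to the time of the last offer seen.

```python
from typing import Dict, List


def nodes_to_verify(
    last_offer_at: Dict[str, float],
    now: float,
    timeout_s: float = 30.0,
) -> List[str]:
    """Return the IDs of nodes whose last offer is older than `timeout_s` seconds.

    These are the nodes that would then be checked with `yagna net find`.
    """
    return sorted(
        node_id
        for node_id, seen_at in last_offer_at.items()
        if now - seen_at > timeout_s
    )
```

Only nodes that have gone quiet are passed on to yagna net find, so the number of find calls per run depends on how many providers stopped sending offers, which may matter when comparing memory growth between runs.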

Ping checker

https://github.com/golemfactory/reputation-auditor/tree/main/ping-checker

Here we simply acquire a list of online nodes from the stats page and use yagna net ping to check the latency between the nodes and us.

Setting Up the Data Collection

Each script includes a README, and the systemd config we use is at the bottom of it. The config assumes the script runs on an Ubuntu EC2 instance.

golemfactory / yagna

Memory leak during either pinging/finding or receiving proposals. #2955