You've got a lot of ActivityCreateFailed because of a timeout there.
I've already reported a similar problem on 2022-02-28 (although there's no issue for it).
We need to find out if there is any interference between devnet Providers. Activity creation fails when the ExeUnit takes too long to start.
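For illustration only, here is a minimal sketch of how a slow ExeUnit start turns into a timeout-style ActivityCreateFailed on the requestor side (function names, durations, and structure are hypothetical, not yagna's actual code):

```rust
use std::time::Duration;
use tokio::time::timeout;

// Stand-in for the real ExeUnit startup; on a degraded provider this can
// take longer than the requestor is willing to wait (durations shortened
// here so the example runs quickly).
async fn start_exe_unit() {
    tokio::time::sleep(Duration::from_secs(3)).await;
}

#[tokio::main]
async fn main() {
    // Activity creation is bounded by a timeout; if the ExeUnit isn't up in
    // time, the requestor sees the activity-creation failure.
    match timeout(Duration::from_secs(1), start_exe_unit()).await {
        Ok(()) => println!("activity created"),
        Err(_) => eprintln!("ActivityCreateFailed: ExeUnit did not start in time"),
    }
}
```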
The issue is being closed, to be re-opened when necessary.
I have not encountered the issue since the devnet providers were restarted.
Before then, the investigation showed that the providers were operating under degraded performance conditions. The prevailing issue was long database access times, which delayed every action taken by the provider agent (e.g. activity creation). The provider restart hampered further investigation, which then turned to the following potential problems and symptoms:
exhausted hardware resources
There are multiple daemon and agent binaries running on each devnet machine. Resource utilization graphs in munin showed that the usage during the testing session wasn't out of the ordinary.
dominant CPU usage by the hybrid net providers
Hybrid devnet providers were suspected of taking the most CPU power for challenge validation in P2P communication; requesting tasks simultaneously on the beta and hybrid devnets proved this suspicion false.
hitting the open file descriptor limit
This investigation led to a fix in the erc20 driver where the HTTP client connection pool is re-used by all web3 calls: https://github.com/golemfactory/yagna/pull/1892. The fix mainly targets requestor nodes and will not impact provider nodes as much.
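For context, the general pattern behind that fix is to share one HTTP client (and therefore one connection pool) across all RPC calls instead of building a new client per call, which leaks sockets and can hit the open file descriptor limit. A minimal sketch of that pattern (not the actual erc20 driver code; crate choices and the pool size are assumptions):

```rust
use once_cell::sync::Lazy;
use reqwest::Client;

// One shared client; its internal connection pool is re-used by every call,
// so repeated RPC requests don't keep opening new sockets.
static HTTP_CLIENT: Lazy<Client> = Lazy::new(|| {
    Client::builder()
        .pool_max_idle_per_host(8) // illustrative value
        .build()
        .expect("failed to build HTTP client")
});

// Hypothetical helper: a single JSON-RPC call that borrows the shared client
// instead of constructing a fresh one for each request.
async fn eth_block_number(rpc_url: &str) -> Result<serde_json::Value, reqwest::Error> {
    let body = serde_json::json!({
        "jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1
    });
    HTTP_CLIENT.post(rpc_url).json(&body).send().await?.json().await
}
```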
Currently, provider nodes are behaving correctly.
@mfranciszkiewicz It's happening again.
Side note, but maybe this is important: Goth tests have become less and less stable in recent days (e.g. yapapi nightly), and I have no idea what the cause might be, as there haven't been many code changes.
Maybe this is somehow connected? Like e.g. "For some unknown reason the performance of the central image repository degrades over time, and because of this:
- goth CI tests fail because images are not downloaded (I don't know if this is indeed the case)
- devnet providers have some other problems (e.g. they try to ping the repo for whatever reason, this lasts too long, and causes a timeout)"
Again, I don't have any good reason to believe this is the case, except that there are two weird things happening at the same time, so maybe there's a single weirdness behind them :)
It hasn't happened since the 0.10.1 release. @mfranciszkiewicz is it possible that it was fixed there?
Ping Marek F
@etam @EvilSeeQu-sys There was no specific fix targeting this issue, although it's possible that it has been fixed. I'm closing this issue, to be re-opened when necessary.
Name: blue
yagna version: yagna 0.10.0-rc15 (160bc5a1 2022-03-14 build #206)
OS+lang+version (if applicable): mac + Python 3.9.7 + yapapi 0.9.0a1
yagna_rCURRENT (6).log
blender-yapapi-2022-03-15_12.47.09.log
blender-yapapi-2022-03-15_12.39.26.log