You've got a lot of ActivityCreateFailed because of a timeout there.
I've already reported a similar problem on 2022-02-28 (although there's no issue for it).
We need to find out if there is any interference between devnet Providers. Activity creation fails when the ExeUnit takes too long to start.
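For illustration only, here is a minimal sketch of how a slow ExeUnit start turns into a timeout-style ActivityCreateFailed on the requestor side (function names, durations, and structure are hypothetical, not yagna's actual code):

```rust
use std::time::Duration;
use tokio::time::timeout;

// Stand-in for the real ExeUnit startup; on a degraded provider this can
// take longer than the requestor is willing to wait (durations shortened
// here so the example runs quickly).
async fn start_exe_unit() {
    tokio::time::sleep(Duration::from_secs(3)).await;
}

#[tokio::main]
async fn main() {
    // Activity creation is bounded by a timeout; if the ExeUnit isn't up in
    // time, the requestor sees the activity-creation failure.
    match timeout(Duration::from_secs(1), start_exe_unit()).await {
        Ok(()) => println!("activity created"),
        Err(_) => eprintln!("ActivityCreateFailed: ExeUnit did not start in time"),
    }
}
```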
The issue is being closed, to be re-opened when necessary.
I have not encountered the issue since the devnet providers were restarted.
Before then, the investigation showed that the providers were operating under degraded performance conditions. The prevailing issue was long database access times, which delayed every action taken by the provider agent (e.g. activity creation). The provider restart hampered further investigation, which then turned to the following potential problems and symptoms:
exhausted hardware resources
There are multiple daemon and agent binaries running on each devnet machine. Resource utilization graphs in munin showed that the usage during the testing session wasn't out of the ordinary.
dominant CPU usage by the hybrid net providers
Hybrid devnet providers were suspected of taking the most CPU power for challenge validation in P2P communication; requesting tasks simultaneously on the beta and hybrid devnets proved this suspicion false.
hitting the open file descriptor limit
This investigation led to a fix in the erc20 driver where the HTTP client connection pool is re-used by all web3 calls: https://github.com/golemfactory/yagna/pull/1892. The fix mainly targets requestor nodes and will not impact provider nodes as much.
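For context, the general pattern behind that fix is to share one HTTP client (and therefore one connection pool) across all RPC calls instead of building a new client per call, which leaks sockets and can hit the open file descriptor limit. A minimal sketch of that pattern (not the actual erc20 driver code; crate choices and the pool size are assumptions):

```rust
use once_cell::sync::Lazy;
use reqwest::Client;

// One shared client; its internal connection pool is re-used by every call,
// so repeated RPC requests don't keep opening new sockets.
static HTTP_CLIENT: Lazy<Client> = Lazy::new(|| {
    Client::builder()
        .pool_max_idle_per_host(8) // illustrative value
        .build()
        .expect("failed to build HTTP client")
});

// Hypothetical helper: a single JSON-RPC call that borrows the shared client
// instead of constructing a fresh one for each request.
async fn eth_block_number(rpc_url: &str) -> Result<serde_json::Value, reqwest::Error> {
    let body = serde_json::json!({
        "jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1
    });
    HTTP_CLIENT.post(rpc_url).json(&body).send().await?.json().await
}
```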
Currently, provider nodes are behaving correctly.
@mfranciszkiewicz It's happening again.
Side note, but maybe this is important: Goth tests have become less and less stable in recent days (e.g. yapapi nightly), and I have no idea what the cause might be, as there haven't been many code changes.
Maybe this is somehow connected? Like e.g. "For some unknown reason the performance of the central image repository degrades over time, and because of this:
- goth CI tests fail because images are not downloaded (I don't know if this is indeed the case)
- devnet providers have some other problems (e.g. they try to ping the repo for whatever reason, this lasts too long, and causes a timeout)"
Again, I don't have any good reason to believe this is the case, except that there are two weird things happening at the same time, so maybe there's a single weirdness behind them :)
It hasn't happened since the 0.10.1 release. @mfranciszkiewicz is it possible that it was fixed there?
Ping Marek F
@etam @EvilSeeQu-sys There was no specific fix targeting this issue, although it's possible that it has been fixed. I'm closing this issue, to be re-opened when necessary.
Name: blue
yagna version: yagna 0.10.0-rc15 (160bc5a1 2022-03-14 build #206)
OS+lang+version (if applicable): mac + Python 3.9.7 + yapapi 0.9.0a1
yagna_rCURRENT (6).log
blender-yapapi-2022-03-15_12.47.09.log
blender-yapapi-2022-03-15_12.39.26.log