golemfactory / clay

Golem is creating a global market for computing power.
https://golem.network
GNU General Public License v3.0

hello-gwasm-runner gets stuck after finishing several subtasks #5165

Open marmistrz opened 4 years ago

marmistrz commented 4 years ago

Description

Golem Version: 0.22.2

Golem-Messages version (leave empty if unsure):

Electron version (if used): N/A

OS: Ubuntu 18.04

Branch (if launched from source): develop

Mainnet/Testnet: testnet

Description of the issue:

I set up a subnet of nodes on a single machine that doesn't have a public IP address, as described in this wiki page.

I created 1 requestor and 2 provider nodes. Then I used gwasm-runner to launch the hello-gwasm-runner task as:

gwasm-runner --backend=Brass target/wasm32-unknown-emscripten/release/hello_world.wasm

The workload then gets stuck at 7/10 progress. Adding additional provider nodes seems to do the trick: the task succeeded with 4 provider nodes.

Those provider nodes kept running for about 24h. When I then ran the task again, the job got stuck at 7/10 again, and adding a fresh provider node did the trick again.

Logs and any additional context

$ for i in 1 2 3 4 5; do golemcli -d /home/marcin/golem/datadir$i/ -p 6100$i tasks subtasks list 8cd3f0be-82ea-11ea-8a50-6110940bba10; done
┌────────┬────────────────────────────────────────┬─────────────┬────────────┐
│  node  │  subtask id                            │  status     │  progress  │
├────────┼────────────────────────────────────────┼─────────────┼────────────┤
│        │  9841079e-82ea-11ea-a6fa-6110940bba10  │  Finished   │  100.0 %   │
│        │  98416ab0-82ea-11ea-b784-6110940bba10  │  Finished   │  100.0 %   │
│        │  9841e176-82ea-11ea-8ebd-6110940bba10  │  Finished   │  100.0 %   │
│        │  98423ba2-82ea-11ea-9094-6110940bba10  │  Finished   │  100.0 %   │
│        │  98429dc0-82ea-11ea-aaf3-6110940bba10  │  Finished   │  100.0 %   │
│        │  98430dc6-82ea-11ea-9887-6110940bba10  │  Finished   │  100.0 %   │
│        │  984368ba-82ea-11ea-8c1b-6110940bba10  │  Finished   │  100.0 %   │
│        │  9843e07e-82ea-11ea-bd8e-6110940bba10  │  Finished   │  100.0 %   │
│        │  98443bb8-82ea-11ea-b341-6110940bba10  │  Finished   │  100.0 %   │
│        │  9844ace4-82ea-11ea-8f1c-6110940bba10  │  Finished   │  100.0 %   │
│        │  9845120a-82ea-11ea-8e03-6110940bba10  │  Finished   │  100.0 %   │
│        │  9845885a-82ea-11ea-a0af-6110940bba10  │  Finished   │  100.0 %   │
│        │  9845e6ee-82ea-11ea-b7eb-6110940bba10  │  Finished   │  100.0 %   │
│        │  98465612-82ea-11ea-82ba-6110940bba10  │  Finished   │  100.0 %   │
│        │  9846b002-82ea-11ea-8e3f-6110940bba10  │  Verifying  │  0.0 %     │
│        │  984721f8-82ea-11ea-885f-6110940bba10  │  Verifying  │  0.0 %     │
│        │  98477db6-82ea-11ea-ab6b-6110940bba10  │  Verifying  │  0.0 %     │
│        │  a2e4b938-82ea-11ea-8e9d-6110940bba10  │  Timeout    │  0.0 %     │
│        │  a2e69ba2-82ea-11ea-ba7f-6110940bba10  │  Timeout    │  0.0 %     │
│        │  a2e89106-82ea-11ea-916e-6110940bba10  │  Timeout    │  0.0 %     │
│        │  13ef7d74-82ec-11ea-9999-6110940bba10  │  Starting   │  0.0 %     │
│        │  13f159ee-82ec-11ea-9187-6110940bba10  │  Starting   │  0.0 %     │
│        │  13f3468c-82ec-11ea-9d21-6110940bba10  │  Starting   │  0.0 %     │
└────────┴────────────────────────────────────────┴─────────────┴────────────┘
No subtasks
No subtasks
No subtasks
No subtasks
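
To make the subtask table above easier to read at a glance, here is a minimal sketch that tallies the status column of the pasted output. It only parses text in the box-drawing layout shown above and does not call golemcli itself; the function name `count_statuses` is made up for illustration and is not part of any Golem tooling.

```python
from collections import Counter


def count_statuses(table_text: str) -> Counter:
    """Count subtask statuses in a box-drawing table like the one pasted above."""
    counts = Counter()
    for raw_line in table_text.splitlines():
        line = raw_line.strip()
        # Data and header rows start with the vertical bar; borders and
        # "No subtasks" lines start with something else.
        if not line.startswith("│"):
            continue
        cells = [cell.strip() for cell in line.strip("│").split("│")]
        # Expected columns: node, subtask id, status, progress.
        # Skip malformed rows and the header row.
        if len(cells) != 4 or cells[1] == "subtask id":
            continue
        counts[cells[2]] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (assumed): pipe the saved golemcli output into this script.
    for status, n in count_statuses(sys.stdin.read()).most_common():
        print(f"{status}: {n}")
```

For the table above, this gives 14 Finished, 3 Verifying, 3 Timeout and 3 Starting subtasks.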

Excerpt from the requestor logs:

b09..a07693ff'
INFO     [golem.task.taskrequestorstats      ] Received work offers. offer_count=500, task_id='7aca90ac-824d-11ea-b476-6110940bba10'
INFO     [apps.wasm                          ] Node 1acb9b09d2d097819b1031ad7e918a8708bd0b24943900e1d704d1c968f22d7a3e5bff1e9a8a769468b11db46d6783358dfd0468fd14d0df6ee6adcea07693ff has been blacklisted for this task
INFO     [golem.task.taskserver              ] provider 1acb9b09d2d097819b1031ad7e918a8708bd0b24943900e1d704d1c968f22d7a3e5bff1e9a8a769468b11db46d6783358dfd0468fd14d0df6ee6adcea07693ff is not allowed for this task at this moment (either waiting for results or previously failed)
INFO     [golem.task.tasksession             ] Received offer to compute. task_id='7aca90ac-824d-11ea-b476-6110940bba10', node='37c64f7a..d543ea5a'
INFO     [golem.task.taskrequestorstats      ] Received work offers. offer_count=501, task_id='7aca90ac-824d-11ea-b476-6110940bba10'
INFO     [apps.wasm                          ] Node 37c64f7acb93ce0cd899cd2dd5d8ab9619e794df8c0063b683ccce1c57ea8e7d906161f3fb32d1086d3da620a2eebaa8b2a677e0f6f82f93944b512dd543ea5a has been blacklisted for this task
INFO     [golem.task.taskserver              ] provider 37c64f7acb93ce0cd899cd2dd5d8ab9619e794df8c0063b683ccce1c57ea8e7d906161f3fb32d1086d3da620a2eebaa8b2a677e0f6f82f93944b512dd543ea5a is not allowed for this task at this moment (either waiting for results or previously failed)
INFO     [golem.task.tasksession             ] Received offer to compute. task_id='7aca90ac-824d-11ea-b476-6110940bba10', node='1acb9b09..a07693ff'
INFO     [golem.task.taskrequestorstats      ] Received work offers. offer_count=502, task_id='7aca90ac-824d-11ea-b476-6110940bba10'
INFO     [apps.wasm                          ] Node 1acb9b09d2d097819b1031ad7e918a8708bd0b24943900e1d704d1c968f22d7a3e5bff1e9a8a769468b11db46d6783358dfd0468fd14d0df6ee6adcea07693ff has been blacklisted for this task
INFO     [golem.task.taskserver              ] provider 1acb9b09d2d097819b1031ad7e918a8708bd0b24943900e1d704d1c968f22d7a3e5bff1e9a8a769468b11db46d6783358dfd0468fd14d0df6ee6adcea07693ff is not allowed for this task at this moment (either waiting for results or previously failed)
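
The excerpt suggests that every offer from the two original providers is rejected because the node is blacklisted for the task. A minimal sketch for checking this over a whole requestor log is given below. It assumes the message format seen in the `apps.wasm` lines above ("Node <id> has been blacklisted for this task"); the function name `blacklisted_nodes` is hypothetical and not part of Golem.

```python
import re
from collections import Counter

# Message format assumed from the apps.wasm log lines in the excerpt above.
BLACKLIST_RE = re.compile(r"Node ([0-9a-f]+) has been blacklisted for this task")


def blacklisted_nodes(log_text: str) -> Counter:
    """Count how often each node ID appears in a 'blacklisted' log message."""
    counts = Counter()
    for line in log_text.splitlines():
        match = BLACKLIST_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (assumed): pipe the requestor log into this script.
    for node_id, n in blacklisted_nodes(sys.stdin.read()).most_common():
        # Abbreviate IDs the same way the tasksession log lines do.
        print(f"{node_id[:8]}..{node_id[-8:]}: {n}")
```

On the excerpt above, this reports two blacklist messages for node 1acb9b09..a07693ff and one for node 37c64f7a..d543ea5a.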