golemfactory / yagna-triage

Repository for issues which we don't yet know about enough to assign to proper repo
1 stars 0 forks source link

SCT Investigate removed network when running dapp longer than 30 minutes #242

Closed jalas167 closed 1 year ago

jalas167 commented 1 year ago

@SDK , we have an interesting case in production observed by @broadcastmonkey Paweł Burgchardt , the application is currently running on Golem (according to dapp-manager) - https://291f9f8df0964af8854dfe733ab8b8f2.portal.golem.network/

The app was deployed, not used for 30+ minutes, and then the link was used. Leads to HTTP 500 on for the users. I took a look on the stderr of this app and I got the following information (next message).

Can you please elaborate on what's wrong?

The timeline (referring to the timestamps in the logs)

The logs

Related logs on the provider:

https://yastats.golem.network/explore?orgId=1&left=%7B%22datasource%22:%22iT4zPPcGz%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22iT4zPPcGz%22%7D,%22editorMode%22:%22code%22,%22expr%22:%22%7Bhostname%3D%5C%22fractal_01_0.h%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22%7D%5D,%22range%22:%7B%22from%22:%221681287145000%22,%22to%22:%221681290805000%22%7D%7D

Related logs on the requestor:

https://yastats.golem.network/explore?orgId=1&left=%7B%22datasource%22:%22iT4zPPcGz%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22iT4zPPcGz%22%7D,%22editorMode%22:%22code%22,%22expr%22:%22%7Bhostname%3D%5C%22portal.golem.network%5C%22,%20service%3D%5C%22yagna%5C%22%7D%20%21%3D%20%5C%22driver::cron%5C%22%20%21%3D%20%5C%22erc20::wallet%5C%22%20%21%3D%20%5C%22not%20responding%20to%20ping%5C%22%20%21%3D%20%5C%22ya_market::negotiation%5C%22%22,%22queryType%22:%22range%22%7D%5D,%22range%22:%7B%22from%22:%221681287200000%22,%22to%22:%221681290804000%22%7D%7D

PLEASE NOTE THAT THIS IS THE GOLEM PORTAL REQUESTOR, SO YOU WILL SEE LOGS FROM ALL ACTIVITY, NOT ONLY THE ONE AFFECTED.

Within this timeframe I was able to find one interesting entry:

[2023-04-12T08:41:56.893+0000 WARN  ya_vpn::network] Stopping VPN 8b9907011f664d1992b66ec254118175

But it's not the one which was used by Paweł's instance. So I decided to check the VPN related logs:

$ cat yagna_r2023-04-12_00-00-00.log  | grep VPN
[2023-04-12T05:16:40.815+0000 INFO  ya_vpn::network] VPN 0d81c00f56f24d5ab353f681d17fe75e started
[2023-04-12T06:17:04.091+0000 WARN  ya_vpn::network] Stopping VPN 0d81c00f56f24d5ab353f681d17fe75e
[2023-04-12T07:40:59.788+0000 INFO  ya_vpn::network] VPN 8b9907011f664d1992b66ec254118175 started
[2023-04-12T08:12:28.082+0000 INFO  ya_vpn::network] VPN 79e859796e5c40f5a59657d8bc5badd2 started
[2023-04-12T08:41:56.893+0000 WARN  ya_vpn::network] Stopping VPN 8b9907011f664d1992b66ec254118175
[2023-04-12T12:18:08.865+0000 INFO  ya_vpn::network] VPN 1b0cc7dab1bc4e718222cc677efd7407 started
[2023-04-12T12:19:58.417+0000 WARN  ya_vpn::network] Stopping VPN 1b0cc7dab1bc4e718222cc677efd7407
[2023-04-12T13:23:18.778+0000 INFO  ya_vpn::network] VPN 7418f1dd90684c2c806e9827c0bd7d62 started
[2023-04-12T13:23:27.942+0000 INFO  ya_vpn::network] VPN 494ea14e14ed499798284fa905a323c6 started
[2023-04-12T13:28:19.303+0000 WARN  ya_vpn::network] Stopping VPN 7418f1dd90684c2c806e9827c0bd7d62

Here you can see that 79e859796e5c40f5a59657d8bc5badd2 has been created 2023-04-12T08:12:28.082+0000 and it was never stopped (at least according to yagna's logs).

Using this information I checked back the provider logs and was able to find:

2023-04-12 10:13:24 | [2023-04-12T08:13:22.624+0000 INFO  ya_exe_unit::network::vpn] Starting VPN service... |  
-- | -- | --
  |   | 2023-04-12 10:13:24 | [2023-04-12T08:13:22.625+0000 DEBUG ya_runtime_api::server::client] sending request: Request { id: 3, command: Some(Network(CreateNetwork { networks: [Network { addr: "192.168.0.0", gateway: "192.168.0.1", mask: "255.255.255.0", if_addr: "192.168.0.4" }], hosts: {}, interface: Vpn })) } |  
  |   | 2023-04-12 10:13:24 | [2023-04-12T08:13:22.625+0000 DEBUG ya_runtime_api::server::client] waiting for response |  
  |   | 2023-04-12 10:13:24 | [2023-04-12T08:13:23.338+0000 DEBUG ya_runtime_api::server::client] got response: Response { event: false, id: 3, command: Some(Network(CreateNetwork { endpoint: Some(UdpDatagram("127.0.0.1:55567")) })) }

This looks like the VPN connection establishing really failed? Another line in the provider logs points to, so it seems that the VPN was actually working?

[2023-04-12T08:13:23.338+0000 INFO  ya_exe_unit::network::vpn] [vpn] service started
nieznanysprawiciel commented 1 year ago

Investigation result (meeting 14.04 with @grisha87 @lucekdudek and @nieznanysprawiciel ): Problem can be reproduced by creating 2 applications at the same time. After one application is closed, both networks disappear. Yagna internally during removing network clears ownership map, but instead of removing only one network, it removes whole key for specific identity (But network itself isn't removed). Following calls referencing this network will fail, because identity is not the owner of the network

nieznanysprawiciel commented 1 year ago

RC with fixes is here: https://github.com/golemfactory/yagna/releases/tag/pre-rel-v0.13.0-rc3