For me this scenario passes locally. Will try to run it some more times and see if I can reproduce. Maybe it's flaky. Might be related to https://github.com/orgs/raiden-network/projects/12#card-26353251 that also seems to have shown flaky behaviour.
It was indeed the first assert_pfs_history that failed. So my assumption would be that the PFS acted wrongly. I will look into it.
Just checked: for me it fails every time. I run the scenario like this:
scenario_player --chain=goerli:http://10.104.6.13:8545 run --no-ui --keystore-file=/home/krishna/.ethereum/keystore/UTC--2018-10-12T07-01-18.438476520Z--8f2e0940bed6f90f1cb14feb37f045bb79c41b2d --password=${KEYSTORE_PW} /home/krishna/raidenforked/raiden/raiden/tests/scenarios/pfs5_too_low_capacity.yaml
I took a look at the logs and also had @palango take a look. It looks like nodes are UNREACHABLE or offline when the payment takes place. I have now managed to run into this problem locally myself. Looks like a Matrix problem. @ulope, can you take a look when you're back from vacation? Or maybe @err508 has time?
This doesn't just happen for the specific scenario. I have experienced it for several different ones, which points even further towards a transport/matrix problem.
@ulope may I assign this to you? I know you're looking into it.
@andrevmatos also experienced this issue. A restart of the Raiden Node fixed it.
@ulope Here are all the logs from a run that failed.
I think I ran into this as well yesterday. So if the PFS logs help you, I can upload them.
This seems related to the matrix transport. I've seen a python client I had online appear as offline to the light client, even though it continued to operate normally. I couldn't spot anything in the logs, but it seems like the sync stopped and the servers set the presence of the node as offline. If that's the cause, the PFS is also not going to find the route, because it'll see the node as offline in matrix. Restarting the node seems to fix it because it restarts the client/transport, and the presence event is picked up again by the server and broadcast to the peers.
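If it helps to confirm that, one way to spot-check what a homeserver currently believes about a node's presence is to query the Matrix presence endpoint directly. This is only a rough sketch; the user ID, the choice of homeserver and the need for a valid access token on that server are my assumptions, not something taken from the logs:

# Ask transport01 what presence it reports for a node's Matrix user
# (substitute the node's user ID; @ and : must be URL-encoded as %40 and %3A).
# ACCESS_TOKEN is assumed to be a valid token for a user on that homeserver.
curl -s -H "Authorization: Bearer $ACCESS_TOKEN" "https://transport01.raiden.network/_matrix/client/r0/presence/%40<node-user>%3Atransport01.raiden.network/status" | jq .

If this reports "offline" while the node's own logs show it still syncing, that would match the theory above.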
Current working theory is that some sort of problem is happening with federation requests between (at least) transport03 and transport01.
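A quick way to sanity-check federation between the two servers, assuming the federation API is reachable on these hosts over plain HTTPS (that part is an assumption on my side):

# Unauthenticated federation version endpoint on each server:
curl -s https://transport01.raiden.network/_matrix/federation/v1/version | jq .
curl -s https://transport03.raiden.network/_matrix/federation/v1/version | jq .
# The public federation tester gives a fuller report:
curl -s "https://federationtester.matrix.org/api/report?server_name=transport03.raiden.network" | jq .FederationOK

This only shows that the servers answer federation requests at all, not that presence events actually get through, so treat it as a first check rather than proof.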
As mentioned in chat: to help track down the cause, please run the following command if you see this issue while running scenarios:
cat ~/.raiden/scenario-player/scenarios/<scenario-name>/node_<highest-run-number>_*/run-*.log | jq 'select(.event == "Using Matrix server") | .server_url'
(please substitute the correct values in the angle brackets).
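A small variation of the same command (just piping through sort and uniq) summarizes how the nodes of a run were distributed across the transport servers, which makes it quicker to see whether all of them ended up on a single server:

cat ~/.raiden/scenario-player/scenarios/<scenario-name>/node_<highest-run-number>_*/run-*.log | jq -r 'select(.event == "Using Matrix server") | .server_url' | sort | uniq -c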
cat ~/.raiden/scenario-player/scenarios/mfee1_flat_fee/node_9_*/run-*.log | jq 'select(.event == "Using Matrix server") | .server_url'
"https://transport01.raiden.network"
"https://transport03.raiden.network"
"https://transport03.raiden.network"
"https://transport01.raiden.network"
➜ BF1-test cat node_3_*/run-*.log | jq 'select(.event == "Using Matrix server") | .server_url'
"https://transport01.raiden.network"
"https://transport03.raiden.network"
"https://transport01.raiden.network"
"https://transport01.raiden.network"
"https://transport01.raiden.network"
"https://transport01.raiden.network"
Got the same problem running PFS1.
cat ~/.raiden/scenario-player/scenarios/pfs1_get_a_simple_path/node_0_003/run-000.log | jq 'select(.event == "Using Matrix server") | .server_url'
"https://transport03.raiden.network"
Same problem for MS3.
LOGPATH="$HOME/.raiden/scenario-player/scenarios/ms3_simple_monitoring/"; cat $LOGPATH/node_$(cat $LOGPATH/run_number.txt)_*/*.log | jq 'select(.event == "Using Matrix server") | .server_url'
"https://transport01.raiden.network"
"https://transport01.raiden.network"
I tested this today on my local machine with the latest SP and the latest Raiden:
(raiden_env) [krishna@krishna-pc scenario-player]$ git log
commit 8ded944959e6d469056cbce268e0c4aa64d4c66a (HEAD -> dev, upstream/dev)
Author: Nils Diefenbach <nlsdfnbch.foss@kolabnow.com>
Date: Tue Sep 24 11:35:57 2019 +0200
Update scenario-example-v2.yaml
scenario-player-run_pfs5_too_low_capacity_2019-09-24T16:48:32.log
I think these three logs are interesting, because all nodes are on transport3 here. They're from a failed run of mfee1_flat_fee.
The user that causes problems is Alice. cb64 and 16f0 are none of the other nodes, nor the scenario player.
Timeline:
16f0@transport1
cb64@transport1
cb64@transport1
16f0@transport1
@palango could you let me know what cb64 and 16f0 are doing in the scenario? Could this be the PFS/MS?
@err508 What are the full addresses?
@palango 0x16f090b46c9eea72478a21f00764e2798d966d0b and 0xcb645ac5a359fb2c6dd2971cb5f556f6ceb1ea06
but nvm, these are two clients for the presence logging.
After I purged the test transport servers' databases, the situation seems to have improved for many of the testers, but unfortunately not all (I know @czepluch is still seeing the issue).
What I've discovered so far:
- The problem seems to involve the /sync long polling http request (the /sync endpoint).
- It doesn't seem to be a networking issue. I've run networking tests in the background all day yesterday and haven't seen any indications that this could be the reason.
I'm going ahead with upgrading the test servers to a recent synapse release (with a workaround for #4634) anyway in the hope that this fixes the federation issue.
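In case anyone wants to poke at the /sync behaviour mentioned above while the upgrade happens, a very rough manual check is sketched below. The access token and the since-token handling are my assumptions, not something taken from the transport setup itself:

# First call returns immediately and yields a next_batch token
# (ACCESS_TOKEN is assumed to be valid for a user on that homeserver):
curl -s -H "Authorization: Bearer $ACCESS_TOKEN" "https://transport03.raiden.network/_matrix/client/r0/sync" | jq -r .next_batch
# Passing that token as ?since= makes the request long-poll, which is the behaviour in question:
time curl -s -o /dev/null -H "Authorization: Bearer $ACCESS_TOKEN" "https://transport03.raiden.network/_matrix/client/r0/sync?timeout=30000&since=<next_batch>"

If the long poll hangs well past the 30 second timeout or errors out, that would point at the /sync endpoint rather than the network.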
Description
It seems the node is not getting any routes from the PFS for the target node.
scenario-player-run_pfs5_too_low_capacity_2019-09-13T13:08:38.log
node_logs_0_1_2_3_4.zip
A snapshot of the SP console from my local machine: https://gist.github.com/agatsoh/f53f16372f87e672c21ec4989b8ea845
Edit: Raiden commit
SP commit
Expected Behavior
The node should be able to find the correct path. The PFS should reply with the correct path.
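One way to narrow down whether the PFS or the node is at fault would be to ask the PFS for a route directly. This is only a sketch; the PFS URL, the token network address and the request fields below are assumptions based on my reading of the PFS API, not values from this run:

PFS_URL="https://pfs-goerli.services-dev.raiden.network"   # assumed PFS instance
TOKEN_NETWORK="0x<token-network-address>"                  # token network used in the scenario
# Basic info/liveness check of the PFS:
curl -s "$PFS_URL/api/v1/info" | jq .
# Ask for up to 5 paths between initiator and target for a minimal transfer amount:
curl -s -X POST "$PFS_URL/api/v1/$TOKEN_NETWORK/paths" -H "Content-Type: application/json" -d '{"from": "0x<initiator>", "to": "0x<target>", "value": 1, "max_paths": 5}' | jq .

If the paths request comes back empty or with an error while both nodes are online and have capacity, that would point at the PFS rather than the node.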