For me this scenario passes locally. Will try to run it some more times and see if I can reproduce. Maybe it's flaky. Might be related to https://github.com/orgs/raiden-network/projects/12#card-26353251 that also seems to have shown flaky behaviour.
It was indeed the first assert_pfs_history that failed. So my assumption would be that the PFS acted wrongly. I will look into it.
Just checked: for me it fails every time. I run the scenario like this:
scenario_player --chain=goerli:http://10.104.6.13:8545 run --no-ui --keystore-file=/home/krishna/.ethereum/keystore/UTC--2018-10-12T07-01-18.438476520Z--8f2e0940bed6f90f1cb14feb37f045bb79c41b2d --password=${KEYSTORE_PW} /home/krishna/raidenforked/raiden/raiden/tests/scenarios/pfs5_too_low_capacity.yaml
I took a look at the logs and also had @palango take a look. It looks like nodes are UNREACHABLE or offline when the payment takes place. I have now managed to run into this problem locally myself. Looks like a Matrix problem. @ulope, can you take a look when you're back from vacation? Or maybe @err508 has time?
This doesn't just happen for the specific scenario. I have experienced it for several different ones, which points even further towards a transport/matrix problem.
@ulope may I assign this to you? I know you're looking into it.
@andrevmatos also experienced this issue. A restart of the Raiden Node fixed it.
@ulope Here are all the logs from a run that failed.
I think I ran into this as well yesterday. So if the PFS logs help you, I can upload them.
This seems related to the matrix transport. I've seen a python client I had online appear as offline to the light client, even though it continued to operate normally. I couldn't spot anything in the logs, but it seems like the sync stopped and the servers set the presence of the node as offline. If that's the cause, the PFS is also not going to find the route, because it'll see the node as offline in matrix. Restarting the node seems to fix it because it restarts the client/transport, and the presence event is picked up again by the server and broadcast to the peers.
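If it helps to confirm that, one way to spot-check what a homeserver currently believes about a node's presence is to query the Matrix presence endpoint directly. This is only a rough sketch; the user ID, the choice of homeserver and the need for a valid access token on that server are my assumptions, not something taken from the logs:

# Ask transport01 what presence it reports for a node's Matrix user
# (substitute the node's user ID; @ and : must be URL-encoded as %40 and %3A).
# ACCESS_TOKEN is assumed to be a valid token for a user on that homeserver.
curl -s -H "Authorization: Bearer $ACCESS_TOKEN" "https://transport01.raiden.network/_matrix/client/r0/presence/%40<node-user>%3Atransport01.raiden.network/status" | jq .

If this reports "offline" while the node's own logs show it still syncing, that would match the theory above.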
Current working theory is that some sort of problem is happening with federation requests between (at least) transport03 and transport01.
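A quick way to sanity-check federation between the two servers, assuming the federation API is reachable on these hosts over plain HTTPS (that part is an assumption on my side):

# Unauthenticated federation version endpoint on each server:
curl -s https://transport01.raiden.network/_matrix/federation/v1/version | jq .
curl -s https://transport03.raiden.network/_matrix/federation/v1/version | jq .
# The public federation tester gives a fuller report:
curl -s "https://federationtester.matrix.org/api/report?server_name=transport03.raiden.network" | jq .FederationOK

This only shows that the servers answer federation requests at all, not that presence events actually get through, so treat it as a first check rather than proof.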
As mentioned in chat: to help track down the cause, please run the following command if you see this issue while running scenarios:
cat ~/.raiden/scenario-player/scenarios/<scenario-name>/node_<highest-run-number>_*/run-*.log | jq 'select(.event == "Using Matrix server") | .server_url'
(please substitute the correct values in the angle brackets).
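A small variation of the same command (just piping through sort and uniq) summarizes how the nodes of a run were distributed across the transport servers, which makes it quicker to see whether all of them ended up on a single server:

cat ~/.raiden/scenario-player/scenarios/<scenario-name>/node_<highest-run-number>_*/run-*.log | jq -r 'select(.event == "Using Matrix server") | .server_url' | sort | uniq -c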
cat ~/.raiden/scenario-player/scenarios/mfee1_flat_fee/node_9_*/run-*.log | jq 'select(.event == "Using Matrix server") | .server_url'
"https://transport01.raiden.network"
"https://transport03.raiden.network"
"https://transport03.raiden.network"
"https://transport01.raiden.network"
➜ BF1-test cat node_3_*/run-*.log | jq 'select(.event == "Using Matrix server") | .server_url'
"https://transport01.raiden.network"
"https://transport03.raiden.network"
"https://transport01.raiden.network"
"https://transport01.raiden.network"
"https://transport01.raiden.network"
"https://transport01.raiden.network"
Got the same problem running PFS1.
cat ~/.raiden/scenario-player/scenarios/pfs1_get_a_simple_path/node_0_003/run-000.log | jq 'select(.event == "Using Matrix server") | .server_url'
"https://transport03.raiden.network"
Same problem for MS3.
LOGPATH="$HOME/.raiden/scenario-player/scenarios/ms3_simple_monitoring/"; cat $LOGPATH/node_$(cat $LOGPATH/run_number.txt)_*/*.log | jq 'select(.event == "Using Matrix server") | .server_url'
"https://transport01.raiden.network"
"https://transport01.raiden.network"
I tested this today on my local machine with the latest SP and the latest Raiden:
(raiden_env) [krishna@krishna-pc scenario-player]$ git log
commit 8ded944959e6d469056cbce268e0c4aa64d4c66a (HEAD -> dev, upstream/dev)
Author: Nils Diefenbach <nlsdfnbch.foss@kolabnow.com>
Date: Tue Sep 24 11:35:57 2019 +0200
Update scenario-example-v2.yaml
scenario-player-run_pfs5_too_low_capacity_2019-09-24T16:48:32.log
I think these three logs are interesting, because all nodes are on transport3 here. They're from a failed run of mfee1_flat_fee.
The user that causes problems is Alice. cb64 and 16f0 are none of the other nodes, nor the scenario player.
Timeline:
16f0@transport1
cb64@transport1
cb64@transport1
16f0@transport1
@palango could you let me know what cb64 and 16f0 are doing in the scenario? Could this be the PFS/MS?
@err508 What are the full addresses?
@palango 0x16f090b46c9eea72478a21f00764e2798d966d0b and 0xcb645ac5a359fb2c6dd2971cb5f556f6ceb1ea06
but nvm, these are two clients for the presence logging.
After I purged the test transport servers' databases, the situation seems to have improved for many of the testers, but unfortunately not all (I know @czepluch is still seeing the issue).
What I've discovered so far:
- The problem seems to involve the /sync long polling http request (the /sync endpoint).
- It doesn't seem to be a networking issue. I've run networking tests in the background all day yesterday and haven't seen any indications that this could be the reason.
I'm going ahead with upgrading the test servers to a recent synapse release (with a workaround for #4634) anyway in the hope that this fixes the federation issue.
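In case anyone wants to poke at the /sync behaviour mentioned above while the upgrade happens, a very rough manual check is sketched below. The access token and the since-token handling are my assumptions, not something taken from the transport setup itself:

# First call returns immediately and yields a next_batch token
# (ACCESS_TOKEN is assumed to be valid for a user on that homeserver):
curl -s -H "Authorization: Bearer $ACCESS_TOKEN" "https://transport03.raiden.network/_matrix/client/r0/sync" | jq -r .next_batch
# Passing that token as ?since= makes the request long-poll, which is the behaviour in question:
time curl -s -o /dev/null -H "Authorization: Bearer $ACCESS_TOKEN" "https://transport03.raiden.network/_matrix/client/r0/sync?timeout=30000&since=<next_batch>"

If the long poll hangs well past the 30 second timeout or errors out, that would point at the /sync endpoint rather than the network.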
Description
It seems the node is not getting any routes from the PFS for the target node.
scenario-player-run_pfs5_too_low_capacity_2019-09-13T13:08:38.log
node_logs_0_1_2_3_4.zip
A snapshot of the SP console from my local machine: https://gist.github.com/agatsoh/f53f16372f87e672c21ec4989b8ea845
Edit: Raiden commit
SP commit
Expected Behavior
The node should be able to find the correct path. The PFS should reply with the correct path.
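One way to narrow down whether the PFS or the node is at fault would be to ask the PFS for a route directly. This is only a sketch; the PFS URL, the token network address and the request fields below are assumptions based on my reading of the PFS API, not values from this run:

PFS_URL="https://pfs-goerli.services-dev.raiden.network"   # assumed PFS instance
TOKEN_NETWORK="0x<token-network-address>"                  # token network used in the scenario
# Basic info/liveness check of the PFS:
curl -s "$PFS_URL/api/v1/info" | jq .
# Ask for up to 5 paths between initiator and target for a minimal transfer amount:
curl -s -X POST "$PFS_URL/api/v1/$TOKEN_NETWORK/paths" -H "Content-Type: application/json" -d '{"from": "0x<initiator>", "to": "0x<target>", "value": 1, "max_paths": 5}' | jq .

If the paths request comes back empty or with an error while both nodes are online and have capacity, that would point at the PFS rather than the node.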