raiden-network / raiden

Raiden Network
https://developer.raiden.network

Presence issue in BF1 #5285

Closed. palango closed this issue 4 years ago

palango commented 4 years ago

I ran the BF1 scenario on latest develop yesterday (2b6bcc233b5312788878e0d10f842ca425bb129b)

Biggest difference between sync calls

2019-11-13 16:50:35.618260 | 2019-11-13 16:50:55.791961 | 0:00:20.173701

The sync calls now happen with a maximum difference of 20 seconds.
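The gap.py script used for this check (and again below) is not included in the report; presumably it does something like the following, a minimal sketch under that assumption: read one timestamp per line from stdin and print the pair of consecutive timestamps with the largest gap, which is the format of the line above.

# Hypothetical reconstruction of gap.py (the actual script is not included in
# the report): read one timestamp per line from stdin and print the pair of
# consecutive timestamps with the largest gap between them.
import sys
from datetime import datetime

timestamps = [
    datetime.fromisoformat(line.strip()) for line in sys.stdin if line.strip()
]

prev, nxt = max(zip(timestamps, timestamps[1:]), key=lambda pair: pair[1] - pair[0])
print(f"{prev} | {nxt} | {nxt - prev}")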

Presence issues

However, the PFS still has wrong presence information for the nodes:

pfs-goerli-with-fee_1 | {"source": "0xe85baEd489E21b36612993D90B4BD9fF32c8be4C", "target": "0x10918f8e0A0Fe15A36bCCD938899f05cdEaBc4aE", "value": 1000000000000000, "max_paths": 5, "diversity_penalty": 5, "fee_penalty": 100, "reachabilities": {"0x8FcD78C69005cFCaa512739a05B7Fb4c4cef574f": "<AddressReachability.UNREACHABLE: 2>", "0x4fc016B482df9273C4Dcb347694a847ae263ddA3": "<AddressReachability.UNREACHABLE: 2>", "0x40561304BB59D22D9e10bfa73a9607Fa0c7F5014": "<AddressReachability.UNREACHABLE: 2>", "0x38D7ccBe856f09Dcb432897988D16d35A0cC6138": "<AddressReachability.UNREACHABLE: 2>", "0xB3585E4F75b5D9eD7ba75488C6D6DC934B1d089c": "<AddressReachability.UNREACHABLE: 2>", "0xeb9E01c4105b2A417b17221022CEC6e8f5763f4b": "<AddressReachability.UNREACHABLE: 2>", "0x9d928AB2e1d0D54C9814029BC51E3fA1B074e128": "<AddressReachability.UNREACHABLE: 2>", "0x6cc9f4A546EA013d6eA76C54c99137B2a2b8200e": "<AddressReachability.UNREACHABLE: 2>", "0xB94e524511a99D739F1bC04241Ef14513C2929BA": "<AddressReachability.UNREACHABLE: 2>", "0x2aeDD36d02F0fC3178B653f7E1096fDc82F256a9": "<AddressReachability.UNREACHABLE: 2>", "0xf5322350d1598674FcB7737C1fDf1257B8AeA788": "<AddressReachability.UNREACHABLE: 2>", "0x93173CC3b293c6AdcD51D4BeC1E92F87935DABf0": "<AddressReachability.UNREACHABLE: 2>", "0x0E60f242c9D1017100e54D3985EbD9C1e5Fc646D": "<AddressReachability.UNREACHABLE: 2>", "0xe85baEd489E21b36612993D90B4BD9fF32c8be4C": "<AddressReachability.UNREACHABLE: 2>", "0x9B19E063cFADb4109Ea60287f0D52BE7B6011392": "<AddressReachability.UNREACHABLE: 2>", "0x72D3bDa01a28AF1035e4BcBB3FAE700335F6bd08": "<AddressReachability.UNREACHABLE: 2>", "0x10918f8e0A0Fe15A36bCCD938899f05cdEaBc4aE": "<AddressReachability.UNREACHABLE: 2>"}, "event": "Finding paths for payment", "level": "debug", "logger": "pathfinding_service.model.token_network", "timestamp": "2019-11-13 16:50:24.645558"}
pfs-goerli-with-fee_1 | {"error": "NoRouteFound(None)", "details": {"from_": "0xe85baEd489E21b36612993D90B4BD9fF32c8be4C", "to": "0x10918f8e0A0Fe15A36bCCD938899f05cdEaBc4aE", "value": 1000000000000000}, "message": "No route between nodes found.", "event": "Error while handling request", "level": "warning", "logger": "pathfinding_service.api", "timestamp": "2019-11-13 16:50:24.653477"}
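The NoRouteFound in the second line follows directly from the reachability map in the first: every node in the network, including the payment's source and target, is UNREACHABLE from the PFS's point of view, and unreachable nodes cannot be part of a route. A minimal illustration of that check (not the actual pathfinding_service code):

# Minimal illustration of why the request fails: the PFS will not route through
# nodes it considers unreachable, and here even source and target are unreachable.
# This is not the actual pathfinding_service implementation.
UNREACHABLE = "<AddressReachability.UNREACHABLE: 2>"

reachabilities = {
    # abbreviated from the log line above; all 17 entries have the same value
    "0xe85baEd489E21b36612993D90B4BD9fF32c8be4C": UNREACHABLE,  # source
    "0x10918f8e0A0Fe15A36bCCD938899f05cdEaBc4aE": UNREACHABLE,  # target
}

def can_route(source, target, reachabilities):
    """A route can only exist if both endpoints (and every hop) are reachable."""
    return (
        reachabilities.get(source) != UNREACHABLE
        and reachabilities.get(target) != UNREACHABLE
    )

assert not can_route(
    "0xe85baEd489E21b36612993D90B4BD9fF32c8be4C",
    "0x10918f8e0A0Fe15A36bCCD938899f05cdEaBc4aE",
    reachabilities,
)  # -> the PFS answers with NoRouteFound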

Logs

Archive 4.zip

hackaugusto commented 4 years ago

This is the request:

  "details": {
    "from_": "0xe85baEd489E21b36612993D90B4BD9fF32c8be4C",
    "to": "0x10918f8e0A0Fe15A36bCCD938899f05cdEaBc4aE",
    "value": 1000000000000000
  },

It was requested twice, and it failed at these times:

  "timestamp": "2019-11-13 16:44:07.536268"
  "timestamp": "2019-11-13 16:50:24.653477"

This is the first time the initiator became reachable to the PFS, just after the PFS saw the channel open event:

  "address": "0xe85baEd489E21b36612993D90B4BD9fF32c8be4C",
  "prev_state": "<AddressReachability.UNKNOWN: 3>",
  "state": "<AddressReachability.REACHABLE: 1>",
  "timestamp": "2019-11-13 16:37:36.725033"

And it goes offline at this time:

  "address": "0xe85baEd489E21b36612993D90B4BD9fF32c8be4C",
  "prev_state": "<AddressReachability.REACHABLE: 1>",
  "state": "<AddressReachability.UNREACHABLE: 2>",
  "timestamp": "2019-11-13 16:41:07.128322"

That is 3 minutes before the first failed request.
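The AddressReachability values above are derived from Matrix presence updates. A rough sketch of that mapping, for orientation only (the exact logic lives in the Raiden transport / PFS code and may handle more cases):

# Illustrative sketch only: how Matrix presence plausibly maps to the
# AddressReachability values seen in the PFS logs (enum values match the logs
# above; the actual Raiden/PFS mapping may differ).
from enum import Enum


class AddressReachability(Enum):
    REACHABLE = 1
    UNREACHABLE = 2
    UNKNOWN = 3


PRESENCE_TO_REACHABILITY = {
    "online": AddressReachability.REACHABLE,
    "unavailable": AddressReachability.REACHABLE,  # assumption: idle still counts as reachable
    "offline": AddressReachability.UNREACHABLE,
}


def reachability_from_presence(presence: str) -> AddressReachability:
    return PRESENCE_TO_REACHABILITY.get(presence, AddressReachability.UNKNOWN)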

Edit: The node 0xe85ba did call sync with a maximum delay of 20s, so its status should be online. This was checked with:

  cat node_13_003/run-013.log | grep "Sync called" | jq .timestamp -r | python gap.py

Edit 2: The presence test running on the matrix server used by the PFS also sees the node as offline. This is the only entry for that address in that log file, checked with:

  ssh transport4 'cat /var/log/presence.log* | grep -i e85baed489e21b36612993d90b4bd9ff32c8be4c'

{
  "server": "https://transport04.raiden.network",
  "user_id": "@0xe85baed489e21b36612993d90b4bd9ff32c8be4c:transport01.raiden.network",
  "presence": "<UserPresence.OFFLINE: 'offline'>",
  "timestamp": "2019-11-14 11:43:26.196769"
}

Edit 3: The node and the PFS were connected to different matrix servers: the PFS was connected to transport4, the node to transport1. Here is the presence test for transport1, checked with:

  ssh transport01 'cat /var/log/presence.log.3 /var/log/presence.log.2 /var/log/presence.log.1 /var/log/presence.log | grep -i e85baed489e21b36612993d90b4bd9ff32c8be4c'

{"p":"<UserPresence.OFFLINE: 'offline'>","ts":"2019-11-13 16:35:53.460142"}
{"p":"<UserPresence.ONLINE: 'online'>","ts":"2019-11-13 16:35:58.296736"}
{"p":"<UserPresence.ONLINE: 'online'>","ts":"2019-11-13 16:37:02.595703"}
{"p":"<UserPresence.ONLINE: 'online'>","ts":"2019-11-13 16:39:21.239484"}
{"p":"<UserPresence.ONLINE: 'online'>","ts":"2019-11-13 16:40:37.314553"}
{"p":"<UserPresence.OFFLINE: 'offline'>","ts":"2019-11-13 16:41:07.160928"}
{"p":"<UserPresence.OFFLINE: 'offline'>","ts":"2019-11-13 16:47:09.503588"}
{"p":"<UserPresence.OFFLINE: 'offline'>","ts":"2019-11-13 16:50:23.804801"}
{"p":"<UserPresence.OFFLINE: 'offline'>","ts":"2019-11-14 11:43:27.902274"}

Edit 4: The transport1 synchrotron is receiving the sync requests every 20s (from 2019-11-13 16:35:57 to 2019-11-13 16:51:31), checked with:

  ssh transport1 'for c in $(docker ps -qf name=synchrotron); do docker logs $c 2>&1; done | gzip' | gzip -d | grep -i e85baed489e21b36612993d90b4bd9ff32c8be4c | cut '-d ' -f1,2 | cut '-d,' -f1 | python gap.py

Edit 5: The user did not roam; there is only one user id:

>>> from raiden.network.transport.matrix.utils import first_login
>>> from raiden.network.transport.matrix.client import GMatrixClient
>>> from raiden.tests.utils.factories import make_signer
>>> from eth_utils import to_normalized_address
>>> # Log in with a throwaway signer and search the user directory on both servers
>>> s = make_signer()
>>> c = GMatrixClient("https://transport01.raiden.network")
>>> u = first_login(c, s, str(to_normalized_address(s.address)))
>>> c.search_user_directory(to_normalized_address("0xe85baEd489E21b36612993D90B4BD9fF32c8be4C"))
[<User id='@0xe85baed489e21b36612993d90b4bd9ff32c8be4c:transport01.raiden.network'>]
>>> c = GMatrixClient("https://transport04.raiden.network")
>>> u = first_login(c, s, str(to_normalized_address(s.address)))
>>> c.search_user_directory(to_normalized_address("0xe85baEd489E21b36612993D90B4BD9fF32c8be4C"))
[<User id='@0xe85baed489e21b36612993d90b4bd9ff32c8be4c:transport01.raiden.network'>]
>>> # Both homeservers return the same single user id, registered on transport01

Edit 6: This is the only entry in the synapse DB for the presence stream:

synapse=# select * from presence_stream where user_id = '@0xe85baed489e21b36612993d90b4bd9ff32c8be4c:transport01.raiden.network';
 stream_id |                                user_id                                 |  state  | last_active_ts | last_federation_update_ts | last_user_sync_ts | status_msg | currently_active
-----------+------------------------------------------------------------------------+---------+----------------+---------------------------+-------------------+------------+------------------
    119646 | @0xe85baed489e21b36612993d90b4bd9ff32c8be4c:transport01.raiden.network | offline |  1573663823479 |             1573663823479 |     1573662957900 |            | t
(1 row)

That corresponds to:

>>> import datetime
>>> # last sync ts
>>> datetime.datetime.fromtimestamp(1573662957900 / 1000)
datetime.datetime(2019, 11, 13, 13, 35, 57, 900000)
>>> # last active ts
>>> datetime.datetime.fromtimestamp(1573663823479 / 1000)
datetime.datetime(2019, 11, 13, 13, 50, 23, 479000)
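Note that fromtimestamp converts to the local timezone of the machine the snippet was run on (evidently UTC-3 here); in UTC the same values line up with the log timestamps above:

>>> # same values in UTC
>>> datetime.datetime.utcfromtimestamp(1573662957900 / 1000)
datetime.datetime(2019, 11, 13, 16, 35, 57, 900000)
>>> datetime.datetime.utcfromtimestamp(1573663823479 / 1000)
datetime.datetime(2019, 11, 13, 16, 50, 23, 479000)

So last_user_sync_ts corresponds to the very first sync (16:35:57), while last_active_ts corresponds to 16:50:23, the time of the last offline presence reading above.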

Note: the last sync ts being low is not a problem, since a user is not set offline because of the active ts. This is the last entry for the node in the synchrotron log:

2019-11-13 16:51:22,973 - synapse.util.caches.response_cache - 148 - INFO - GET-562733 - [sync]: no cached result for [(@0xe85baed489e21b36612993d90b4bd9ff32c8be4c:transport01.raiden.network, 20000, 's232243_119634_0_1_100_1_1_252771_1', None, False, 'QPUJWOVMPF')], calculating new one
2019-11-13 16:51:31,107 - synapse.handlers.sync - 935 - INFO - GET-562733 - Calculating sync response for @0xe85baed489e21b36612993d90b4bd9ff32c8be4c:transport01.raiden.network between StreamToken(room_key='s232243', presence_key='119634', typing_key='0', receipt_key='1', account_data_key='100', push_rules_key='1', to_device_key='1', device_list_key='252771', groups_key='1') and StreamToken(room_key='s232243', presence_key=119634, typing_key=0, receipt_key=1, account_data_key=100, push_rules_key=1, to_device_key=1, device_list_key=252772, groups_key=1)
2019-11-13 16:51:31,119 - synapse.access.http.8008 - 302 - INFO - GET-562733 - 172.19.0.19 - 8008 - {@0xe85baed489e21b36612993d90b4bd9ff32c8be4c:transport01.raiden.network} Processed request: 8.148sec/-7.154sec (0.008sec, 0.001sec) (0.001sec/0.005sec/1) 0B 200! "GET /_matrix/client/r0/sync?timeout=20000&since=s232243_119634_0_1_100_1_1_252771_1&access_token=<redacted> HTTP/1.1" "python-requests/2.22.0" [0 dbevts]

Which corresponds to:

Token(room_key='s232243', presence_key='119634', typing_key='0', receipt_key='1', account_data_key='100', push_rules_key='1', to_device_key='1', device_list_key='252771', groups_key='1')
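For context on the note above about the active ts: Synapse expires presence on server-side timers. Very roughly, and explicitly not the actual Synapse code (the constants below are approximations):

# Heavily simplified sketch of presence timeout handling; the real logic lives
# in Synapse's presence handler and the constants here are approximations.
IDLE_TIMER_MS = 5 * 60 * 1000        # assumed: online -> unavailable after ~5 min inactivity
SYNC_ONLINE_TIMEOUT_MS = 30 * 1000   # assumed: -> offline after ~30s without a /sync


def apply_presence_timeouts(presence, last_active_ts, last_user_sync_ts,
                            currently_syncing, now_ms):
    # Inactivity only downgrades online to unavailable, it does not force offline.
    if presence == "online" and now_ms - last_active_ts > IDLE_TIMER_MS:
        presence = "unavailable"
    # Only a user that is no longer syncing is timed out to offline.
    if not currently_syncing and now_ms - last_user_sync_ts > SYNC_ONLINE_TIMEOUT_MS:
        presence = "offline"
    return presence

If that is roughly right, the low last_user_sync_ts should indeed not matter as long as the synchrotron keeps reporting the user as syncing, and a node that re-syncs every ~20 seconds should never be timed out to offline, which is what makes the recorded offline state surprising.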

Walkthrough of the matrix presence handling through the sync API (it may contain errors)

To set the calling user presence:

To retrieve the presence the user is interested in:
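In client-server terms (a rough sketch of both steps, based on the Matrix r0 spec rather than the Synapse internals referenced above): the caller's own presence is set as a side effect of long-polling /sync, controlled by the set_presence query parameter, and the presence of peers comes back in the presence.events section of the sync response. HOMESERVER and ACCESS_TOKEN below are placeholders.

# Rough sketch of presence via /sync, following the Matrix r0 client-server spec.
import requests

HOMESERVER = "https://transport01.raiden.network"
ACCESS_TOKEN = "..."  # placeholder

since = None
while True:
    params = {
        "timeout": 20000,          # long-poll for up to 20s, as the Raiden nodes do
        "set_presence": "online",  # syncing with this keeps the calling user online
    }
    if since:
        params["since"] = since
    resp = requests.get(
        f"{HOMESERVER}/_matrix/client/r0/sync",
        params=params,
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    ).json()
    since = resp["next_batch"]

    # Presence of other users arrives as m.presence events in the sync response.
    for event in resp.get("presence", {}).get("events", []):
        print(event["sender"], event["content"]["presence"])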

ulope commented 4 years ago

Post-mortem:

It has not been conclusively shown whether split-mode is or isn't the cause for the flaky presence. This will be investigated with a separate transport installation.