raiden-network / raiden

Raiden Network
https://developer.raiden.network

Presence issue in BF1 #5285

Closed. palango closed this issue 4 years ago

palango commented 4 years ago

I ran the BF1 scenario on latest develop yesterday (2b6bcc233b5312788878e0d10f842ca425bb129b)

Biggest difference between sync calls

2019-11-13 16:50:35.618260 | 2019-11-13 16:50:55.791961 | 0:00:20.173701

The sync calls now happen with a maximum difference of 20 seconds.
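The gap.py script used for this check (and again below) is not included in the report; presumably it does something like the following, a minimal sketch under that assumption: read one timestamp per line from stdin and print the pair of consecutive timestamps with the largest gap, which is the format of the line above.

# Hypothetical reconstruction of gap.py (the actual script is not included in
# the report): read one timestamp per line from stdin and print the pair of
# consecutive timestamps with the largest gap between them.
import sys
from datetime import datetime

timestamps = [
    datetime.fromisoformat(line.strip()) for line in sys.stdin if line.strip()
]

prev, nxt = max(zip(timestamps, timestamps[1:]), key=lambda pair: pair[1] - pair[0])
print(f"{prev} | {nxt} | {nxt - prev}")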

Presence issues

However, the PFS still has wrong presence information for the nodes:

pfs-goerli-with-fee_1 | {"source": "0xe85baEd489E21b36612993D90B4BD9fF32c8be4C", "target": "0x10918f8e0A0Fe15A36bCCD938899f05cdEaBc4aE", "value": 1000000000000000, "max_paths": 5, "diversity_penalty": 5, "fee_penalty": 100, "reachabilities": {"0x8FcD78C69005cFCaa512739a05B7Fb4c4cef574f": "<AddressReachability.UNREACHABLE: 2>", "0x4fc016B482df9273C4Dcb347694a847ae263ddA3": "<AddressReachability.UNREACHABLE: 2>", "0x40561304BB59D22D9e10bfa73a9607Fa0c7F5014": "<AddressReachability.UNREACHABLE: 2>", "0x38D7ccBe856f09Dcb432897988D16d35A0cC6138": "<AddressReachability.UNREACHABLE: 2>", "0xB3585E4F75b5D9eD7ba75488C6D6DC934B1d089c": "<AddressReachability.UNREACHABLE: 2>", "0xeb9E01c4105b2A417b17221022CEC6e8f5763f4b": "<AddressReachability.UNREACHABLE: 2>", "0x9d928AB2e1d0D54C9814029BC51E3fA1B074e128": "<AddressReachability.UNREACHABLE: 2>", "0x6cc9f4A546EA013d6eA76C54c99137B2a2b8200e": "<AddressReachability.UNREACHABLE: 2>", "0xB94e524511a99D739F1bC04241Ef14513C2929BA": "<AddressReachability.UNREACHABLE: 2>", "0x2aeDD36d02F0fC3178B653f7E1096fDc82F256a9": "<AddressReachability.UNREACHABLE: 2>", "0xf5322350d1598674FcB7737C1fDf1257B8AeA788": "<AddressReachability.UNREACHABLE: 2>", "0x93173CC3b293c6AdcD51D4BeC1E92F87935DABf0": "<AddressReachability.UNREACHABLE: 2>", "0x0E60f242c9D1017100e54D3985EbD9C1e5Fc646D": "<AddressReachability.UNREACHABLE: 2>", "0xe85baEd489E21b36612993D90B4BD9fF32c8be4C": "<AddressReachability.UNREACHABLE: 2>", "0x9B19E063cFADb4109Ea60287f0D52BE7B6011392": "<AddressReachability.UNREACHABLE: 2>", "0x72D3bDa01a28AF1035e4BcBB3FAE700335F6bd08": "<AddressReachability.UNREACHABLE: 2>", "0x10918f8e0A0Fe15A36bCCD938899f05cdEaBc4aE": "<AddressReachability.UNREACHABLE: 2>"}, "event": "Finding paths for payment", "level": "debug", "logger": "pathfinding_service.model.token_network", "timestamp": "2019-11-13 16:50:24.645558"}
pfs-goerli-with-fee_1 | {"error": "NoRouteFound(None)", "details": {"from_": "0xe85baEd489E21b36612993D90B4BD9fF32c8be4C", "to": "0x10918f8e0A0Fe15A36bCCD938899f05cdEaBc4aE", "value": 1000000000000000}, "message": "No route between nodes found.", "event": "Error while handling request", "level": "warning", "logger": "pathfinding_service.api", "timestamp": "2019-11-13 16:50:24.653477"}
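The NoRouteFound in the second line follows directly from the reachability map in the first: every node in the network, including the payment's source and target, is UNREACHABLE from the PFS's point of view, and unreachable nodes cannot be part of a route. A minimal illustration of that check (not the actual pathfinding_service code):

# Minimal illustration of why the request fails: the PFS will not route through
# nodes it considers unreachable, and here even source and target are unreachable.
# This is not the actual pathfinding_service implementation.
UNREACHABLE = "<AddressReachability.UNREACHABLE: 2>"

reachabilities = {
    # abbreviated from the log line above; all 17 entries have the same value
    "0xe85baEd489E21b36612993D90B4BD9fF32c8be4C": UNREACHABLE,  # source
    "0x10918f8e0A0Fe15A36bCCD938899f05cdEaBc4aE": UNREACHABLE,  # target
}

def can_route(source, target, reachabilities):
    """A route can only exist if both endpoints (and every hop) are reachable."""
    return (
        reachabilities.get(source) != UNREACHABLE
        and reachabilities.get(target) != UNREACHABLE
    )

assert not can_route(
    "0xe85baEd489E21b36612993D90B4BD9fF32c8be4C",
    "0x10918f8e0A0Fe15A36bCCD938899f05cdEaBc4aE",
    reachabilities,
)  # -> the PFS answers with NoRouteFound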

Logs

Archive 4.zip

hackaugusto commented 4 years ago

This is the request:

  "details": {
    "from_": "0xe85baEd489E21b36612993D90B4BD9fF32c8be4C",
    "to": "0x10918f8e0A0Fe15A36bCCD938899f05cdEaBc4aE",
    "value": 1000000000000000
  },

It was requested twice, and it failed at these times:

  "timestamp": "2019-11-13 16:44:07.536268"
  "timestamp": "2019-11-13 16:50:24.653477"

This is the first time the initiator became reachable to the PFS, just after the PFS saw the channel open event:

  "address": "0xe85baEd489E21b36612993D90B4BD9fF32c8be4C",
  "prev_state": "<AddressReachability.UNKNOWN: 3>",
  "state": "<AddressReachability.REACHABLE: 1>",
  "timestamp": "2019-11-13 16:37:36.725033"

And it goes offline at this time:

  "address": "0xe85baEd489E21b36612993D90B4BD9fF32c8be4C",
  "prev_state": "<AddressReachability.REACHABLE: 1>",
  "state": "<AddressReachability.UNREACHABLE: 2>",
  "timestamp": "2019-11-13 16:41:07.128322"

That is 3 minutes before the first failed request.
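The AddressReachability values above are derived from Matrix presence updates. A rough sketch of that mapping, for orientation only (the exact logic lives in the Raiden transport / PFS code and may handle more cases):

# Illustrative sketch only: how Matrix presence plausibly maps to the
# AddressReachability values seen in the PFS logs (enum values match the logs
# above; the actual Raiden/PFS mapping may differ).
from enum import Enum


class AddressReachability(Enum):
    REACHABLE = 1
    UNREACHABLE = 2
    UNKNOWN = 3


PRESENCE_TO_REACHABILITY = {
    "online": AddressReachability.REACHABLE,
    "unavailable": AddressReachability.REACHABLE,  # assumption: idle still counts as reachable
    "offline": AddressReachability.UNREACHABLE,
}


def reachability_from_presence(presence: str) -> AddressReachability:
    return PRESENCE_TO_REACHABILITY.get(presence, AddressReachability.UNKNOWN)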

Edit: The node 0xe85ba did call sync with a maximum delay of 20s, so its status should be online. This was checked with:

  cat node_13_003/run-013.log | grep "Sync called" | jq .timestamp -r | python gap.py

Edit 2: The presence test running on the matrix server used by the PFS also sees the node as offline. This is the only entry for that address in that log file, checked with:

  ssh transport4 'cat /var/log/presence.log* | grep -i e85baed489e21b36612993d90b4bd9ff32c8be4c'

{
  "server": "https://transport04.raiden.network",
  "user_id": "@0xe85baed489e21b36612993d90b4bd9ff32c8be4c:transport01.raiden.network",
  "presence": "<UserPresence.OFFLINE: 'offline'>",
  "timestamp": "2019-11-14 11:43:26.196769"
}

Edit 3: The node and the PFS were connected to different matrix servers: the PFS was connected to transport4, the node to transport1. Here is the presence test for transport1, checked with:

  ssh transport01 'cat /var/log/presence.log.3 /var/log/presence.log.2 /var/log/presence.log.1 /var/log/presence.log | grep -i e85baed489e21b36612993d90b4bd9ff32c8be4c'

{"p":"<UserPresence.OFFLINE: 'offline'>","ts":"2019-11-13 16:35:53.460142"}
{"p":"<UserPresence.ONLINE: 'online'>","ts":"2019-11-13 16:35:58.296736"}
{"p":"<UserPresence.ONLINE: 'online'>","ts":"2019-11-13 16:37:02.595703"}
{"p":"<UserPresence.ONLINE: 'online'>","ts":"2019-11-13 16:39:21.239484"}
{"p":"<UserPresence.ONLINE: 'online'>","ts":"2019-11-13 16:40:37.314553"}
{"p":"<UserPresence.OFFLINE: 'offline'>","ts":"2019-11-13 16:41:07.160928"}
{"p":"<UserPresence.OFFLINE: 'offline'>","ts":"2019-11-13 16:47:09.503588"}
{"p":"<UserPresence.OFFLINE: 'offline'>","ts":"2019-11-13 16:50:23.804801"}
{"p":"<UserPresence.OFFLINE: 'offline'>","ts":"2019-11-14 11:43:27.902274"}

Edit 4: The transport1 synchrotron is receiving the sync requests every 20s (from 2019-11-13 16:35:57 to 2019-11-13 16:51:31), checked with:

  ssh transport1 'for c in $(docker ps -qf name=synchrotron); do docker logs $c 2>&1; done | gzip' | gzip -d | grep -i e85baed489e21b36612993d90b4bd9ff32c8be4c | cut '-d ' -f1,2 | cut '-d,' -f1 | python gap.py

Edit 5: The user did not roam; there is only one user id:

>>> from raiden.network.transport.matrix.utils import first_login
>>> from raiden.network.transport.matrix.client import GMatrixClient
>>> from raiden.tests.utils.factories import make_signer
>>> from eth_utils import to_normalized_address
>>> # Log in with a throwaway signer and search the user directory on both servers
>>> s = make_signer()
>>> c = GMatrixClient("https://transport01.raiden.network")
>>> u = first_login(c, s, str(to_normalized_address(s.address)))
>>> c.search_user_directory(to_normalized_address("0xe85baEd489E21b36612993D90B4BD9fF32c8be4C"))
[<User id='@0xe85baed489e21b36612993d90b4bd9ff32c8be4c:transport01.raiden.network'>]
>>> c = GMatrixClient("https://transport04.raiden.network")
>>> u = first_login(c, s, str(to_normalized_address(s.address)))
>>> c.search_user_directory(to_normalized_address("0xe85baEd489E21b36612993D90B4BD9fF32c8be4C"))
[<User id='@0xe85baed489e21b36612993d90b4bd9ff32c8be4c:transport01.raiden.network'>]
>>> # Both homeservers return the same single user id, registered on transport01

Edit 6: This is the only entry in the synapse DB for the presence stream:

synapse=# select * from presence_stream where user_id = '@0xe85baed489e21b36612993d90b4bd9ff32c8be4c:transport01.raiden.network';
 stream_id |                                user_id                                 |  state  | last_active_ts | last_federation_update_ts | last_user_sync_ts | status_msg | currently_active
-----------+------------------------------------------------------------------------+---------+----------------+---------------------------+-------------------+------------+------------------
    119646 | @0xe85baed489e21b36612993d90b4bd9ff32c8be4c:transport01.raiden.network | offline |  1573663823479 |             1573663823479 |     1573662957900 |            | t
(1 row)

That corresponds to:

>>> import datetime
>>> # last sync ts
>>> datetime.datetime.fromtimestamp(1573662957900 / 1000)
datetime.datetime(2019, 11, 13, 13, 35, 57, 900000)
>>> # last active ts
>>> datetime.datetime.fromtimestamp(1573663823479 / 1000)
datetime.datetime(2019, 11, 13, 13, 50, 23, 479000)
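Note that fromtimestamp converts to the local timezone of the machine the snippet was run on (evidently UTC-3 here); in UTC the same values line up with the log timestamps above:

>>> # same values in UTC
>>> datetime.datetime.utcfromtimestamp(1573662957900 / 1000)
datetime.datetime(2019, 11, 13, 16, 35, 57, 900000)
>>> datetime.datetime.utcfromtimestamp(1573663823479 / 1000)
datetime.datetime(2019, 11, 13, 16, 50, 23, 479000)

So last_user_sync_ts corresponds to the very first sync (16:35:57), while last_active_ts corresponds to 16:50:23, the time of the last offline presence reading above.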

Note: the last sync ts being low is not a problem, since a user is not set offline because of the active ts. This is the last entry for the node in the synchrotron log:

2019-11-13 16:51:22,973 - synapse.util.caches.response_cache - 148 - INFO - GET-562733 - [sync]: no cached result for [(@0xe85baed489e21b36612993d90b4bd9ff32c8be4c:transport01.raiden.network, 20000, 's232243_119634_0_1_100_1_1_252771_1', None, False, 'QPUJWOVMPF')], calculating new one
2019-11-13 16:51:31,107 - synapse.handlers.sync - 935 - INFO - GET-562733 - Calculating sync response for @0xe85baed489e21b36612993d90b4bd9ff32c8be4c:transport01.raiden.network between StreamToken(room_key='s232243', presence_key='119634', typing_key='0', receipt_key='1', account_data_key='100', push_rules_key='1', to_device_key='1', device_list_key='252771', groups_key='1') and StreamToken(room_key='s232243', presence_key=119634, typing_key=0, receipt_key=1, account_data_key=100, push_rules_key=1, to_device_key=1, device_list_key=252772, groups_key=1)
2019-11-13 16:51:31,119 - synapse.access.http.8008 - 302 - INFO - GET-562733 - 172.19.0.19 - 8008 - {@0xe85baed489e21b36612993d90b4bd9ff32c8be4c:transport01.raiden.network} Processed request: 8.148sec/-7.154sec (0.008sec, 0.001sec) (0.001sec/0.005sec/1) 0B 200! "GET /_matrix/client/r0/sync?timeout=20000&since=s232243_119634_0_1_100_1_1_252771_1&access_token=<redacted> HTTP/1.1" "python-requests/2.22.0" [0 dbevts]

Which corresponds to:

Token(room_key='s232243', presence_key='119634', typing_key='0', receipt_key='1', account_data_key='100', push_rules_key='1', to_device_key='1', device_list_key='252771', groups_key='1')
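For context on the note above about the active ts: Synapse expires presence on server-side timers. Very roughly, and explicitly not the actual Synapse code (the constants below are approximations):

# Heavily simplified sketch of presence timeout handling; the real logic lives
# in Synapse's presence handler and the constants here are approximations.
IDLE_TIMER_MS = 5 * 60 * 1000        # assumed: online -> unavailable after ~5 min inactivity
SYNC_ONLINE_TIMEOUT_MS = 30 * 1000   # assumed: -> offline after ~30s without a /sync


def apply_presence_timeouts(presence, last_active_ts, last_user_sync_ts,
                            currently_syncing, now_ms):
    # Inactivity only downgrades online to unavailable, it does not force offline.
    if presence == "online" and now_ms - last_active_ts > IDLE_TIMER_MS:
        presence = "unavailable"
    # Only a user that is no longer syncing is timed out to offline.
    if not currently_syncing and now_ms - last_user_sync_ts > SYNC_ONLINE_TIMEOUT_MS:
        presence = "offline"
    return presence

If that is roughly right, the low last_user_sync_ts should indeed not matter as long as the synchrotron keeps reporting the user as syncing, and a node that re-syncs every ~20 seconds should never be timed out to offline, which is what makes the recorded offline state surprising.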

Walkthrough of the matrix presence handling through the sync API (it may contain errors)

To set the calling user presence:

To retrieve the presence the user is interested in:
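In client-server terms (a rough sketch of both steps, based on the Matrix r0 spec rather than the Synapse internals referenced above): the caller's own presence is set as a side effect of long-polling /sync, controlled by the set_presence query parameter, and the presence of peers comes back in the presence.events section of the sync response. HOMESERVER and ACCESS_TOKEN below are placeholders.

# Rough sketch of presence via /sync, following the Matrix r0 client-server spec.
import requests

HOMESERVER = "https://transport01.raiden.network"
ACCESS_TOKEN = "..."  # placeholder

since = None
while True:
    params = {
        "timeout": 20000,          # long-poll for up to 20s, as the Raiden nodes do
        "set_presence": "online",  # syncing with this keeps the calling user online
    }
    if since:
        params["since"] = since
    resp = requests.get(
        f"{HOMESERVER}/_matrix/client/r0/sync",
        params=params,
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    ).json()
    since = resp["next_batch"]

    # Presence of other users arrives as m.presence events in the sync response.
    for event in resp.get("presence", {}).get("events", []):
        print(event["sender"], event["content"]["presence"])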

ulope commented 4 years ago

Post-mortem:

It has not been conclusively shown whether split-mode is or isn't the cause for the flaky presence. This will be investigated with a separate transport installation.