This is the request:

```json
"details": {
    "from_": "0xe85baEd489E21b36612993D90B4BD9fF32c8be4C",
    "to": "0x10918f8e0A0Fe15A36bCCD938899f05cdEaBc4aE",
    "value": 1000000000000000
},
```
It was requested twice, and it failed at these times:
"timestamp": "2019-11-13 16:44:07.536268"
"timestamp": "2019-11-13 16:50:24.653477"
This is the first time the initiator becomes available to the PFS, just after the PFS sees the channel open event:

```
"address": "0xe85baEd489E21b36612993D90B4BD9fF32c8be4C",
"prev_state": "<AddressReachability.UNKNOWN: 3>",
"state": "<AddressReachability.REACHABLE: 1>",
"timestamp": "2019-11-13 16:37:36.725033"
```
And it goes offline at this time, 3 minutes before the first request:

```
"address": "0xe85baEd489E21b36612993D90B4BD9fF32c8be4C",
"prev_state": "<AddressReachability.REACHABLE: 1>",
"state": "<AddressReachability.UNREACHABLE: 2>",
"timestamp": "2019-11-13 16:41:07.128322"
```
Edit: The node 0xe85ba did call sync with a maximum delay of 20s, so its status should be online. This was checked with gap.py:

```shell
cat node_13_003/run-013.log | grep "Sync called" | jq .timestamp -r | python gap.py
```
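gap.py itself is not included in the issue; here is a minimal sketch of what it plausibly does, assuming it reads one ISO-8601 timestamp per line from stdin and reports the largest gap between consecutive entries (the real script may differ):

```python
# Hypothetical reconstruction of gap.py, not the actual script.
# Reads one ISO-8601 timestamp per line from stdin (assumed to be in
# chronological order) and prints the largest gap between neighbours.
import sys
from datetime import datetime

timestamps = [
    datetime.fromisoformat(line.strip())
    for line in sys.stdin
    if line.strip()
]
start, end = max(zip(timestamps, timestamps[1:]), key=lambda p: p[1] - p[0])
print(f"{start} | {end} | {end - start}")
```

The same script is reused in Edit4 and for the "Biggest difference between sync calls" line further below, whose `start | end | delta` shape matches this output.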
Edit2: The presence test running on the matrix server used by the PFS also sees the node as offline. This is the only entry for that address in that log file, checked with:

```shell
ssh transport4 'cat /var/log/presence.log* | grep -i e85baed489e21b36612993d90b4bd9ff32c8be4c'
```
```json
{
  "server": "https://transport04.raiden.network",
  "user_id": "@0xe85baed489e21b36612993d90b4bd9ff32c8be4c:transport01.raiden.network",
  "presence": "<UserPresence.OFFLINE: 'offline'>",
  "timestamp": "2019-11-14 11:43:26.196769"
}
```
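The presence test script is also not shown; an equivalent one-shot probe could use the standard Matrix client-server presence endpoint (plain requests and the access token are assumptions here, not the real tooling):

```python
import requests

SERVER = "https://transport04.raiden.network"
USER_ID = "@0xe85baed489e21b36612993d90b4bd9ff32c8be4c:transport01.raiden.network"
ACCESS_TOKEN = "..."  # hypothetical: any account logged in to SERVER

# GET /presence/{userId}/status is the standard client-server endpoint
# for reading a user's presence as this homeserver currently sees it.
resp = requests.get(
    f"{SERVER}/_matrix/client/r0/presence/{USER_ID}/status",
    params={"access_token": ACCESS_TOKEN},
)
resp.raise_for_status()
print(resp.json()["presence"])  # "online" / "offline" / "unavailable"
```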
Edit3: The node and the PFS were connected to different matrix servers: the PFS to transport4, the node to transport1. Here is the presence test for transport1, checked with:

```shell
ssh transport01 'cat /var/log/presence.log.3 /var/log/presence.log.2 /var/log/presence.log.1 /var/log/presence.log | grep -i e85baed489e21b36612993d90b4bd9ff32c8be4c'
```
{"p":"<UserPresence.OFFLINE: 'offline'>","ts":"2019-11-13 16:35:53.460142"}
{"p":"<UserPresence.ONLINE: 'online'>","ts":"2019-11-13 16:35:58.296736"}
{"p":"<UserPresence.ONLINE: 'online'>","ts":"2019-11-13 16:37:02.595703"}
{"p":"<UserPresence.ONLINE: 'online'>","ts":"2019-11-13 16:39:21.239484"}
{"p":"<UserPresence.ONLINE: 'online'>","ts":"2019-11-13 16:40:37.314553"}
{"p":"<UserPresence.OFFLINE: 'offline'>","ts":"2019-11-13 16:41:07.160928"}
{"p":"<UserPresence.OFFLINE: 'offline'>","ts":"2019-11-13 16:47:09.503588"}
{"p":"<UserPresence.OFFLINE: 'offline'>","ts":"2019-11-13 16:50:23.804801"}
{"p":"<UserPresence.OFFLINE: 'offline'>","ts":"2019-11-14 11:43:27.902274"}
Edit4: The transport1 synchrotron is receiving the requests every 20s (from 2019-11-13 16:35:57 to 2019-11-13 16:51:31), checked with:

```shell
ssh transport1 'for c in $(docker ps -qf name=synchrotron); do docker logs $c 2>&1; done | gzip' | gzip -d | grep -i e85baed489e21b36612993d90b4bd9ff32c8be4c | cut '-d ' -f1,2 | cut '-d,' -f1 | python gap.py
```
Edit5: The user did not roam; there is only one user id:

```python
>>> from raiden.network.transport.matrix.utils import first_login
>>> from raiden.network.transport.matrix.client import GMatrixClient
>>> from raiden.tests.utils.factories import make_signer
>>> from eth_utils import to_normalized_address
>>> s = make_signer()
>>> c = GMatrixClient("https://transport01.raiden.network")
>>> u = first_login(c, s, str(to_normalized_address(s.address)))
>>> c.search_user_directory(to_normalized_address("0xe85baEd489E21b36612993D90B4BD9fF32c8be4C"))
[<User id='@0xe85baed489e21b36612993d90b4bd9ff32c8be4c:transport01.raiden.network'>]
>>> c = GMatrixClient("https://transport04.raiden.network")
>>> u = first_login(c, s, str(to_normalized_address(s.address)))
>>> c.search_user_directory(to_normalized_address("0xe85baEd489E21b36612993D90B4BD9fF32c8be4C"))
[<User id='@0xe85baed489e21b36612993d90b4bd9ff32c8be4c:transport01.raiden.network'>]
```
Edit6: This is the only entry in the synapse db for the presence stream:

```
synapse=# select * from presence_stream where user_id = '@0xe85baed489e21b36612993d90b4bd9ff32c8be4c:transport01.raiden.network';
 stream_id |                                 user_id                                 |  state  | last_active_ts | last_federation_update_ts | last_user_sync_ts | status_msg | currently_active
-----------+-------------------------------------------------------------------------+---------+----------------+---------------------------+-------------------+------------+------------------
    119646 | @0xe85baed489e21b36612993d90b4bd9ff32c8be4c:transport01.raiden.network  | offline |  1573663823479 |             1573663823479 |     1573662957900 |            | t
(1 row)
```
That corresponds to:

```python
>>> # last sync ts
>>> datetime.datetime.fromtimestamp(1573662957900 / 1000)
datetime.datetime(2019, 11, 13, 13, 35, 57, 900000)
>>> # last active ts
>>> datetime.datetime.fromtimestamp(1573663823479 / 1000)
datetime.datetime(2019, 11, 13, 13, 50, 23, 479000)
```
Note: the last sync ts being low is not a problem, since a user is not set offline because of the active ts (a simplified timeout model is sketched below, after the log excerpt).

This is the last entry for the node in the synchrotron log:
```
2019-11-13 16:51:22,973 - synapse.util.caches.response_cache - 148 - INFO - GET-562733 - [sync]: no cached result for [(@0xe85baed489e21b36612993d90b4bd9ff32c8be4c:transport01.raiden.network, 20000, 's232243_119634_0_1_100_1_1_252771_1', None, False, 'QPUJWOVMPF')], calculating new one
2019-11-13 16:51:31,107 - synapse.handlers.sync - 935 - INFO - GET-562733 - Calculating sync response for @0xe85baed489e21b36612993d90b4bd9ff32c8be4c:transport01.raiden.network between StreamToken(room_key='s232243', presence_key='119634', typing_key='0', receipt_key='1', account_data_key='100', push_rules_key='1', to_device_key='1', device_list_key='252771', groups_key='1') and StreamToken(room_key='s232243', presence_key=119634, typing_key=0, receipt_key=1, account_data_key=100, push_rules_key=1, to_device_key=1, device_list_key=252772, groups_key=1)
2019-11-13 16:51:31,119 - synapse.access.http.8008 - 302 - INFO - GET-562733 - 172.19.0.19 - 8008 - {@0xe85baed489e21b36612993d90b4bd9ff32c8be4c:transport01.raiden.network} Processed request: 8.148sec/-7.154sec (0.008sec, 0.001sec) (0.001sec/0.005sec/1) 0B 200! "GET /_matrix/client/r0/sync?timeout=20000&since=s232243_119634_0_1_100_1_1_252771_1&access_token=<redacted> HTTP/1.1" "python-requests/2.22.0" [0 dbevts]
```
Which corresponds to:

```
StreamToken(room_key='s232243', presence_key='119634', typing_key='0', receipt_key='1', account_data_key='100', push_rules_key='1', to_device_key='1', device_list_key='252771', groups_key='1')
```
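For context on the earlier note about the sync ts, here is a simplified model of how synapse decides to time a user out, based on a reading of synapse/handlers/presence.py from around that release; the constants and branch structure are approximations, not the actual implementation:

```python
# Approximate model, not synapse's real code: "offline" is driven by the
# sync timestamp, and only when no sync is currently in flight; the active
# timestamp only demotes an online user to "unavailable".
IDLE_TIMER = 5 * 60 * 1000       # ms idle before online -> unavailable
SYNC_ONLINE_TIMEOUT = 30 * 1000  # ms without /sync (none ongoing) -> offline

def timed_out_state(state: str, last_active_ts: int, last_user_sync_ts: int,
                    now_ms: int, has_ongoing_sync: bool) -> str:
    if state == "online" and now_ms - last_active_ts > IDLE_TIMER:
        return "unavailable"
    if not has_ongoing_sync and now_ms - last_user_sync_ts > SYNC_ONLINE_TIMEOUT:
        return "offline"
    return state

# With the presence_stream row above, the old last_user_sync_ts alone cannot
# flip the user to offline while synapse still counts an ongoing sync:
print(timed_out_state("online", 1573663823479, 1573662957900,
                      now_ms=1573663830000, has_ongoing_sync=True))  # online
```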
Walkthrough of the matrix presence through the sync API (it may contain errors):

To set the calling user's presence:

- /sync sets the presence (a minimal example follows after this walkthrough)

To retrieve the presence the user is interested in:

- generate_sync_result
- _generate_sync_entry_for_presence
- the filter key in the original request
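Concretely, the presence side effect rides on the same long-poll the nodes already make every 20s; set_presence is a standard query parameter of the sync API (a minimal sketch, with token bookkeeping elided and the access token assumed):

```python
import requests

HOMESERVER = "https://transport01.raiden.network"
ACCESS_TOKEN = "..."  # hypothetical: the node's login token

# One long-poll iteration: the request itself marks the caller online
# (set_presence), and the response carries presence events for the users
# the caller shares rooms with, which is what the PFS consumes.
resp = requests.get(
    f"{HOMESERVER}/_matrix/client/r0/sync",
    params={
        "timeout": 20000,  # long-poll for up to 20s, like the nodes do
        "set_presence": "online",
        "access_token": ACCESS_TOKEN,
    },
)
for event in resp.json().get("presence", {}).get("events", []):
    print(event["sender"], event["content"]["presence"])
```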
Post-mortem:

- … /sync …, which led to nodes being considered offline by the servers.
- The update_user_directory setting was in the synapse config template, not the actual config file, which is likely the cause for the user directory search issue seen on Monday. The state wipe on Monday night caused the config file to be regenerated, thereby fixing the setting.
- It has not been conclusively shown whether split-mode is or isn't the cause of the flaky presence. This will be investigated with a separate transport installation.
I ran the BF1 scenario on latest develop yesterday (2b6bcc233b5312788878e0d10f842ca425bb129b)
Biggest difference between sync calls:

```
2019-11-13 16:50:35.618260 | 2019-11-13 16:50:55.791961 | 0:00:20.173701
```

The sync calls now happen with a maximum difference of 20 seconds.

Presence issues:

However, the PFS still has wrong presence information for the nodes:
Logs: Archive 4.zip