Closed. achamayou closed this issue 4 years ago.
Another instance this morning: https://dev.azure.com/MSRC-CCF/CCF/_build/results?buildId=7921&view=ms.vss-test-web.build-test-results-tab
Fixed by #1136
Happened again, after #1136
Traceback (most recent call last):
File "/mnt/build/1/s/tests/late_joiners.py", line 270, in <module>
run(args)
File "/mnt/build/1/s/tests/late_joiners.py", line 238, in run
cant_fail=False,
File "/mnt/build/1/s/tests/late_joiners.py", line 141, in run_requests
result=True,
File "/mnt/build/1/s/tests/infra/checker.py", line 74, in __call__
self.node_client, rpc_result.commit, rpc_result.term
File "/mnt/build/1/s/tests/infra/checker.py", line 39, in wait_for_global_commit
f"Transaction ID {term}.{commit_index} is marked invalid and will never be committed"
RuntimeError: Transaction ID 3.254 is marked invalid and will never be committed
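For context, the failing check in tests/infra/checker.py boils down to a wait loop of roughly this shape (a hedged sketch only; the function signature, the status strings, and the timeout here are assumptions, not the real infra code):

```python
import time

# Hypothetical sketch of the check that raises this error: poll the node
# for the status of transaction view.seqno until it is globally committed,
# and fail fast if it can never be committed.
def wait_for_global_commit(get_tx_status, view, seqno, timeout=3.0):
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = get_tx_status(view, seqno)
        if status == "COMMITTED":
            return
        if status == "INVALID":
            raise RuntimeError(
                f"Transaction ID {view}.{seqno} is marked invalid "
                "and will never be committed"
            )
        time.sleep(0.1)  # still PENDING: retry until the deadline
    raise TimeoutError(f"Transaction ID {view}.{seqno} was not globally committed")
```

The error above is the INVALID branch firing: the node has told the client that 3.254 can never reach global commit, so retrying is pointless.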
2020-05-08 12:48:14.776 | INFO | infra.clients:log_request:147 - [127.35.146.218:57795] GET /nodes/getPrimaryInfo (node 0 (node))
2020-05-08 12:48:14.806 | DEBUG | infra.clients:log_response:160 - {'status': 200, 'result': {'current_term': 2, 'primary_host': '127.35.146.218', 'primary_id': 0, 'primary_port': '57795'}, 'error': None, 'commit': 219, 'term': 2, 'global_commit': 217}
2020-05-08 12:48:14.806 | INFO | infra.clients:log_request:147 - [127.179.115.70:59042] GET /users/LOG_get {'id': 1000} (node 3 (user 0))
2020-05-08 12:48:14.808 | WARNING | infra.clients:_request:387 - Got SSLError exception: HTTPSConnectionPool(host='127.179.115.70', port=59042): Max retries exceeded with url: /users/LOG_get?id=1000 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:1076)')))
...
Traceback (most recent call last):
File "/mnt/build/1/s/build/env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 677, in urlopen
chunked=chunked,
File "/mnt/build/1/s/build/env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 381, in _make_request
self._validate_conn(conn)
File "/mnt/build/1/s/build/env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 976, in _validate_conn
conn.connect()
File "/mnt/build/1/s/build/env/lib/python3.7/site-packages/urllib3/connection.py", line 370, in connect
ssl_context=context,
File "/mnt/build/1/s/build/env/lib/python3.7/site-packages/urllib3/util/ssl_.py", line 390, in ssl_wrap_socket
return context.wrap_socket(sock)
File "/usr/lib/python3.7/ssl.py", line 423, in wrap_socket
session=session
File "/usr/lib/python3.7/ssl.py", line 870, in _create
self.do_handshake()
File "/usr/lib/python3.7/ssl.py", line 1139, in do_handshake
self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:1076)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/mnt/build/1/s/build/env/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/mnt/build/1/s/build/env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 725, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/mnt/build/1/s/build/env/lib/python3.7/site-packages/urllib3/util/retry.py", line 439, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='127.179.115.70', port=59042): Max retries exceeded with url: /users/LOG_get?id=1000 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:1076)')))
This looks particularly suspicious, since it's node 3 against which TLS handshakes fail. It suggests something goes wrong during the join.
2020-05-08T12:48:13.829676Z -0.627 0 [info ] ode/rpc/../rpc/member_frontend.h:240 | Node 3 is now TRUSTED
2020-05-08T12:48:13.829718Z 100 [fail ] ../src/host/node_connections.h:301 | Cannot add node 3: already in use
Same error, at id 1000 again: https://dev.azure.com/MSRC-CCF/CCF/_build/results?buildId=8210&view=logs&j=5435e0ac-25e5-5426-50be-61b0d0ea8d34&t=63d660c3-6f76-573a-0154-fd0bcb096ded
Nodes 0, 1 and 2 all log "Cannot add node 3: already in use". That does seem to happen on successful tests as well though, so it's likely a red herring.
I think what is happening here is that node 3 was slow to catch up, so in some cases it needs more time.
"Cannot add node 3" happens on every test run after the node has become TRUSTED. We should probably look at why that is logged, but it has nothing to do with the test failure.
@olgavrou is that right? It's a three-second timeout for 1000 tx on a release build; it ought to be more than enough (quote verification is disabled here).
2020-05-13 12:15:59.027 | INFO | infra.clients:log_request:151 - [127.21.125.197:44629] POST /users/LOG_record {'id': 4001, 'msg': 'Goodbye, world!'} (node 1 (user 0))
2020-05-13 12:16:02.046 | ERROR | __main__:wait_for_nodes:117 - Timeout error for LOG_get on node 1
2020-05-13 12:16:02.146 | INFO | infra.clients:log_request:151 - [127.21.125.197:44629] POST /users/LOG_record {'id': 4001, 'msg': 'Goodbye, world!'} (node 1 (user 0))
2020-05-13 12:16:05.182 | ERROR | __main__:wait_for_nodes:117 - Timeout error for LOG_get on node 1
2020-05-13 12:16:05.282 | INFO | infra.clients:log_request:151 - [127.21.125.197:44629] POST /users/LOG_record {'id': 4001, 'msg': 'Goodbye, world!'} (node 1 (user 0))
2020-05-13 12:16:08.342 | ERROR | __main__:wait_for_nodes:117 - Timeout error for LOG_get on node 1
2020-05-13 12:16:08.442 | INFO | infra.clients:log_request:151 - [127.21.125.197:44629] POST /users/LOG_record {'id': 4001, 'msg': 'Goodbye, world!'} (node 1 (user 0))
2020-05-13 12:16:11.497 | ERROR | __main__:wait_for_nodes:117 - Timeout error for LOG_get on node 1
2020-05-13 12:16:11.598 | INFO | infra.clients:log_request:151 - [127.21.125.197:44629] POST /users/LOG_record {'id': 4001, 'msg': 'Goodbye, world!'} (node 1 (user 0))
2020-05-13 12:16:14.642 | ERROR | __main__:wait_for_nodes:117 - Timeout error for LOG_get on node 1
2020-05-13 12:16:14.742 | INFO | infra.clients:log_request:151 - [127.21.125.197:44629] POST /users/LOG_record {'id': 4001, 'msg': 'Goodbye, world!'} (node 1 (user 0))
2020-05-13 12:16:17.793 | ERROR | __main__:wait_for_nodes:117 - Timeout error for LOG_get on node 1
2020-05-13 12:16:17.894 | INFO | infra.clients:log_request:151 - [127.21.125.197:44629] POST /users/LOG_record {'id': 4001, 'msg': 'Goodbye, world!'} (node 1 (user 0))
Node 1 log:
2020-05-13T12:13:01.419028Z -1.767 0 [info ] /consensus/pbft/libbyz/state.cpp:475 | Rolling back to checkpoint before 131
2020-05-13T12:13:01.419043Z -1.767 0 [info ] nsensus/pbft/libbyz/replica.cpp:2147 | Rolled back in view change to seqno 131, to version 263, last_executed was 131, last_tentative_execute was 132, last gb seqno 131, last gb version was 262
2020-05-13T12:13:01.419055Z -1.767 0 [info ] nsensus/pbft/libbyz/replica.cpp:2155 | Roll back done, last tentative execute and last executed are 131 131
2020-05-13T12:13:01.423149Z -1.771 0 [info ] sensus/pbft/libbyz/view_info.cpp:312 | Sending view change view: 1
2020-05-13T12:13:04.985625Z -1.937 0 [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:13:15.534689Z -2.483 0 [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:13:18.282298Z -2.634 0 [info ] nsensus/pbft/libbyz/replica.cpp:1780 | Received view change for 1 from 0 with digest 7592741314972068999, v: 1
2020-05-13T12:13:18.282331Z -2.634 0 [info ] nsensus/pbft/libbyz/replica.cpp:1976 | Process new view: 1 min: 112 max: 133 ms: 112 last_stable: 112 last_executed: 131 last_tentative_execute: 131
2020-05-13T12:13:18.282344Z -2.634 0 [info ] nsensus/pbft/libbyz/replica.cpp:1990 | Sending new view for 1
2020-05-13T12:13:18.285115Z -2.636 0 [info ] nsensus/pbft/libbyz/replica.cpp:2100 | Done with process new view 1
2020-05-13T12:13:26.031275Z -2.951 0 [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:13:26.074341Z -2.974 0 [info ] onsensus/pbft/libbyz/replica.cpp:206 | did not verify - m:1
2020-05-13T12:13:26.074380Z -2.974 0 [info ] onsensus/pbft/libbyz/replica.cpp:206 | did not verify - m:1
2020-05-13T12:13:36.443629Z -3.352 0 [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:13:46.871051Z -3.766 0 [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:13:57.294301Z -4.187 0 [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:14:07.713243Z -4.598 0 [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:14:18.163299Z -5.041 0 [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:14:28.618344Z -5.492 0 [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:14:39.044685Z -5.911 0 [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:14:49.507623Z -6.369 0 [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:14:59.983432Z -6.838 0 [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:15:10.446348Z -7.292 0 [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:15:20.877717Z -7.718 0 [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:15:31.338319Z -8.173 0 [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:15:41.747686Z -8.574 0 [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:15:52.183689Z -9.005 0 [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:16:02.639617Z -9.459 0 [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:16:13.058363Z -9.875 0 [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
Another occurrence on the Daily build: https://dev.azure.com/MSRC-CCF/CCF/_build/results?buildId=8284&view=ms.vss-test-web.build-test-results-tab&runId=42918&resultId=100032&paneView=attachments
2020-05-13 06:17:39.309 | INFO | infra.remote:log_errors:71 - 2 errors found, printing end of output for context:
2020-05-13 06:17:39.309 | INFO | infra.remote:log_errors:74 - 2020-05-13T06:17:38.850899Z -1.401 0 [info ] nsensus/pbft/libbyz/replica.cpp:2147 | Rolled back in view change to seqno 180, to version 360, last_executed was 180, last_tentative_execute was 181, last gb seqno 180, last gb version was 359
2020-05-13 06:17:39.309 | INFO | infra.remote:log_errors:74 - 2020-05-13T06:17:38.850902Z -1.401 0 [info ] nsensus/pbft/libbyz/replica.cpp:2155 | Roll back done, last tentative execute and last executed are 180 180
2020-05-13 06:17:39.309 | INFO | infra.remote:log_errors:74 - 2020-05-13T06:17:38.854674Z -1.404 0 [info ] sensus/pbft/libbyz/view_info.cpp:312 | Sending view change view: 1
2020-05-13 06:17:39.309 | INFO | infra.remote:log_errors:74 - 2020-05-13T06:17:38.854702Z -1.405 0 [info ] nsensus/pbft/libbyz/replica.cpp:1812 | Starting view change timer for view 1
2020-05-13 06:17:39.310 | INFO | infra.remote:log_errors:74 - 2020-05-13T06:17:38.854783Z -1.405 0 [info ] nsensus/pbft/libbyz/replica.cpp:1780 | Received view change for 1 from 3 with digest 2111869250313449187, v: 1
2020-05-13 06:17:39.310 | INFO | infra.remote:log_errors:74 - 2020-05-13T06:17:38.854786Z -1.405 0 [info ] nsensus/pbft/libbyz/replica.cpp:1780 | Received view change for 1 from 1 with digest 4926701758888857377, v: 1
2020-05-13 06:17:39.310 | INFO | infra.remote:log_errors:74 - 2020-05-13T06:17:39.140280Z -1.405 0 [info ] nsensus/pbft/libbyz/replica.cpp:1822 | Received new view for 1 from 1
2020-05-13 06:17:39.310 | INFO | infra.remote:log_errors:74 - 2020-05-13T06:17:39.140386Z -1.405 0 [info ] nsensus/pbft/libbyz/replica.cpp:1976 | Process new view: 1 min: 144 max: 182 ms: 144 last_stable: 144 last_executed: 180 last_tentative_execute: 180
2020-05-13 06:17:39.310 | INFO | infra.remote:log_errors:74 - 2020-05-13T06:17:39.205172Z -1.470 0 [info ] nsensus/pbft/libbyz/replica.cpp:2100 | Done with process new view 1
2020-05-13 06:17:39.310 | INFO | infra.remote:log_errors:74 - 2020-05-13T06:17:39.245182Z 100 [info ] ../src/host/sig_term.h:24 | SIGTERM: Shutting down enclave gracefully...
2020-05-13 06:17:39.310 | INFO | infra.remote:stop:455 - [127.235.155.123] closing
2020-05-13 06:17:39.374 | ERROR | infra.remote:log_errors:67 - /mnt/build/1/s/build/workspace/late_joiners_pbft_1/out: 2020-05-13T06:16:47.112113Z 100 [fail ] ../src/host/node_connections.h:301 | Cannot add node 3: already in use
2020-05-13 06:17:39.374 | ERROR | infra.remote:log_errors:67 - /mnt/build/1/s/build/workspace/late_joiners_pbft_1/out: 2020-05-13T06:17:39.136272Z -3.304 0 [fail ] ../src/enclave/rpc_sessions.h:82 | Replying to unknown session 10
2020-05-13 06:17:39.374 | INFO | infra.remote:log_errors:71 - 2 errors found, printing end of output for context:
2020-05-13 06:17:39.374 | INFO | infra.remote:log_errors:74 - 2020-05-13T06:17:17.451925Z -2.028 0 [info ] sensus/pbft/libbyz/view_info.cpp:312 | Sending view change view: 1
2020-05-13 06:17:39.374 | INFO | infra.remote:log_errors:74 - 2020-05-13T06:17:19.799127Z -2.121 0 [info ] nsensus/pbft/libbyz/replica.cpp:1780 | Received view change for 1 from 3 with digest 2111869250313449187, v: 1
2020-05-13 06:17:39.374 | INFO | infra.remote:log_errors:74 - 2020-05-13T06:17:20.153305Z -2.129 0 [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 25
2020-05-13 06:17:39.375 | INFO | infra.remote:log_errors:74 - 2020-05-13T06:17:30.610367Z -2.584 0 [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 25
2020-05-13 06:17:39.375 | INFO | infra.remote:log_errors:74 - 2020-05-13T06:17:39.136272Z -3.304 0 [fail ] ../src/enclave/rpc_sessions.h:82 | Replying to unknown session 10
2020-05-13 06:17:39.375 | INFO | infra.remote:log_errors:74 - 2020-05-13T06:17:39.139831Z -3.279 0 [info ] nsensus/pbft/libbyz/replica.cpp:1780 | Received view change for 1 from 0 with digest 404585451443329397, v: 1
2020-05-13 06:17:39.375 | INFO | infra.remote:log_errors:74 - 2020-05-13T06:17:39.139902Z -3.279 0 [info ] nsensus/pbft/libbyz/replica.cpp:1976 | Process new view: 1 min: 144 max: 182 ms: 144 last_stable: 144 last_executed: 179 last_tentative_execute: 179
2020-05-13 06:17:39.375 | INFO | infra.remote:log_errors:74 - 2020-05-13T06:17:39.139905Z -3.279 0 [info ] nsensus/pbft/libbyz/replica.cpp:1990 | Sending new view for 1
2020-05-13 06:17:39.375 | INFO | infra.remote:log_errors:74 - 2020-05-13T06:17:39.141935Z -3.281 0 [info ] nsensus/pbft/libbyz/replica.cpp:2100 | Done with process new view 1
2020-05-13 06:17:39.375 | INFO | infra.remote:log_errors:74 - 2020-05-13T06:17:39.310655Z 100 [info ] ../src/host/sig_term.h:24 | SIGTERM: Shutting down enclave gracefully...
2020-05-13 06:17:39.375 | INFO | infra.remote:stop:455 - [127.102.164.222] closing
2020-05-13 06:17:39.439 | INFO | infra.ccf:stop_all_nodes:294 - All remotes stopped...
Traceback (most recent call last):
File "/mnt/build/1/s/tests/late_joiners.py", line 270, in <module>
run(args)
File "/mnt/build/1/s/tests/late_joiners.py", line 238, in run
ignore_failures=True,
File "/mnt/build/1/s/tests/late_joiners.py", line 149, in run_requests
wait_for_nodes(nodes, final_msg, final_msg_id)
File "/mnt/build/1/s/tests/late_joiners.py", line 113, in wait_for_nodes
result=True,
File "/mnt/build/1/s/tests/infra/checker.py", line 74, in __call__
self.node_client, rpc_result.commit, rpc_result.term
File "/mnt/build/1/s/tests/infra/checker.py", line 39, in wait_for_global_commit
f"Transaction ID {term}.{commit_index} is marked invalid and will never be committed"
RuntimeError: Transaction ID 2.363 is marked invalid and will never be committed
The RuntimeError
"Transaction ID {term}.{commit_index} is marked invalid and will never be committed" (here, Transaction ID 3.176)
happens when we process a transaction, reply to the client via the client proxy, and then replica.cpp calls global commit for that transaction. The global commit callback to pbft.h updates the view too little too late, and when the client queries for this transaction ID again, it will be in the next view.
This is because the replicas send their RPC replies on prepare, but globally commit and call the globally_commit_cb on commit.
If it happens, for example, that:
- 0 was primary but after a view change 1 is primary
- we issue a LOG_record RPC to node 0
- all other replicas reply to replica 0's client proxy first
- replica 0 prepares and replies to its client proxy
- the client proxy has all the replies it needs, so it responds to the client with 2.x
- replica 0 commits, calls the global commit callback, and pbft.h updates its view to 3
- the client sends an RPC to replica 0 asking for transaction 2.x
- node 0 now has the information 3.x for x, so the transaction is INVALID
I believe this will be mitigated if we explicitly update Pbft's view in pbft.h when there is a view change.
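The sequence above can be condensed into a small sketch of the status decision a node makes when asked about 2.x after the view has moved on (hypothetical names and deliberately simplified rules, illustrating the race described here rather than the actual C++ in replica.cpp/pbft.h):

```python
# Sketch of how a node answers "what happened to transaction tx_view.tx_seqno?".
# node_view: the node's current view; view_at_seqno: the view of the entry the
# node holds at tx_seqno; global_commit: the node's global commit index.
def tx_status(tx_view, tx_seqno, node_view, view_at_seqno, global_commit):
    if tx_seqno <= global_commit:
        # The seqno is globally committed: the transaction is committed only
        # if the committed entry at that seqno belongs to the same view.
        return "COMMITTED" if view_at_seqno == tx_view else "INVALID"
    if node_view > tx_view:
        # The node has already moved past tx_view, so 2.x can never be
        # committed; some 3.x entry will occupy seqno x instead.
        return "INVALID"
    return "PENDING"
```

In the race described, the client proxy hands out 2.x at prepare time, but by the time the client asks, the node is already in view 3 at that seqno, so either branch reports INVALID — which is why updating Pbft's view eagerly on view change would help.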
Failures in both debug virtual and SGX this morning:
Traceback (most recent call last):
File "/mnt/build/1/s/tests/late_joiners.py", line 269, in <module>
run(args)
File "/mnt/build/1/s/tests/late_joiners.py", line 237, in run
ignore_failures=True,
File "/mnt/build/1/s/tests/late_joiners.py", line 140, in run_requests
result=True,
File "/mnt/build/1/s/tests/infra/checker.py", line 74, in __call__
self.node_client, rpc_result.commit, rpc_result.term
File "/mnt/build/1/s/tests/infra/checker.py", line 39, in wait_for_global_commit
f"Transaction ID {term}.{commit_index} is marked invalid and will never be committed"
RuntimeError: Transaction ID 2.281 is marked invalid and will never be committed
2020-05-15 16:53:46.654 | INFO | infra.ccf:stop_all_nodes:370 - All nodes stopped...
Traceback (most recent call last):
File "/mnt/build/2/s/tests/late_joiners.py", line 269, in <module>
run(args)
File "/mnt/build/2/s/tests/late_joiners.py", line 237, in run
ignore_failures=True,
File "/mnt/build/2/s/tests/late_joiners.py", line 140, in run_requests
result=True,
File "/mnt/build/2/s/tests/infra/checker.py", line 74, in __call__
self.node_client, rpc_result.commit, rpc_result.term
File "/mnt/build/2/s/tests/infra/checker.py", line 39, in wait_for_global_commit
f"Transaction ID {term}.{commit_index} is marked invalid and will never be committed"
RuntimeError: Transaction ID 2.305 is marked invalid and will never be committed
The bug in this issue is from #1157
The issue for which this issue was opened has been resolved. This test is currently stable, and if it fails again it will most likely be a different bug.
https://dev.azure.com/MSRC-CCF/CCF/_build/results?buildId=7906&view=ms.vss-test-web.build-test-results-tab&runId=41744&resultId=100028&paneView=attachments