microsoft / CCF

Confidential Consortium Framework
https://microsoft.github.io/CCF/
Apache License 2.0

Late Joiners PBFT Test Failures #1127

Closed: achamayou closed this issue 4 years ago

achamayou commented 4 years ago

https://dev.azure.com/MSRC-CCF/CCF/_build/results?buildId=7906&view=ms.vss-test-web.build-test-results-tab&runId=41744&resultId=100028&paneView=attachments

Traceback (most recent call last):
  File "/mnt/build/1/s/tests/late_joiners.py", line 270, in <module>
    run(args)
  File "/mnt/build/1/s/tests/late_joiners.py", line 238, in run
    cant_fail=False,
  File "/mnt/build/1/s/tests/late_joiners.py", line 141, in run_requests
    result=True,
  File "/mnt/build/1/s/tests/infra/checker.py", line 74, in __call__
    self.node_client, rpc_result.commit, rpc_result.term
  File "/mnt/build/1/s/tests/infra/checker.py", line 39, in wait_for_global_commit
    f"Transaction ID {term}.{commit_index} is marked invalid and will never be committed"
RuntimeError: Transaction ID 3.176 is marked invalid and will never be committed
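
For reference, the check that raises here polls the node until the transaction is globally committed, and fails fast once the node reports a different term at that seqno. A minimal sketch of that logic, assuming a hypothetical node_client.get helper and a getCommit-style query (the actual code lives in tests/infra/checker.py):

import time

def wait_for_global_commit(node_client, commit_index, term, timeout=3):
    # Poll until (term, commit_index) is globally committed; bail out as
    # soon as the node reports a different term at that index, since the
    # transaction was then rolled back and can never commit.
    end_time = time.time() + timeout
    while time.time() < end_time:
        response = node_client.get("getCommit", {"commit": commit_index})
        if response.result["term"] != term:
            raise RuntimeError(
                f"Transaction ID {term}.{commit_index} is marked invalid "
                f"and will never be committed"
            )
        if response.global_commit >= commit_index:
            return
        time.sleep(0.1)
    raise TimeoutError(f"{term}.{commit_index} was not globally committed in time")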
achamayou commented 4 years ago

Another instance this morning: https://dev.azure.com/MSRC-CCF/CCF/_build/results?buildId=7921&view=ms.vss-test-web.build-test-results-tab

achamayou commented 4 years ago

Fixed by #1136

achamayou commented 4 years ago

Happened again, after #1136

https://dev.azure.com/MSRC-CCF/CCF/_build/results?buildId=8005&view=ms.vss-test-web.build-test-results-tab&runId=42108&resultId=100028&paneView=attachments

Traceback (most recent call last):
  File "/mnt/build/1/s/tests/late_joiners.py", line 270, in <module>
    run(args)
  File "/mnt/build/1/s/tests/late_joiners.py", line 238, in run
    cant_fail=False,
  File "/mnt/build/1/s/tests/late_joiners.py", line 141, in run_requests
    result=True,
  File "/mnt/build/1/s/tests/infra/checker.py", line 74, in __call__
    self.node_client, rpc_result.commit, rpc_result.term
  File "/mnt/build/1/s/tests/infra/checker.py", line 39, in wait_for_global_commit
    f"Transaction ID {term}.{commit_index} is marked invalid and will never be committed"
RuntimeError: Transaction ID 3.254 is marked invalid and will never be committed
achamayou commented 4 years ago

Fresh instance: https://dev.azure.com/MSRC-CCF/CCF/_build/results?buildId=8204&view=ms.vss-test-web.build-test-results-tab&runId=42650&resultId=100032&paneView=attachments

2020-05-08 12:48:14.776 | INFO     | infra.clients:log_request:147 - [127.35.146.218:57795] GET /nodes/getPrimaryInfo (node 0 (node))
2020-05-08 12:48:14.806 | DEBUG    | infra.clients:log_response:160 - {'status': 200, 'result': {'current_term': 2, 'primary_host': '127.35.146.218', 'primary_id': 0, 'primary_port': '57795'}, 'error': None, 'commit': 219, 'term': 2, 'global_commit': 217}
2020-05-08 12:48:14.806 | INFO     | infra.clients:log_request:147 - [127.179.115.70:59042] GET /users/LOG_get {'id': 1000} (node 3 (user 0))
2020-05-08 12:48:14.808 | WARNING  | infra.clients:_request:387 - Got SSLError exception: HTTPSConnectionPool(host='127.179.115.70', port=59042): Max retries exceeded with url: /users/LOG_get?id=1000 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:1076)')))
...
Traceback (most recent call last):
  File "/mnt/build/1/s/build/env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/mnt/build/1/s/build/env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 381, in _make_request
    self._validate_conn(conn)
  File "/mnt/build/1/s/build/env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 976, in _validate_conn
    conn.connect()
  File "/mnt/build/1/s/build/env/lib/python3.7/site-packages/urllib3/connection.py", line 370, in connect
    ssl_context=context,
  File "/mnt/build/1/s/build/env/lib/python3.7/site-packages/urllib3/util/ssl_.py", line 390, in ssl_wrap_socket
    return context.wrap_socket(sock)
  File "/usr/lib/python3.7/ssl.py", line 423, in wrap_socket
    session=session
  File "/usr/lib/python3.7/ssl.py", line 870, in _create
    self.do_handshake()
  File "/usr/lib/python3.7/ssl.py", line 1139, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:1076)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/build/1/s/build/env/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/mnt/build/1/s/build/env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 725, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/mnt/build/1/s/build/env/lib/python3.7/site-packages/urllib3/util/retry.py", line 439, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='127.179.115.70', port=59042): Max retries exceeded with url: /users/LOG_get?id=1000 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:1076)')))
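
The failure above is the client-side certificate check: infra.clients (built on requests) verifies each node's TLS certificate against a pinned network certificate. A minimal sketch of the failing request, where the address comes from the log and networkcert.pem is an assumed name for the pinned certificate:

import requests

try:
    requests.get(
        "https://127.179.115.70:59042/users/LOG_get",
        params={"id": 1000},
        verify="networkcert.pem",  # trust anchor the test client pins
        timeout=3,
    )
except requests.exceptions.SSLError as e:
    # A node still presenting its initial self-signed certificate, rather
    # than one endorsed by the service, fails exactly like the log above.
    print(f"Got SSLError exception: {e}")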
achamayou commented 4 years ago

This looks particularly suspicious, since node 3 is the one against which the TLS handshakes fail. It suggests something goes wrong during the join.

2020-05-08T12:48:13.829676Z -0.627 0   [info ] ode/rpc/../rpc/member_frontend.h:240 | Node 3 is now TRUSTED
2020-05-08T12:48:13.829718Z        100 [fail ] ../src/host/node_connections.h:301   | Cannot add node 3: already in use
achamayou commented 4 years ago

Same error, at id 1000 again: https://dev.azure.com/MSRC-CCF/CCF/_build/results?buildId=8210&view=logs&j=5435e0ac-25e5-5426-50be-61b0d0ea8d34&t=63d660c3-6f76-573a-0154-fd0bcb096ded

Nodes 0, 1 and 2 all log "Cannot add node 3: already in use". That does seem to happen on successful tests as well though, so it's likely a red herring.

olgavrou commented 4 years ago

I think what is happening here is that node 3 was slow to catch up, so in some cases it needs more time

olgavrou commented 4 years ago

"Cannot add node 3" is logged on every test run after the node has become TRUSTED. We should probably look at why it is logged, but it has nothing to do with the test failure.

achamayou commented 4 years ago

@olgavrou is that right? It's a three-second timeout for 1000 transactions on a release build; that ought to be more than enough (quote verification is disabled here).

achamayou commented 4 years ago

https://dev.azure.com/MSRC-CCF/CCF/_build/results?buildId=8301&view=ms.vss-test-web.build-test-results-tab&runId=42968&resultId=100031&paneView=attachments

2020-05-13 12:15:59.027 | INFO     | infra.clients:log_request:151 - [127.21.125.197:44629] POST /users/LOG_record {'id': 4001, 'msg': 'Goodbye, world!'} (node 1 (user 0))
2020-05-13 12:16:02.046 | ERROR    | __main__:wait_for_nodes:117 - Timeout error for LOG_get on node 1
2020-05-13 12:16:02.146 | INFO     | infra.clients:log_request:151 - [127.21.125.197:44629] POST /users/LOG_record {'id': 4001, 'msg': 'Goodbye, world!'} (node 1 (user 0))
2020-05-13 12:16:05.182 | ERROR    | __main__:wait_for_nodes:117 - Timeout error for LOG_get on node 1
2020-05-13 12:16:05.282 | INFO     | infra.clients:log_request:151 - [127.21.125.197:44629] POST /users/LOG_record {'id': 4001, 'msg': 'Goodbye, world!'} (node 1 (user 0))
2020-05-13 12:16:08.342 | ERROR    | __main__:wait_for_nodes:117 - Timeout error for LOG_get on node 1
2020-05-13 12:16:08.442 | INFO     | infra.clients:log_request:151 - [127.21.125.197:44629] POST /users/LOG_record {'id': 4001, 'msg': 'Goodbye, world!'} (node 1 (user 0))
2020-05-13 12:16:11.497 | ERROR    | __main__:wait_for_nodes:117 - Timeout error for LOG_get on node 1
2020-05-13 12:16:11.598 | INFO     | infra.clients:log_request:151 - [127.21.125.197:44629] POST /users/LOG_record {'id': 4001, 'msg': 'Goodbye, world!'} (node 1 (user 0))
2020-05-13 12:16:14.642 | ERROR    | __main__:wait_for_nodes:117 - Timeout error for LOG_get on node 1
2020-05-13 12:16:14.742 | INFO     | infra.clients:log_request:151 - [127.21.125.197:44629] POST /users/LOG_record {'id': 4001, 'msg': 'Goodbye, world!'} (node 1 (user 0))
2020-05-13 12:16:17.793 | ERROR    | __main__:wait_for_nodes:117 - Timeout error for LOG_get on node 1
2020-05-13 12:16:17.894 | INFO     | infra.clients:log_request:151 - [127.21.125.197:44629] POST /users/LOG_record {'id': 4001, 'msg': 'Goodbye, world!'} (node 1 (user 0))

Node 1 log:

2020-05-13T12:13:01.419028Z -1.767 0   [info ] /consensus/pbft/libbyz/state.cpp:475 | Rolling back to checkpoint before 131
2020-05-13T12:13:01.419043Z -1.767 0   [info ] nsensus/pbft/libbyz/replica.cpp:2147 | Rolled back in view change to seqno 131, to version 263, last_executed was 131, last_tentative_execute was 132, last gb seqno 131, last gb version was 262
2020-05-13T12:13:01.419055Z -1.767 0   [info ] nsensus/pbft/libbyz/replica.cpp:2155 | Roll back done, last tentative execute and last executed are 131 131
2020-05-13T12:13:01.423149Z -1.771 0   [info ] sensus/pbft/libbyz/view_info.cpp:312 | Sending view change view: 1
2020-05-13T12:13:04.985625Z -1.937 0   [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:13:15.534689Z -2.483 0   [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:13:18.282298Z -2.634 0   [info ] nsensus/pbft/libbyz/replica.cpp:1780 | Received view change for 1 from 0 with digest 7592741314972068999, v: 1
2020-05-13T12:13:18.282331Z -2.634 0   [info ] nsensus/pbft/libbyz/replica.cpp:1976 | Process new view: 1 min: 112 max: 133 ms: 112 last_stable: 112 last_executed: 131 last_tentative_execute: 131
2020-05-13T12:13:18.282344Z -2.634 0   [info ] nsensus/pbft/libbyz/replica.cpp:1990 | Sending new view for 1
2020-05-13T12:13:18.285115Z -2.636 0   [info ] nsensus/pbft/libbyz/replica.cpp:2100 | Done with process new view 1
2020-05-13T12:13:26.031275Z -2.951 0   [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:13:26.074341Z -2.974 0   [info ] onsensus/pbft/libbyz/replica.cpp:206 | did not verify - m:1
2020-05-13T12:13:26.074380Z -2.974 0   [info ] onsensus/pbft/libbyz/replica.cpp:206 | did not verify - m:1
2020-05-13T12:13:36.443629Z -3.352 0   [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:13:46.871051Z -3.766 0   [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:13:57.294301Z -4.187 0   [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:14:07.713243Z -4.598 0   [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:14:18.163299Z -5.041 0   [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:14:28.618344Z -5.492 0   [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:14:39.044685Z -5.911 0   [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:14:49.507623Z -6.369 0   [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:14:59.983432Z -6.838 0   [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:15:10.446348Z -7.292 0   [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:15:20.877717Z -7.718 0   [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:15:31.338319Z -8.173 0   [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:15:41.747686Z -8.574 0   [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:15:52.183689Z -9.005 0   [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:16:02.639617Z -9.459 0   [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
2020-05-13T12:16:13.058363Z -9.875 0   [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 17
achamayou commented 4 years ago

Another occurrence on the Daily build: https://dev.azure.com/MSRC-CCF/CCF/_build/results?buildId=8284&view=ms.vss-test-web.build-test-results-tab&runId=42918&resultId=100032&paneView=attachments

2020-05-13 06:17:39.309 | INFO     | infra.remote:log_errors:71 - 2 errors found, printing end of output for context:
2020-05-13 06:17:39.309 | INFO     | infra.remote:log_errors:74 - 2020-05-13T06:17:38.850899Z -1.401 0   [info ] nsensus/pbft/libbyz/replica.cpp:2147 | Rolled back in view change to seqno 180, to version 360, last_executed was 180, last_tentative_execute was 181, last gb seqno 180, last gb version was 359
2020-05-13 06:17:39.309 | INFO     | infra.remote:log_errors:74 - 2020-05-13T06:17:38.850902Z -1.401 0   [info ] nsensus/pbft/libbyz/replica.cpp:2155 | Roll back done, last tentative execute and last executed are 180 180
2020-05-13 06:17:39.309 | INFO     | infra.remote:log_errors:74 - 2020-05-13T06:17:38.854674Z -1.404 0   [info ] sensus/pbft/libbyz/view_info.cpp:312 | Sending view change view: 1
2020-05-13 06:17:39.309 | INFO     | infra.remote:log_errors:74 - 2020-05-13T06:17:38.854702Z -1.405 0   [info ] nsensus/pbft/libbyz/replica.cpp:1812 | Starting view change timer for view 1
2020-05-13 06:17:39.310 | INFO     | infra.remote:log_errors:74 - 2020-05-13T06:17:38.854783Z -1.405 0   [info ] nsensus/pbft/libbyz/replica.cpp:1780 | Received view change for 1 from 3 with digest 2111869250313449187, v: 1
2020-05-13 06:17:39.310 | INFO     | infra.remote:log_errors:74 - 2020-05-13T06:17:38.854786Z -1.405 0   [info ] nsensus/pbft/libbyz/replica.cpp:1780 | Received view change for 1 from 1 with digest 4926701758888857377, v: 1
2020-05-13 06:17:39.310 | INFO     | infra.remote:log_errors:74 - 2020-05-13T06:17:39.140280Z -1.405 0   [info ] nsensus/pbft/libbyz/replica.cpp:1822 | Received new view for 1 from 1
2020-05-13 06:17:39.310 | INFO     | infra.remote:log_errors:74 - 2020-05-13T06:17:39.140386Z -1.405 0   [info ] nsensus/pbft/libbyz/replica.cpp:1976 | Process new view: 1 min: 144 max: 182 ms: 144 last_stable: 144 last_executed: 180 last_tentative_execute: 180
2020-05-13 06:17:39.310 | INFO     | infra.remote:log_errors:74 - 2020-05-13T06:17:39.205172Z -1.470 0   [info ] nsensus/pbft/libbyz/replica.cpp:2100 | Done with process new view 1
2020-05-13 06:17:39.310 | INFO     | infra.remote:log_errors:74 - 2020-05-13T06:17:39.245182Z        100 [info ] ../src/host/sig_term.h:24            | SIGTERM: Shutting down enclave gracefully...
2020-05-13 06:17:39.310 | INFO     | infra.remote:stop:455 - [127.235.155.123] closing
2020-05-13 06:17:39.374 | ERROR    | infra.remote:log_errors:67 - /mnt/build/1/s/build/workspace/late_joiners_pbft_1/out: 2020-05-13T06:16:47.112113Z        100 [fail ] ../src/host/node_connections.h:301   | Cannot add node 3: already in use
2020-05-13 06:17:39.374 | ERROR    | infra.remote:log_errors:67 - /mnt/build/1/s/build/workspace/late_joiners_pbft_1/out: 2020-05-13T06:17:39.136272Z -3.304 0   [fail ] ../src/enclave/rpc_sessions.h:82     | Replying to unknown session 10
2020-05-13 06:17:39.374 | INFO     | infra.remote:log_errors:71 - 2 errors found, printing end of output for context:
2020-05-13 06:17:39.374 | INFO     | infra.remote:log_errors:74 - 2020-05-13T06:17:17.451925Z -2.028 0   [info ] sensus/pbft/libbyz/view_info.cpp:312 | Sending view change view: 1
2020-05-13 06:17:39.374 | INFO     | infra.remote:log_errors:74 - 2020-05-13T06:17:19.799127Z -2.121 0   [info ] nsensus/pbft/libbyz/replica.cpp:1780 | Received view change for 1 from 3 with digest 2111869250313449187, v: 1
2020-05-13 06:17:39.374 | INFO     | infra.remote:log_errors:74 - 2020-05-13T06:17:20.153305Z -2.129 0   [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 25
2020-05-13 06:17:39.375 | INFO     | infra.remote:log_errors:74 - 2020-05-13T06:17:30.610367Z -2.584 0   [info ] ensus/pbft/libbyz/client_proxy.h:495 | Retransmitting req id: 25
2020-05-13 06:17:39.375 | INFO     | infra.remote:log_errors:74 - 2020-05-13T06:17:39.136272Z -3.304 0   [fail ] ../src/enclave/rpc_sessions.h:82     | Replying to unknown session 10
2020-05-13 06:17:39.375 | INFO     | infra.remote:log_errors:74 - 2020-05-13T06:17:39.139831Z -3.279 0   [info ] nsensus/pbft/libbyz/replica.cpp:1780 | Received view change for 1 from 0 with digest 404585451443329397, v: 1
2020-05-13 06:17:39.375 | INFO     | infra.remote:log_errors:74 - 2020-05-13T06:17:39.139902Z -3.279 0   [info ] nsensus/pbft/libbyz/replica.cpp:1976 | Process new view: 1 min: 144 max: 182 ms: 144 last_stable: 144 last_executed: 179 last_tentative_execute: 179
2020-05-13 06:17:39.375 | INFO     | infra.remote:log_errors:74 - 2020-05-13T06:17:39.139905Z -3.279 0   [info ] nsensus/pbft/libbyz/replica.cpp:1990 | Sending new view for 1
2020-05-13 06:17:39.375 | INFO     | infra.remote:log_errors:74 - 2020-05-13T06:17:39.141935Z -3.281 0   [info ] nsensus/pbft/libbyz/replica.cpp:2100 | Done with process new view 1
2020-05-13 06:17:39.375 | INFO     | infra.remote:log_errors:74 - 2020-05-13T06:17:39.310655Z        100 [info ] ../src/host/sig_term.h:24            | SIGTERM: Shutting down enclave gracefully...
2020-05-13 06:17:39.375 | INFO     | infra.remote:stop:455 - [127.102.164.222] closing
2020-05-13 06:17:39.439 | INFO     | infra.ccf:stop_all_nodes:294 - All remotes stopped...
Traceback (most recent call last):
  File "/mnt/build/1/s/tests/late_joiners.py", line 270, in <module>
    run(args)
  File "/mnt/build/1/s/tests/late_joiners.py", line 238, in run
    ignore_failures=True,
  File "/mnt/build/1/s/tests/late_joiners.py", line 149, in run_requests
    wait_for_nodes(nodes, final_msg, final_msg_id)
  File "/mnt/build/1/s/tests/late_joiners.py", line 113, in wait_for_nodes
    result=True,
  File "/mnt/build/1/s/tests/infra/checker.py", line 74, in __call__
    self.node_client, rpc_result.commit, rpc_result.term
  File "/mnt/build/1/s/tests/infra/checker.py", line 39, in wait_for_global_commit
    f"Transaction ID {term}.{commit_index} is marked invalid and will never be committed"
RuntimeError: Transaction ID 2.363 is marked invalid and will never be committed
olgavrou commented 4 years ago

The

RuntimeError: Transaction ID 3.176 is marked invalid and will never be committed

error happens when we process a transaction, reply to the client via the client proxy, and then replica.cpp calls global commit for that transaction.

The global commit callback into pbft.h updates the view too late: by the time the client queries for this transaction ID again, the transaction is in the next view.

olgavrou commented 4 years ago

This is because the replicas send their RPC replies on prepare, but globally commit and call the globally_commit_cb on commit.

If, for example, a view change lands between the reply on prepare and the global commit (the full scenario is spelled out in a later comment), the transaction ID the client was given ends up in a stale view.

I believe this will be mitigated if we explicitly update Pbft's view in pbft.h when there is a view change.
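
A minimal sketch of what that could look like, using hypothetical Python stand-ins for the pbft.h hooks (the real code is C++ and these names are assumptions):

class PbftView:
    # Tracks the view that transaction IDs are stamped with.
    def __init__(self):
        self.view = 0

    def on_global_commit(self, view):
        # Current behaviour: the view is only refreshed when a transaction
        # is globally committed, which can lag behind a view change.
        self.view = view

    def on_view_change(self, new_view):
        # Proposed mitigation: refresh the view as soon as the view change
        # completes, closing the window described above.
        self.view = max(self.view, new_view)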

achamayou commented 4 years ago

Failures in both the debug virtual and SGX builds this morning:

https://dev.azure.com/MSRC-CCF/CCF/_build/results?buildId=8432&view=ms.vss-test-web.build-test-results-tab&runId=43284&resultId=100029&paneView=debug

Traceback (most recent call last):
  File "/mnt/build/1/s/tests/late_joiners.py", line 269, in <module>
    run(args)
  File "/mnt/build/1/s/tests/late_joiners.py", line 237, in run
    ignore_failures=True,
  File "/mnt/build/1/s/tests/late_joiners.py", line 140, in run_requests
    result=True,
  File "/mnt/build/1/s/tests/infra/checker.py", line 74, in __call__
    self.node_client, rpc_result.commit, rpc_result.term
  File "/mnt/build/1/s/tests/infra/checker.py", line 39, in wait_for_global_commit
    f"Transaction ID {term}.{commit_index} is marked invalid and will never be committed"
RuntimeError: Transaction ID 2.281 is marked invalid and will never be committed
2020-05-15 16:53:46.654 | INFO     | infra.ccf:stop_all_nodes:370 - All nodes stopped...
Traceback (most recent call last):
  File "/mnt/build/2/s/tests/late_joiners.py", line 269, in <module>
    run(args)
  File "/mnt/build/2/s/tests/late_joiners.py", line 237, in run
    ignore_failures=True,
  File "/mnt/build/2/s/tests/late_joiners.py", line 140, in run_requests
    result=True,
  File "/mnt/build/2/s/tests/infra/checker.py", line 74, in __call__
    self.node_client, rpc_result.commit, rpc_result.term
  File "/mnt/build/2/s/tests/infra/checker.py", line 39, in wait_for_global_commit
    f"Transaction ID {term}.{commit_index} is marked invalid and will never be committed"
RuntimeError: Transaction ID 2.305 is marked invalid and will never be committed
ashamis commented 4 years ago

This is because the replicas send their RPC replies on prepare, but globally commit and call the globally_commit_cb on commit.

If it happens, for example, that (see the sketch after this list):

  • 0 was primary, but after a view change 1 is primary
  • we issue a LOG_record RPC to node 0
  • all other replicas reply to replica 0's client proxy first
  • replica 0 prepares and replies to its client proxy
  • the client proxy has all the replies it needs, so it responds to the client with 2.x
  • replica 0 commits, calls the global commit callback, and pbft.h updates its view to 3
  • the client sends an RPC to replica 0 asking for transaction 2.x
  • node 0 now reports x as 3.x
  • INVALID
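
A toy model of that interleaving (Python, not CCF code; all names are illustrative): replies carry the view current at prepare time, but the recorded term is only fixed by the global commit callback, which may land after a view change.

class Replica:
    def __init__(self, view):
        self.view = view
        self.term_at = {}  # seqno -> term recorded at global commit

    def reply_on_prepare(self, seqno):
        # The client proxy answers as soon as enough replies arrive.
        return self.view, seqno

    def on_global_commit(self, seqno, new_view):
        # pbft.h only learns about the view change here.
        self.view = new_view
        self.term_at[seqno] = new_view

    def status(self, term, seqno):
        return "COMMITTED" if self.term_at.get(seqno) == term else "INVALID"

node0 = Replica(view=2)
term, seqno = node0.reply_on_prepare(180)  # client is told 2.180
node0.on_global_commit(seqno, new_view=3)  # view change lands in between
print(node0.status(term, seqno))           # -> INVALID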

I believe this will be mitigated if we explicitly update Pbft's view in pbft.h when there is a view change.

This bug stems from #1157

ashamis commented 4 years ago

The issues for which this issue was opened have been resolved. This test is currently stable; if it fails again, it will most likely be a different bug.