integritee-network / worker

Integritee off-chain worker and sidechain validateer
Apache License 2.0
89 stars 46 forks source link

Sidechain sometimes produces blocks too slowly #1374

Open Kailai-Wang opened 1 year ago

Kailai-Wang commented 1 year ago

We have observed that sometimes the sidechain produces blocks with a low chance. It means most of the aura slots end up with "Discarding proposal" due to "too long block production time".

However, neither the parachain nor the sidechain has a lot of traffic - they are quite idle. We have only one tee-worker node, so nothing related inter-node communication.

I don't know how it could happen - I could only impute it to the miscalculation of aura slot, which means the worker grabs the aura slot way too late.

Maybe it's related to https://github.com/integritee-network/worker/issues/1275? Or is there anyway to further dissect the problem to find out the root cause?

[2023-06-21T10:43:36Z DEBUG enclave_runtime::top_pool_execution] ----------------------------------------
[2023-06-21T10:43:36Z DEBUG enclave_runtime::top_pool_execution] Start sidechain block production cycle
[2023-06-21T10:43:36Z DEBUG its_consensus_common::block_import_queue_worker] Imported 0 blocks in 0 ms (average of NaN ms per block)
[2023-06-21T10:43:36Z INFO  enclave_runtime::top_pool_execution] Elapsed time to process sidechain block import queue: 99 ms
[2023-06-21T10:43:36Z INFO  enclave_runtime::top_pool_execution] Remaining time in slot (id: Slot(281224036), stage Before AURA): 5132 ms, 85.53333333333333% of slot time
[2023-06-21T10:43:36Z DEBUG enclave_runtime::top_pool_execution] [Aura] Executing aura for slot: SlotInfo { slot: Slot(281224036), timestamp: 1687344216.768255775s, duration: 6s, ends_at: 1687344222s, last_imported_parentchain_head: <wasm:stripped> }
[2023-06-21T10:43:36Z INFO  aura] Claiming slot (281224036)
[2023-06-21T10:43:38Z INFO  integritee_service] Polling for worker for shard (2 seconds interval)
[2023-06-21T10:43:39Z DEBUG itc_parentchain_indirect_calls_executor] Scanning block 490169 for relevant xt
[2023-06-21T10:43:39Z INFO  itc_parentchain_block_importer::block_importer] Successfully imported parentchain block (number: 490169, hash: 0x9411…a6c7)
[2023-06-21T10:43:39Z INFO  itp_extrinsics_factory] Creating extrinsics using nonce: 366114
[2023-06-21T10:43:39Z DEBUG integritee_service::ocall_bridge::worker_on_chain_ocall] Enclave wants to send 1 extrinsics
[2023-06-21T10:43:39Z DEBUG integritee_service::ocall_bridge::worker_on_chain_ocall] Send extrinsic, call length: 356
[2023-06-21T10:43:40Z INFO  integritee_service] Polling for worker for shard (2 seconds interval)
[2023-06-21T10:43:42Z INFO  itp_node_api_metadata::event] [+] Received ProcessedParentchainBlock event
[2023-06-21T10:43:42Z INFO  itp_node_api_metadata::event] [+] Received ProcessedParentchainBlock event
[2023-06-21T10:43:42Z INFO  integritee_service] Polling for worker for shard (2 seconds interval)
[2023-06-21T10:43:42Z WARN  its_consensus_slots::slots] Proposed block slot time: 1687344222480 ms, slot start: 1687344216867 ms , slot end: 1687344222000 ms
[2023-06-21T10:43:42Z WARN  aura] ⌛️ Discarding proposal for slot 281224036, block number 363027; block production took too long
[2023-06-21T10:43:42Z INFO  aura] Did not propose a block for slot 281224036 in shard 0xfeb0568c5a8430f3f20ceb2e4459f0d6066379f393521ca29ed0f864bfeac9a4
[2023-06-21T10:43:42Z DEBUG enclave_runtime::top_pool_execution] Aura executed successfully
[2023-06-21T10:43:42Z INFO  enclave_runtime::top_pool_execution] No time remaining in slot (id: Slot(281224036), stage: After AURA)
[2023-06-21T10:43:42Z DEBUG enclave_runtime::top_pool_execution] Proposing 0 sidechain block(s) (broadcasting to peers)
[2023-06-21T10:43:42Z DEBUG integritee_service::ocall_bridge::sidechain_ocall] Enclave did not produce sidechain blocks
[2023-06-21T10:43:42Z DEBUG integritee_service::ocall_bridge::sidechain_ocall] Updating peers..
[2023-06-21T10:43:42Z INFO  integritee_service::ocall_bridge::sidechain_ocall] Successfully updated peers
[2023-06-21T10:43:42Z DEBUG integritee_service::ocall_bridge::sidechain_ocall] Broadcasting sidechain blocks ...
[2023-06-21T10:43:42Z DEBUG integritee_service::worker] No blocks to broadcast, returning
[2023-06-21T10:43:42Z INFO  integritee_service::ocall_bridge::sidechain_ocall] Successfully broadcast blocks
[2023-06-21T10:43:42Z DEBUG enclave_runtime::top_pool_execution] Sending sidechain block(s) confirmation extrinsic..
[2023-06-21T10:43:44Z INFO  integritee_service] Polling for worker for shard (2 seconds interval)
[+] Received finalized header update (490170), syncing parent chain...
[+] Found 1 block(s) to sync
Synced 490170 out of 490170 finalized parentchain blocks
[+] Found 0 block(s) to sync
clangenb commented 1 year ago

Hmm, tbh we don't have a long-running instance currently. I can't give a constructive answer here. Does the worker recover from that?

Kailai-Wang commented 1 year ago

Not really - it still produces blocks, but only once in a while.

Or to put it in another way, is there any hope that we get the aforementioned issue (https://github.com/integritee-network/worker/issues/1275) implemented soon? 🙌

clangenb commented 1 year ago

Hmm, I have one guess. Maybe it is because we don't prune the parentchain headers in the light-client-db, which are all loaded in memory. This could slowly deteriorate the performance over time.

I am currently implementing https://github.com/integritee-network/worker/issues/1370, which introduces pruning of the parentchain headers. If we are lucky, this solves your issue.

And I have to admit that we are currently not planning to implement #1275 in the near future.

Kailai-Wang commented 1 year ago

That's a good guess - we have over 385,000 sidechain blocks and the light_client.db is now 20MB. Pruning it would definitely help!

might be related in case you want to remove it too: https://github.com/integritee-network/worker/issues/1281 - we only removed the check locally, however it doesn't prevent it from growing as the parentchain extrinsic is being added constantly

Kailai-Wang commented 1 year ago

I believe the majority of the data in light_client.bin are the hashes of extrinsics sent to parentchain. They accumulate without being pruned.

clangenb commented 1 year ago

Seems, we are on the right track! :D Then the issue should be solved by next week or the week after on our end. 👍

Kailai-Wang commented 1 year ago

Any update on this? :)

clangenb commented 1 year ago

I did have to tend to some unforeseen things regarding our slot swap, which have occupied me until yesterday, and next week I am off. But the week after, I can definitely finish. ETA of remaining working hours ~8-10h.

clangenb commented 1 year ago

I might not have fully resolved this issue in #1399; however, there is a follow-up issue in #1404, which I tend to tackle as soon as I have some feedback on my suggestion, which should resolve this issue..