pendulum-chain / spacewalk


[DO NOT MERGE] Test issue on Linux due to update. #540

Closed (gianfra-t closed this 1 month ago)

gianfra-t commented 2 months ago

This PR is meant to discuss the issue found on some tests in Linux. Not intended to be merged, unless of course a fix can be found.

The affected tests are test_register_vault and test_subxt_processing_events_after_dispatch_error. With the extra log level enabled, it is evident that the transactions get stuck in the node.

gianfra-t commented 2 months ago

After adding logs to the relevant crates for transaction handling, sc-transaction-pool and sc-basic-authorship, we can finally see what is going on with the transactions that get stuck and are never finalized.

The key piece of logging is the following:

[2024-06-25T16:33:55Z DEBUG sc_basic_authorship::basic_authorship] Attempting to push transactions from the pool.
[2024-06-25T16:33:55Z DEBUG sc_basic_authorship::basic_authorship] Pool status: PoolStatus { ready: 2, ready_bytes: 312, future: 0, future_bytes: 0 }
[2024-06-25T16:33:55Z DEBUG sc_basic_authorship::basic_authorship] Consensus deadline reached when pushing block transactions, proceeding with proposing.

This happens in the authorship module, where extrinsics are applied and added to the block. The line Consensus deadline reached when pushing block transactions... comes from here, and sometimes it appears before even a single transaction has been included!

For reference, when transactions are included we see:

[2024-06-25T16:37:06Z DEBUG sc_basic_authorship::basic_authorship] Attempting to push transactions from the pool.
[2024-06-25T16:37:06Z DEBUG sc_basic_authorship::basic_authorship] Pool status: PoolStatus { ready: 2, ready_bytes: 312, future: 0, future_bytes: 0 }
[2024-06-25T16:37:06Z TRACE sc_basic_authorship::basic_authorship] [0xbe9527d4bd813a710dcff0516263960217fb74af40ced2a7d70d58e8875eaa29] Pushing to the block.
[2024-06-25T16:37:08Z DEBUG sc_basic_authorship::basic_authorship] [0xbe9527d4bd813a710dcff0516263960217fb74af40ced2a7d70d58e8875eaa29] Pushed to the block.
[2024-06-25T16:37:08Z DEBUG sc_basic_authorship::basic_authorship] Consensus deadline reached when pushing block transactions, proceeding with proposing.

In that second stream, 1 transaction got included while the other one did not.

How long is the deadline?

We define the authorship for the manual seal here and the proposer factory here, both of which are relevant for proposing blocks via the manual_seal functionality.

The run_manual_seal function will attempt to seal a new block, which calls propose on the proposer factory; the important bit is that the deadline is a constant set to 10 seconds.
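
Roughly, that part of the sealing flow looks like the following (paraphrased from the sc-consensus-manual-seal sources as I read them, not copied verbatim):

// sc-consensus-manual-seal, seal_block (paraphrased):
// the maximum proposal duration is a hard-coded constant, not something we configure.
const MAX_PROPOSAL_DURATION: u64 = 10; // seconds

// when a SealNewBlock command is handled, the proposer built by the proposer
// factory is asked to build a block with that constant as the maximum duration:
let proposal = proposer
    .propose(
        inherent_data.clone(),
        digest,
        Duration::from_secs(MAX_PROPOSAL_DURATION),
        block_size_limit,
    )
    .await?;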

Internally, propose calls propose_with with a deadline computed by the following formula, which should still leave several seconds for transactions to be applied. Further down the line, propose_with calls apply_extrinsics, which should apply as many ready transactions from the pool as possible. That is where we see the deadline being hit.
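
This is roughly how the deadline is derived and then checked in sc-basic-authorship (again paraphrased, not the exact upstream code):

// sc-basic-authorship, Proposer::propose (paraphrased):
// one third of max_duration is reserved for block evaluation and finalization,
// so with the 10 s constant above roughly 6-7 s remain for pushing extrinsics.
let deadline = (self.now)() + max_duration - max_duration / 3;
let proposal = self
    .propose_with(inherent_data, inherent_digests, deadline, block_size_limit)
    .await;

// apply_extrinsics (paraphrased): the deadline is checked BEFORE each ready
// transaction is pushed, which matches the logs where a transaction is fetched
// from the pool but its execution is never even attempted.
while let Some(pending_tx) = pending_iterator.next() {
    let now = (self.now)();
    if now > deadline {
        debug!("Consensus deadline reached when pushing block transactions, proceeding with proposing.");
        break;
    }
    // ... push pending_tx into the block builder ...
}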

The question is: why is this deadline being hit so quickly? This would also explain, in theory, why it fails only "sometimes", since it depends on the execution time between those processes, which is system dependent.

TorstenStueber commented 2 months ago

@gianfra-t What is the purpose of proposing a transaction?

gianfra-t commented 2 months ago

What do you mean @TorstenStueber? The functions propose and propose_with? As I understand it, the term "proposal" refers to a proposed block, not a transaction. Internally, it attempts to apply (execute) as many transactions as possible, limited by this apparent 6-second window.

TorstenStueber commented 2 months ago

I am not familiar with the code, but this seems to use manual seal on a chain that runs locally for testing purposes? Is it important to put multiple transactions into a block? Can we not just submit transactions the classical way and avoid running into this problem with the Proposer (I say this with very superficial knowledge)?

ebma commented 2 months ago

Can we not just submit transactions the classical way

What's the classical way for you? Authoring blocks every 6 to 12 seconds without manual seal?

The question is: why is this deadline being hit so quickly? This would also explain, in theory, why it fails only "sometimes", since it depends on the execution time between those processes, which is system dependent.

But we do know that it never fails for the integration tests, and it also fails if you run just one of the two tests you mentioned in the description in isolation. So I doubt that it's due to resource consumption or processing power of the node. We also know that it doesn't happen on macOS. What if something goes wrong when comparing the current time to the deadline? This wouldn't explain why it works for the integration tests though. Since we can't touch the code of the transaction-pool and authorship crates (without forking), @gianfra-t have you tried comparing our configuration for the manual seal with e.g. the one Interlay uses? Maybe there is a discrepancy and we can solve the issue by changing some of the parameters.
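
For anyone comparing setups, the manual-seal wiring is roughly the following (a sketch of the usual sc-consensus-manual-seal setup, not necessarily our exact node code); the pieces that usually differ between projects are the commands_stream, the consensus_data_provider and the create_inherent_data_providers closure:

// Sketch of a typical manual-seal setup (not the exact spacewalk node code).
use sc_consensus_manual_seal::{run_manual_seal, ManualSealParams};

// the sending side can be kept around (e.g. exposed over RPC or to tests)
// to trigger sealing on demand.
let (command_sink, commands_stream) = futures::channel::mpsc::channel(1024);

let params = ManualSealParams {
    block_import,                  // wraps the client
    env: proposer_factory,         // sc_basic_authorship::ProposerFactory
    client: client.clone(),
    pool: transaction_pool.clone(),
    commands_stream,               // where SealNewBlock commands come from
    select_chain,
    consensus_data_provider: None, // some nodes plug in e.g. an Aura provider here
    create_inherent_data_providers: move |_parent, ()| async move {
        // typically just a timestamp provider; see the discussion further down
        Ok(sp_timestamp::InherentDataProvider::from_system_time())
    },
};

task_manager
    .spawn_essential_handle()
    .spawn_blocking("manual-seal", None, run_manual_seal(params));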

gianfra-t commented 2 months ago

I do not know why this doesn't also happen in the integration tests. If they use the exact same manual seal configuration, then it may be that, since the chain is producing blocks all the time, the transactions get a chance to be re-included in a later block. In these failing tests there is no new block creation after 1 or 2 blocks. Although I tried creating new blocks as a potential fix for this, and that didn't work either...

I will try to compare some manual seal configurations, because there is a good chance that changing them can solve this. Ideally, I would like to delay the block proposing process.

Because even with what @TorstenStueber said about sending only one transaction, the deadline is sometimes hit even after the inclusion of that single one!

In the latest CI logs of this PR we can see that the first log of the relevant times (now and deadline), which corresponds to the start of the sealing process, happens way too early, long before the applying process takes place.

ebma commented 2 months ago

If they use the exact same manual seal configuration, then it may be that, since the chain is producing blocks all the time, the transactions get a chance to be re-included in a later block.

But in the integration tests we are using manual seal as well, so blocks are not generated all the time either. For every test we launch a separate chain so that the tests don't influence each other and each of them has a 'clean environment'.

Correct me if I'm wrong or misunderstanding the problem, but the authorship logic rejects including the transaction in the block because it probably thinks that the transaction is too large and would require more processing time than is available in the block (?)

gianfra-t commented 2 months ago

Yes, but in the integration tests transactions are being sent asynchronously all the time, and every time a transaction enters the pool the node (with manual seal) will start another block; otherwise it won't. Or are you saying that every integration test also waits for the transaction to be finalized? I do not know enough about those tests.

And regarding the rejection: what I understand from the logs and the corresponding line where the deadline is checked is that the transaction gets fetched from the pool but its execution is not even attempted. Even the estimation happens later.

And Interlay, for instance, seems to have the same config :/ (that would be create_empty and finalize set to true).
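
For context, those two flags live on the seal command that the node emits whenever a transaction enters the pool. Roughly, the pattern looks like this (a sketch of the usual setup, not the exact spacewalk or Interlay code):

// Sketch of the "seal a block on every pool import" pattern (not the exact node code).
use futures::StreamExt;
use sc_consensus_manual_seal::EngineCommand;

// every transaction imported into the pool is turned into a SealNewBlock command
let commands_stream = transaction_pool
    .import_notification_stream()
    .map(|_| EngineCommand::SealNewBlock {
        create_empty: true, // also allow sealing blocks without extrinsics
        finalize: true,     // finalize each sealed block immediately
        parent_hash: None,  // build on top of the current best block
        sender: None,       // nobody waits for the seal result
    });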

ebma commented 2 months ago

Hmm you are right, the size of the transaction couldn't be it. Thanks for sharing the links.

This configuration is different though, maybe it makes a difference?

gianfra-t commented 2 months ago

Mmmm, interesting. What changes is the inherent data provider. I think inherents are applied before the extrinsics, but if ours takes a long time it could make the difference. Worth a shot!
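
To illustrate the kind of difference in question, the create_inherent_data_providers closure could look like either of these (a sketch only; the actual providers used by our node and by Interlay may differ):

// Two common timestamp inherent providers for a manual-seal node (sketch only;
// _parent is the parent block hash, its concrete type depends on the chain).

// Variant A: take the timestamp from the system clock on every block.
let create_inherent_data_providers = move |_parent, ()| async move {
    Ok(sp_timestamp::InherentDataProvider::from_system_time())
};

// Variant B: use a fixed/mocked timestamp, so block production does not depend
// on wall-clock time at all.
let create_inherent_data_providers = move |_parent, ()| async move {
    Ok(sp_timestamp::InherentDataProvider::new(sp_timestamp::Timestamp::new(0)))
};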

gianfra-t commented 1 month ago

I wanted to write a summary of the findings from the extra logs. In the last attempt to solve this issue I modified the manual seal definition so that it does not perform any sealing at all unless explicitly triggered. Therefore, in the tests, we need to trigger this seal periodically, which will also finalize the blocks.

All of this is an attempt to solve the overall issue described here, the rationale being that the sealing process (and therefore the initial setting of the deadline) is done AFTER the extrinsics have been included. Yet this does not solve the issue.
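
For reference, triggering the seal from a test looks roughly like this (a sketch of the command-sink approach with hypothetical names such as seal_and_finalize_block; not the exact test code):

// Sketch: the node keeps the sending side of the EngineCommand channel
// (the command_sink from the setup sketched earlier), and the test drives sealing with it.
use futures::SinkExt;
use sc_consensus_manual_seal::EngineCommand;

async fn seal_and_finalize_block<Hash>(
    command_sink: &mut futures::channel::mpsc::Sender<EngineCommand<Hash>>,
) {
    command_sink
        .send(EngineCommand::SealNewBlock {
            create_empty: true,
            finalize: true,
            parent_hash: None,
            sender: None, // a oneshot sender could be passed here to await the seal result
        })
        .await
        .expect("manual-seal task should be running");
}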

Exploring the logs of the corresponding CI test we can see the following:

I have not been able to understand which operation inside the node consumes all this time, nor whether any configuration we define is to blame.