paritytech / substrate

Substrate: The platform for blockchain innovators

NotInFinalizedChain: trying to revert the same finalized block hash #14347

Open liuchengxu opened 1 year ago

liuchengxu commented 1 year ago

Description of bug

The error message says it attempted to revert a finalized block which is not in the same chain as the last finalized block, yet the two hashes in the message are identical, which does not make sense to me. Let me know if I can provide more info.

2023-06-12 11:06:08 [SystemDomain] Safety violation: attempted to revert finalized block HashAndNumber { number: 1056559, hash: 0x804b72fd72d32dab576d77dc311b2074a9a01d482bb3535cd13380d2a0cf5600 } which is not in the same chain as last finalized 0x804b72fd72d32dab576d77dc311b2074a9a01d482bb3535cd13380d2a0cf5600
2023-06-12 11:06:08 [SystemDomain] Failed to process primary block on startup error=NotInFinalizedChain
2023-06-12 11:06:08 [SystemDomain] Essential task `system-executor-worker` failed. Shutting down service.
2023-06-12 11:06:08 [SystemDomain] Gossip engine has terminated.
2023-06-12 11:06:08 [SystemDomain] Essential task `system-domain-gossip` failed. Shutting down service.
2023-06-12 11:06:09 [CoreDomain] Starting relayer for domain: DomainId(3)
2023-06-12 11:06:09 [CoreDomain] [apply_extrinsic] after: 0x829eb43a62592d85e0a1dd3d8f2cb1e2cb11736bf80a5c5b0b4263d4a5c8b114
2023-06-12 11:06:09 [CoreDomain] Safety violation: attempted to revert finalized block HashAndNumber { number: 1056559, hash: 0x2e5818cb49b280f6c50d5de86b241b366998ab2521bd54035a7eb367d4fdd055 } which is not in the same chain as last finalized 0x2e5818cb49b280f6c50d5de86b241b366998ab2521bd54035a7eb367d4fdd055
2023-06-12 11:06:09 [CoreDomain] Failed to process primary block on startup error=NotInFinalizedChain
2023-06-12 11:06:09 [CoreDomain] Essential task `core-executor-worker` failed. Shutting down service.
Error: SubstrateService(Other("Essential task failed."))

BTW, we finalize blocks at K-depth: when a new block at height N is imported, we attempt to finalize the block at height N - 256 (K = 256).
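
Roughly, the finalization logic looks like this (a simplified, self-contained sketch with hypothetical names, not the actual worker code):

// Hypothetical sketch of the K-depth finalization described above.
const K: u32 = 256;

trait FinalizerApi {
    type Hash;
    /// Hash of the block at `number` on the chain of the imported block, if known.
    fn hash_at(&self, number: u32) -> Option<Self::Hash>;
    /// Finalize the block with the given hash.
    fn finalize(&self, hash: Self::Hash);
}

fn on_block_imported<C: FinalizerApi>(client: &C, imported_number: u32) {
    // When a block at height N is imported, try to finalize height N - K.
    if let Some(target) = imported_number.checked_sub(K) {
        if let Some(hash) = client.hash_at(target) {
            client.finalize(hash);
        }
    }
}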

Steps to reproduce

This happens on the Subspace gemini-3d testnet. I guess running fully-synced domain nodes on gemini-3d may reproduce it, but that's not trivial as the testnet is already past 1M blocks :(

bkchr commented 1 year ago

Do you have forks? Could it be that you synced some other fork? Say you last finalized block 512 - 256 and now try to finalize 511 - 256 because you imported some fork.

In general you could also just add a simple check:

// Only attempt to finalize when the target is above the current finalized height.
let new_finalized = block_number - 256;
if new_finalized > client.info().finalized_number {
    finalize(new_finalized);
}
bkchr commented 1 year ago

In general this looks like a bug in your code and not in the Substrate code, as long as you cannot convince me of the opposite ;)

liuchengxu commented 1 year ago

Thanks @bkchr for the suggestion. Checking new_finalized_block_number > last_finalized_block_number is helpful.

However, I think there is still a chance to make Substrate more robust here, as the error message it raises is confusing.

I reproduced this issue locally. To illustrate, take K = 3 and assume blocks are produced in the order 8 -> 9 -> 10a -> 10b -> 11a; block 8 is now finalized.

         10a
       /
8 -- 9 
       \
         10b -- 11a

Now a new block 11b is produced on top of 10a and the chain reorgs from 11a to 11b:

         10a -- 11b
       /
8 -- 9 
       \
         10b -- 11a

The common block is 9 and the enacted blocks are [10a, 11b]. When 10a is imported on top of 9 again, it attempts to finalize block 7 (_assuming we don't check new_finalized_block_number > last_finalized_block_number_):

         10a -- 11b
       /
7 -- 8 -- 9 
       \
         10b -- 11a

https://github.com/paritytech/substrate/blob/689da495a0c0c0c2466fe90a9ea187ce56760f2d/client/service/src/client/client.rs#L926-L937

Here last_finalized is 8 and block is 7, so the computed route_from_finalized has an empty enacted list and the retracted list [8], indicating that block 7 is already part of the finalized chain. That being said, we could safely make this a no-op when enacted is empty and emit a warning instead of returning an error.
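
To make the condition concrete, the same check can be written as a standalone helper on top of sp_blockchain::tree_route (the helper name is hypothetical); the diff below applies the same idea in place:

use sp_blockchain::HeaderMetadata;
use sp_runtime::traits::Block as BlockT;

/// Returns true when `target` is an ancestor of (or equal to) `last_finalized`:
/// the tree route from the last finalized block to the target enacts nothing,
/// so the target is already part of the finalized chain.
fn is_already_finalized<Block, B>(
    blockchain: &B,
    last_finalized: Block::Hash,
    target: Block::Hash,
) -> Result<bool, B::Error>
where
    Block: BlockT,
    B: HeaderMetadata<Block>,
{
    let route = sp_blockchain::tree_route(blockchain, last_finalized, target)?;
    Ok(route.enacted().is_empty())
}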

diff --git a/client/service/src/client/client.rs b/client/service/src/client/client.rs
index 91c59cdbac..ae1f0262cf 100644
--- a/client/service/src/client/client.rs
+++ b/client/service/src/client/client.rs
@@ -927,6 +927,14 @@ where
                        sp_blockchain::tree_route(self.backend.blockchain(), last_finalized, block)?;

                if let Some(retracted) = route_from_finalized.retracted().get(0) {
+                       if route_from_finalized.enacted().is_empty() {
+                               warn!(
+                                       "Attempted to re-finalize an older finalized block {block:?}, \
+                                       last_finalized: {last_finalized}",
+                               );
+                               return Ok(())
+                       }
+
                        warn!(
                                "Safety violation: attempted to revert finalized block {:?} which is not in the \
                                same chain as last finalized {:?}",

I can send a PR if it makes sense.

bkchr commented 1 year ago

Improving the error message is fine. However, I'm not sure about not returning an error. I see your reasoning, but we print the message because someone misused the API. So, if we print a message, shouldn't we also return an error?

CC @davxy @andresilva

davxy commented 1 year ago

For sure the error message could be improved, as it is not super clear. Maybe something like: "Attempting to finalize block {hash} {number} which is not in the same chain as last finalized {last-fin-hash} {last-fin-number}".

That said, in theory calling this function on something that has already been finalized (i.e. an ancestor of info.finalized_hash, or in other words route_from_finalized.enacted().is_empty() == true) should be fine. We already do something similar just above (here) when we re-finalize the last finalized block.

BUT there are some assumptions around this, and the modification may have undesired side effects at the moment. All code paths should be analyzed.

For example, here we set this block as new_best (if import_existing is true). Even if this has no consequences in the backend (to be checked), we end up with ImportSummary::is_new_best == true here. This may or may not have consequences; the code path should be carefully analyzed.

(Actually your code doesn't call apply_finality_with_block_hash through this path but via finalize_block; still, the reasoning above applies.)

In conclusion, I think an innocent tweak like this is correct but may have repercussions somewhere else, so it should eventually be carefully checked.

liuchengxu commented 1 year ago

Not sure how long it will take to land this tweak. In order to help the community not get trapped here, we could at least improve the docs of finalize_block to mention that the caller is responsible for not re-finalizing an older finalized block, even if doing so is intuitively harmless.
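
Something along these lines, for example (the trait and signature below are only an illustrative sketch, not the actual sc-client-api definition):

use sp_runtime::traits::Block as BlockT;

/// Illustrative stand-in for the client-side finalization API.
trait FinalizeBlockDocsSketch<Block: BlockT> {
    /// Finalize a block.
    ///
    /// Note: the caller is responsible for passing a descendant of the last
    /// finalized block. Re-finalizing an older, already-finalized block
    /// currently fails with `Error::NotInFinalizedChain`, even though it is
    /// intuitively harmless.
    fn finalize_block(&self, hash: Block::Hash) -> sp_blockchain::Result<()>;
}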

davxy commented 1 year ago

I had a better look, and apply_finality_with_block_hash is called from only 2 places:

  1. import_block (when the import params have finalized = true and import_existing = true)
  2. finalize_block

If we return Ok in case route_from_finalized.enacted().is_empty() (in other words, when finalizing something already finalized):

In both cases we don't push anything new into the ClientImportOperation, so per se it is a no-op.

BUT

In case 1, even if the call is successful, it will then fail in the backend when we try to commit the operation (Error::NotInFinalizedChain => "Potential long-range attack: block not in finalized chain").

In case 2 the call ends up being successful and looks harmless. But, as I said, we have to keep an eye on the places where this call is performed so as not to break some assumption. For example, in grandpa this may have consequences that should eventually be analyzed: https://github.com/paritytech/substrate/blob/297b3948f4a0f7f6504d4b654e16cb5d9201e523/client/consensus/grandpa/src/environment.rs#L1422-L1423 (e.g. update_best_justification just below doesn't seem to perform any check, and the previous best justification may end up being overwritten).

@andresilva any thoughts on this and the proposal of not failing?