Closed svenski123 closed 4 years ago
@carllin - this looks like an assert that you added. Can you take a look, please?
@ryoqun, looking into this area, is there anything that has changed that may have affected this code path since we last looked at it?
Hmm, this looks very odd and scary.. is it ensured for `remove_slot` to be frozen, btw? Rather dumb guess is that others are writing accounts at the `remove_slot`...
@ryoqun actually it's quite the opposite: `remove_slot` should be guaranteed to be not frozen because it was marked dead (hence the purge).
So for some context, this is where the `remove_slot` is passed: https://github.com/solana-labs/solana/blob/master/core/src/repair_service.rs#L382-L391
Until the call to `blockstore.clear_unconfirmed_slot(*slot);`, which clears the `dead` flag from blockstore, `replay_stage` should not attempt to replay the slot. So this should mean nothing is touching the accounts for that slot through replay...
hehe I got it to repro on a 6 node testnet by cutting up `ReplayStage` to make those `InvalidTickCount` errors happen more often: https://metrics.solana.com:3000/d/monitor-edge/cluster-telemetry-edge?orgId=2&from=now-15m&to=now&panelId=35&tab=metrics&refresh=10s&var-datasource=Solana%20Metrics%20(read-only)&var-testnet=testnet-dev-carllin&var-hostid=All
Branch: https://github.com/carllin/solana/tree/Debugging
testnet config:
1) net/gce.sh create -p testnet-dev-carllin -n 10 -c 1 -z us-west1-a -d pd-ssd --dedicated
2) net/net.sh restart -c "bench-tps=1=--tx_count 5000 --thread-batch-sleep-ms 400"
@ryoqun I think I see what's happening:
1) `repair_service` calls `remove_unrooted_slot`. It finishes the call to `handle_reclaims` here: https://github.com/solana-labs/solana/blob/master/runtime/src/accounts_db.rs#L1195 and adds itself to `dead_slots`.
2) Another thread calls `process_dead_slots` and clears the `dead_slots` here: https://github.com/solana-labs/solana/blob/master/runtime/src/accounts_db.rs#L754
3) `repair_service` calls `process_dead_slots`: https://github.com/solana-labs/solana/blob/master/runtime/src/accounts_db.rs#L1199, sees an empty `dead_slots`, doesn't clear anything
4) Assertion fails
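To make the interleaving in steps 1-4 concrete, here is a minimal single-threaded sketch that replays it in order against a toy model. Everything in it (`ToyAccountsDb`, `mark_dead`, `take_dead_slots`, `clean`, `simulate_race`) is a hypothetical stand-in, not the real `accounts_db.rs` code, and it assumes the other thread has drained `dead_slots` but not yet finished its cleanup when `repair_service` reaches the assert.

```rust
use std::collections::HashSet;
use std::mem;
use std::sync::Mutex;

/// Toy model only: names are hypothetical, not the real AccountsDB API.
struct ToyAccountsDb {
    /// Slots that still have account storage.
    storage: Mutex<HashSet<u64>>,
    /// Shared set of slots waiting to be cleaned up.
    dead_slots: Mutex<HashSet<u64>>,
}

impl ToyAccountsDb {
    fn with_slot(slot: u64) -> Self {
        let mut storage = HashSet::new();
        storage.insert(slot);
        ToyAccountsDb {
            storage: Mutex::new(storage),
            dead_slots: Mutex::new(HashSet::new()),
        }
    }

    /// Stand-in for the end of `handle_reclaims`: queue the slot for cleanup.
    fn mark_dead(&self, slot: u64) {
        self.dead_slots.lock().unwrap().insert(slot);
    }

    /// Stand-in for the first half of `process_dead_slots`:
    /// drain the shared set and hand its contents to the caller.
    fn take_dead_slots(&self) -> HashSet<u64> {
        let mut dead = self.dead_slots.lock().unwrap();
        mem::take(&mut *dead)
    }

    /// Stand-in for the second half: drop storage for the drained slots.
    fn clean(&self, slots: &HashSet<u64>) {
        let mut storage = self.storage.lock().unwrap();
        for slot in slots {
            storage.remove(slot);
        }
    }

    fn has_storage(&self, slot: u64) -> bool {
        self.storage.lock().unwrap().contains(&slot)
    }
}

/// Replays the four steps in order. Returns whether the slot's storage is
/// still present (a) at the point of the assert and (b) after the other
/// thread finishes its cleanup.
fn simulate_race(slot: u64) -> (bool, bool) {
    let db = ToyAccountsDb::with_slot(slot);

    // 1) repair_service queues the slot.
    db.mark_dead(slot);
    // 2) Another thread drains the set (cleanup not finished yet).
    let taken_by_other = db.take_dead_slots();
    // 3) repair_service drains the set, finds it empty, cleans nothing.
    let taken_by_repair = db.take_dead_slots();
    db.clean(&taken_by_repair);
    // 4) The slot's storage is still there, so an assert here would fail.
    let present_at_assert = db.has_storage(slot);

    // The other thread eventually cleans the slot it took.
    db.clean(&taken_by_other);
    let present_afterwards = db.has_storage(slot);

    (present_at_assert, present_afterwards)
}
```

Calling `simulate_race(5)` in this model reports the storage as present at step 4 and gone once the other thread finishes, which matches the hypothesis that the data is removed eventually but not necessarily by the thread that checks for it.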
@carllin Thanks for debugging!!! Yeah. That hypothesis is very likely...
So, maybe we should just remove that `assert!`? Or is it required to purge any data related to `remove_slot` before `repair_service` returns from there? In that case, just wait?
> `remove_slot` should be guaranteed to be not frozen because it was marked dead (hence the purge)
I see! Thanks for the explanation!
yeah I think it's important that all data related to `remove_slot` is removed b/c when `repair_service` returns, replay of `remove_slot` can happen again. So I guess we wait?
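One way the "wait" option could look, as a rough sketch only: the caller blocks on a condition variable until the slot's storage is gone, no matter which thread performs the cleanup. `SlotStorage`, `wait_until_removed` and `purge_and_wait` are hypothetical names for illustration and are not taken from the actual fix.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Hypothetical stand-in for per-slot account storage.
struct SlotStorage {
    slots: Mutex<HashSet<u64>>,
    /// Signalled every time a slot is removed.
    removed: Condvar,
}

impl SlotStorage {
    fn new(slots: &[u64]) -> Self {
        SlotStorage {
            slots: Mutex::new(slots.iter().copied().collect()),
            removed: Condvar::new(),
        }
    }

    /// Called by whichever thread ends up cleaning the slot.
    fn remove(&self, slot: u64) {
        self.slots.lock().unwrap().remove(&slot);
        self.removed.notify_all();
    }

    /// Blocks until `slot` has no storage left.
    fn wait_until_removed(&self, slot: u64) {
        let mut slots = self.slots.lock().unwrap();
        // Re-check in a loop: wakeups can be spurious or for other slots.
        while slots.contains(&slot) {
            slots = self.removed.wait(slots).unwrap();
        }
    }

    fn contains(&self, slot: u64) -> bool {
        self.slots.lock().unwrap().contains(&slot)
    }
}

/// Lets a background thread do the cleanup while the caller waits for it.
/// Returns true if the slot is gone by the time the wait returns.
fn purge_and_wait(slot: u64) -> bool {
    let storage = Arc::new(SlotStorage::new(&[slot]));

    let cleaner = {
        let storage = Arc::clone(&storage);
        thread::spawn(move || storage.remove(slot))
    };

    // The caller (repair_service in the discussion) does not return
    // until the slot's data has actually been removed.
    storage.wait_until_removed(slot);
    let gone = !storage.contains(slot);

    cleaner.join().unwrap();
    gone
}
```

With this shape, a check that the slot is gone placed after `wait_until_removed` holds regardless of which thread drained `dead_slots`, so replay could not restart on a slot that still has stale data.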
@svenski123 Thanks for reporting in detail! We believe this is fixed on master, to be released in 1.2.3. :)
Problem
Thread 'solana-repair-service' panicked and killed this TdS validator. Log and stack trace below.
Proposed Solution
TBD