Open domwoe opened 3 years ago
@esplinr -- can your team take a look at this? Thanks!
@esplinr I added the info on the software version we are using. If more logs or information are needed, please ask.
It'll take me a while to find someone with the right expertise to triage this and provide some direction.
Small update:
I've written a small script to remove the dangling queue entries from the node_status_db directly.
Still, it would be good to have more insight into why this happened, and I think in general it's not a good idea to allow such a large queue to build up.
After a brief investigation of the logs and validator-info, I have the following results:
InstanceChangeProvider: Discard InstanceChange from **** for ViewNo 8045 because it is out of date (was received 7200sec ago)
[...] f nodes. Also, from validator-info we can see that only 9 IC messages are received for each propagated viewNo, and that's not enough.
[...] node_status_db or other internal storage is not a good idea at all, because it may lead to unpredictable behaviour.
The main recommendation from my point of view here is:
Thanks @anikitinDSR for looking into this.
1) One pool restart, which restarted most of the nodes except 3, because they're running on Docker and don't have the node control tool.
2) A pool restart synchronised with a manual restart of 2 of the Docker nodes.
I struggle to understand your second point.
My interpretation of @anikitinDSR's comment is that the logs suggest the network is partitioned, which can happen if more than f nodes are rebooted at the same time. That could be why some nodes are seeing an older view change than the others.
My suggestion is that you rebuild just the nodes that are seeing the old view change, and let them catch up to have a consistent view with the rest of the network.
If that isn't the problem, you'll have to trace the system to figure out what exactly is going on.
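For readers following along, here is a rough sketch of how f and the number of matching votes relate to the pool size. It assumes the standard BFT bound n >= 3f + 1, and it assumes (not verified against the code in this thread) that an instance change needs n - f matching IC messages. The function names are made up for illustration.

```python
def max_faulty(n: int) -> int:
    """Largest f such that n >= 3f + 1 (standard BFT bound)."""
    return (n - 1) // 3


def assumed_ic_quorum(n: int) -> int:
    """ASSUMPTION: an instance change needs n - f matching votes."""
    return n - max_faulty(n)


# Pool sizes reported in this issue: 13 nodes in March, 15 today.
for n in (13, 15):
    print(f"n={n}: f={max_faulty(n)}, assumed quorum={assumed_ic_quorum(n)}")
```

Under these assumptions, 9 matching IC messages would have been enough for a 13-node pool but falls short of the 11 needed for 15 nodes, which would be consistent with the observation that 9 IC messages per viewNo is not enough.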
Here are the final results and suggestions after investigation and discussion. The current size of the IC_queue is not critical, but it's a big problem in general. To prevent this behaviour, I suggest the following plan:
[...] txnPoolNodeSet and inserting different IC messages can be used here.
[...] _update_vote method for each element in IC_queue could be enough.
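To illustrate the general idea of dropping out-of-date votes from every entry of the queue, here is a minimal sketch. It is not the plenum implementation and does not call _update_vote: the queue layout (view number mapped to voter name mapped to receive timestamp), the function name and the TTL value are all assumptions made up for this example.

```python
import time
from typing import Dict, Optional

# HYPOTHETICAL layout: view_no -> {voter_name: unix time the vote was received}
ICQueue = Dict[int, Dict[str, float]]


def prune_outdated_votes(ic_queue: ICQueue, ttl_sec: float,
                         now: Optional[float] = None) -> ICQueue:
    """Return a copy of the queue without votes older than ttl_sec.

    View numbers that end up with no votes are dropped entirely.
    """
    now = time.time() if now is None else now
    pruned: ICQueue = {}
    for view_no, votes in ic_queue.items():
        fresh = {voter: ts for voter, ts in votes.items() if now - ts < ttl_sec}
        if fresh:
            pruned[view_no] = fresh
    return pruned


# Example with made-up node names and timestamps.
queue = {8045: {"Node1": 0.0, "Node2": 9000.0}, 8046: {"Node3": 100.0}}
print(prune_outdated_votes(queue, ttl_sec=7200, now=10000.0))
```

Running a pass like this over the whole queue at startup, rather than only when a new vote for the same view number arrives, is one way stale entries could be kept from accumulating.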
We are experiencing an issue with a huge IC_queue. Our validator_info response has a size of around 7MB. The issue started back in March, when many view changes were triggered before a view change could complete. The view_no to be voted for increased from 8044 to 10931. Then a node voted for a view change to 8045 again, which was successful. However, every node still carries the huge IC_queue. Restarting does not help, since the IC_queue is persisted.
IC - Instance Change
I've attached (parts of) the log from view_change_trigger_service.py: extract.txt
Looking at the code it seems that this is the only point where instance change messages get removed.
Is there a way to flush the IC queue without deleting the indy node's data directory? About 10 stewards would need to do this in this case. Can we do something to prevent such a build-up of (unsuccessful) view changes?
I've reported this issue/asked the questions also on Rocketchat.
Thanks for taking the time to look into this!
Network Details
13 validator nodes in March. Today 15.
Software
There is some variation on the exact os_version among the validator nodes.