Open moshe-blox opened 1 month ago
A committee is a set of 4/7/13 operators. The Committee
struct hold all the Shares
, a share is a part of a validator private key that is managed by this committee. so for each validator there are 4 shares
in the committee, 1 for each operator
. so usually from the operator perspective, there's one share that belongs to you and you use it to send your messages and eventaully sign ethereum duties (block proposals/attestations). and you know all the rest of the shares in order to verify that the operators that are sending messages.
If a validator is removed from SSV, we want to remove his shares from the committee, there could be up to 500 validators in each committee. If all shares (all validators) are removed from this committee, we want to remove that committee.
Instances of the Committee
struct are stored in a map that is called validatormap
under the validatorController
( a struct and file that manages most of the business logic of the operator, incl starting and stopping validators and executing their duties using consensus.)
After all shares are removed we want to remove the instance of Committee
from the validatormap
.
So the actual issue is we protect with mutex the shares inside Committee
so we lock when a share
is added or removed (Committee.AddShare
, Committee.RemoveShare
), but we are not interlocking the committee removal from the validatormap
.
So we need to somehow increase the scope of the lock so if AddShare
is called before the removal of the committee from the validatormap, we end up not removing it. we need to somehow lock these actions together.
There are most likely race conditions in how we remove shares from Committee without locks, and in how we stop said committee if no shares are left while concurrently we could be adding a new share in the other goroutine which fetches metadata and discovers a newly activated validator?
Also, does the share map hold all shares or only active shares? If all shares then we likely dont stop Committee when there are 0 active but >0 inactive validators
We should rethink this entire subsystem and likely rewrite parts or all of it.