threefoldtech / tfchain

Threefold Chain.
Apache License 2.0

[validators] confirmation of correct flags and procedures #981

Open coesensbert opened 3 months ago

coesensbert commented 3 months ago

It's been a very long time since we added or removed validators on tfchain, for any net. Our docs and procedures are probably outdated. That was definitely the case for validator keys, but this is now resolved here: https://docs.grid.tf/threefold/itenv_threefold_main/src/branch/master/grid_operations/grid_tfchain#re-inserting-re-setting-session-aura-gran-keys-to-same-as-controller-account

These are some of our old docs on adding/removing validators:
https://docs.grid.tf/threefold/itenv_threefold_main/src/branch/master/kubernetes_clusters/hagrid-prod2/applications/tfchainmainnet/Adding-validators.md
https://docs.grid.tf/threefold/itenv_threefold_main/src/branch/master/kubernetes_clusters/hagrid-prod2/applications/tfchainmainnet/Removing-validators.md

This is related to:

Can dev confirm:

Mik-TF commented 1 month ago

@sameh-farouk any news on this? Do you need more info from @coesensbert?

Thanks!

sameh-farouk commented 2 weeks ago

are these procedures still valid? If not we should make new ones and test

The procedures for adding a new validator remain unchanged. However, the referenced documentation is inaccurate. Using the author_rotateKeys RPC call is a simpler alternative to generating the key with subkey generate and inserting it into the node’s keystore with key insert. Executing both sequentially is incorrect.
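For illustration, here is a minimal Python sketch of calling author_rotateKeys over JSON-RPC. It is not taken from the referenced docs. The endpoint http://127.0.0.1:9944 is an assumption (some node versions serve HTTP RPC on a different port), and the helper names are hypothetical. The call has to go to the validator node's own RPC endpoint, because the generated keys are stored in that node's keystore.

```python
import json
import urllib.request


def build_rotate_keys_request(request_id: int = 1) -> dict:
    """Build the JSON-RPC 2.0 payload for author_rotateKeys (it takes no params)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "author_rotateKeys",
        "params": [],
    }


def rotate_keys(rpc_url: str = "http://127.0.0.1:9944") -> str:
    """POST the request to the node's local RPC endpoint (assumed URL) and
    return the hex-encoded public session keys reported by the node."""
    body = json.dumps(build_rotate_keys_request()).encode("utf-8")
    request = urllib.request.Request(
        rpc_url,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        reply = json.load(response)
    if "error" in reply:
        raise RuntimeError(f"author_rotateKeys failed: {reply['error']}")
    return reply["result"]
```

The returned hex string is the public part of the new session keys. With this route there is no separate subkey generate / key insert step.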

Also, adjustments are needed where the documentation says the sudo module is required. The Council module should be used instead.

I will review the docs here and test the flow. I'll ensure it's revised and simplified, so you can update ops documentation accordingly.

are these flags correct for a validator node? -> https://github.com/threefoldtech/grid_deployment/blob/development/tfchain-validator/mainnet/docker-compose.yml#L25-L44

Here are my comments regarding the mentioned flags:

coesensbert commented 2 weeks ago

Great, once the flow is tested and the docs are updated I can continue and finish the validator for the guardian stack. Thanks for the flag suggestions, resolved: https://github.com/threefoldtech/grid_deployment/commit/e4de06b7c9ece11da35c57805e9e843995174485

  • Regarding the flags, as I previously mentioned here, there’s no need to use archive mode. Instead, use --state-pruning 1000 --blocks-pruning archive for optimal storage usage.

We use the tfchain public RPC snapshot data to speed up a validator syncing with the chain. This snapshot is generated with a node with these flags: https://github.com/threefoldtech/grid_deployment/blob/development/grid-snapshots/devnet/docker-compose.yml#L10-L45 Can we still use these snapshots if we apply the different pruning flags? https://bknd.snapshot.grid.tf/

sameh-farouk commented 2 weeks ago

Can we still use these snapshots if we apply the different pruning flags?

No, they won't be compatible. This is why I recommended building two types of snapshots: one containing the entire chain and another containing only the most recent 1000 blocks. Please note that changing state-pruning requires purging the database and syncing from scratch.
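As a rough sketch of that rule, the following Python helper compares the --state-pruning value used by a snapshot creator with the one used by the node that would restore the snapshot. The function names and the example argument lists are hypothetical, and the check covers only the rule stated above (a different state-pruning setting means the snapshot cannot be reused), not everything the node itself validates.

```python
def parse_pruning_flags(args: list[str]) -> dict[str, str]:
    """Extract --state-pruning / --blocks-pruning values from a node argument list.

    Accepts both "--flag value" and "--flag=value" spellings.
    """
    wanted = ("--state-pruning", "--blocks-pruning")
    found = {}
    for index, arg in enumerate(args):
        for flag in wanted:
            if arg == flag and index + 1 < len(args):
                found[flag] = args[index + 1]
            elif arg.startswith(flag + "="):
                found[flag] = arg.split("=", 1)[1]
    return found


def snapshot_matches_node(snapshot_args: list[str], node_args: list[str]) -> bool:
    """True only when both argument lists state the same --state-pruning value."""
    snap = parse_pruning_flags(snapshot_args).get("--state-pruning")
    node = parse_pruning_flags(node_args).get("--state-pruning")
    return snap is not None and snap == node


# Hypothetical argument lists, for illustration only.
validator = ["--validator", "--state-pruning", "1000", "--blocks-pruning", "archive"]
archive = ["--state-pruning=archive", "--blocks-pruning=archive"]
print(snapshot_matches_node(archive, validator))  # False
```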

coesensbert commented 1 week ago

Can we still use these snapshots if we apply the different pruning flags?

No, they won't be compatible. This is why I recommended building two types of snapshots: one containing the entire chain and another containing only the most recent 1000 blocks. Please note that changing state-pruning requires purging the database and syncing from scratch.

Successfully synced a devnet node from 0 with the new pruning flags. It took about 17h on an i5-12500 with NVMe SSDs. Stored data size is around 13GB, while a public RPC node has 110GB. So that's good, we can lower the storage requirements by a lot. We need to do the same for mainnet to get the size there.
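To put a number on the devnet saving, here is a quick calculation using the two approximate sizes above (13GB pruned, 110GB for the public RPC node). The function name is hypothetical, and the result is only as precise as those rounded figures.

```python
def storage_reduction(pruned_gb: float, full_gb: float) -> float:
    """Fraction of disk space saved by the pruned node relative to the full one."""
    return 1.0 - pruned_gb / full_gb


# Approximate devnet sizes quoted in this thread.
saving = storage_reduction(13, 110)
print(f"{saving:.1%}")  # 88.2%
```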

While it does seem obvious to have snapshots, this would require 4 new nodes to create them and more maintenance for ops. Since validators will only be added for mainnet, does it make sense to only have a snapshot creator for mainnet in this case?

Mik-TF commented 1 week ago

Took about 17h

Nice work! What is the bandwidth of the machine? Curious to know. Is the bottleneck at the network or the disk speed?

Since validators will only be added for mainnet, does it make sense to only have a snapshot creator for mainnet in this case?

Excellent question. I agree with you here: we can go with only a mainnet snapshot for now. It can be discussed with the team in the following days. I will let you know if I have more info on my end.

sameh-farouk commented 6 days ago

does it make sense to only have a snapshot creator for mainnet in this case?

As you already know, snapshots are primarily used to speed up the process of syncing new nodes when necessary, whether for adding new validators or migrating them to another machine. This procedure is not mandatory and is actually advised against in some cases due to security considerations.

From a development perspective, I have no advice here. It's better to check with team leads regarding the trade-offs you want to make. Time could be more precious in some instances.

But I have a question: why do we need an extra node for snapshot creation? Couldn't we just utilize one of the boot nodes for that as well?

Mik-TF commented 5 days ago

@sabrinasadik is checking this with @coesensbert in a couple of days (after September 19). This issue will then be updated.