paritytech / zombienet-sdk

ZombieNet SDK
https://paritytech.github.io/zombienet-sdk/zombienet_sdk/
GNU General Public License v3.0
28 stars 8 forks source link

[WIP] Allow to respawn node with different set of arguments #190

Open s0me0ne-unkn0wn opened 4 months ago

s0me0ne-unkn0wn commented 4 months ago

I made these changes while researching a quite specific use case. Not all of them may be useful for the general public, but I believe some of them may.

Everything is implemented for the native provider only at the moment. I currently lack the expertise to implement it for Kubernetes.

It also introduces changes in publicly exposed interfaces, so that should be considered a breaking change.

An approximate list of what's been done and why:

Thoughts and ideas:

I'll add a usage example a bit later. Feel free to ask questions, provide suggestions, add your commits, etc.

s0me0ne-unkn0wn commented 4 months ago

Usage example: https://github.com/s0me0ne-unkn0wn/peerid-change-test/blob/master/src/main.rs

pepoviola commented 4 months ago

Hi @s0me0ne-unkn0wn, thanks for your feedback and contribution!!

Related to individual changes:

Got rid of nodes_by_name hashmap in the orchestrator::Network. It didn't make sense from the POV of performance (as a zombienet is unlikely to be of size exceeding a couple of hundreds of nodes), but cloning structures around was effectively sealing the network at the moment it was spawned, not allowing to change it dynamically;

Fixed a bug which resulted in double-adding nodes to network::network::Relaychain and network::network::Parachain when a new validator or collator is added;

This looks great!

Introduced a call allowing to kill a node without restarting it;

Did you think make sense to expose this method? Looks like will be always using followed by a re-spawn call since only kill the node will leave the node in the network but the process/pod inaccessible (maybe we can add a state flag). Maybe we can call it internally in the respawn method.

Introduced a call allowing to re-spawn a previously killed node, altering its spec between the kill and the re-spawn;

I think this is a great, we made some hack in v1 to allow to restart the node with a modified cmd and this is a more clean approach. We can check which other fields make sense to allow to modify in the spec to cover other use-cases. Also, one thing to notice is that key/peer_id is pretty much bounded to the node name by default and if we allow to change those we are breaking that pattern. I think that will not cause issues in most of the cases but we should document so users know that naming/key/peer_id can be changed.

Implemented means of dynamically replacing the node inside a network.

This looks great!!

Thoughts and ideas:

Narrowing down the use case to changing node command line arguments during the restart is a decision dictated by my specific research. Probably, it may be generalized to like "change everything you want while the node is dead", but the scope of the allowed changes has to be discussed;

Sounds good, command, subcommand, args and env are other good candidates to evaluate.

The overall approach is somewhat brittle and not that foolproof. It would be better to have something like pub async fn restart_with(&mut self, mutate: impl FnMut(&mut NetworkNode)) that would kill the node, run the closure to alter its configuration, and then re-spawn the node, but that gives less leverage to the developer. This needs to be discussed as well.

Yes, restart_with sounds good but as you said we should think what is the better ux for the developers.

Again, thanks for your feedback and contribution!! If you have time I will ping you to talks about this and implement the needed changes to support this feature also in kubernetes.

Thx!

-- Side note, I think the wait_for_metric fn from your example could be also a good candidate to implement and very useful for developers.

s0me0ne-unkn0wn commented 4 months ago

If you have time I will ping you to talks about this

Sure! It's a great tool we're actively using so I'm interested in making it better :)