hashgraph / solo

An opinionated CLI tool to deploy and manage standalone test networks.
Apache License 2.0

Be able to launch a network from saved state #817

Open jeromy-cannon opened 1 week ago

jeromy-cannon commented 1 week ago

Tom, our performance tester, has requested that we be able to launch the network from a saved state. This might need more design. I'm assuming that if we have the state files, we could load them into the correct directory as part of solo node setup, perhaps with a new flag pointing to a zip file: upload it, then extract it into the correct location. I'm not sure what else would be needed. Perhaps @JeffreyDallas or @nathanklick could chime in?

JeffreyDallas commented 3 days ago

A node states command will be added, similar to node logs, to download the state files as a zip file.

Then a new flag will be added to node start to upload the state file before starting the network.

jeromy-cannon commented 3 days ago

So, it sounds like you are suggesting a new subcommand for node: solo node state --namespace solo-e2e --node-alias node0 --save-dir ~/Downloads (I put state instead of states). Also, I think the ~ for the home directory is not always compatible with some library functions. Do we need anything besides the state directory? It might be good to check with @tomzhenghedera and Alex.
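As an aside on the ~ caveat: the shell only expands ~ when it is unquoted, so any path that reaches the CLI as a literal "~/..." has to be expanded by the program itself, which is exactly where library path functions tend to choke. A quick illustration:

```shell
# "~" is expanded by the shell only when unquoted; a quoted or
# programmatically-constructed path is passed through literally.
echo ~/Downloads        # shell expands this before solo ever sees it
echo "~/Downloads"      # quoted: arrives literally as ~/Downloads
echo "$HOME/Downloads"  # unambiguous: always expands, safe to quote
```

Documenting $HOME/... instead of ~/... in the examples would sidestep the incompatibility without any code change in solo.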

Then add a new flag to solo node setup (I think setup is more appropriate than solo node start), perhaps --saved-state, which is a path to a saved state file, perhaps a zip file?

Also, just be aware: for solo node add we have code that copies state from one node to another to initialize it, which is a requirement for adding a new node.

tomzhenghedera commented 2 days ago

The perf state is large (200 GB+); it could be compressed into a single downloadable file, or just copied in its original directory form if done locally. Also, it's not clear to me what "a saved state" will look like once we move to block nodes.

Another unknown is whether we could actually start up from a saved state on a new deployment. We have never done so in the perf environment, given the way today's deployment pipelines work. I suspect it may not work, since the on-disk runtime directory hierarchy may not be completely available until genesis startup on a fresh deployment, which is what solo does.

jeromy-cannon commented 2 days ago

With 200 GB, we would probably need to further enhance our CopyTo function, as it may cause problems and become unreliable. We may need to break the file up into smaller pieces like torrents do, send them, then reassemble them, and then possibly trigger a retry on just the single piece that failed to transfer.
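A sketch of that piecewise idea, using split/cat locally as a stand-in. Here transfer_piece is a hypothetical placeholder for whatever copy mechanism CopyTo ends up using, and the file sizes are scaled down for illustration:

```shell
set -euo pipefail

# Stand-in for the real (200 GB) state archive, scaled down to 10 MB.
dd if=/dev/urandom of=state.zip bs=1M count=10 2>/dev/null

# Break the archive into fixed-size pieces and record an end-to-end checksum.
split -b 4M state.zip state.piece.
sha256sum state.zip > state.zip.sha256

transfer_piece() {   # hypothetical placeholder for the real copy mechanism
  cp "$1" "dest/$1"
}

# Send each piece, retrying only the piece that failed rather than the
# whole archive.
mkdir -p dest
for piece in state.piece.*; do
  for attempt in 1 2 3; do
    if transfer_piece "$piece"; then break; fi
    [ "$attempt" -lt 3 ] || { echo "giving up on $piece" >&2; exit 1; }
  done
done

# Reassemble on the far side (the glob sorts, so split's suffixes keep
# the pieces in order) and verify against the original checksum.
cat dest/state.piece.* > dest/state.zip
( cd dest && sha256sum -c ../state.zip.sha256 )
```

The per-piece checksum/retry bookkeeping is what a real implementation would add on top; the point is that a failure costs one piece, not the whole 200 GB.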

Also, would there be a way to minimize traffic? If the nodes can't share a single drive or directory, then perhaps we could upload the state to one pod within the cluster and copy it from that pod to the others (keeping it inside the cluster) to decrease traffic going in and out of the Latitude environment.

jeromy-cannon commented 2 days ago

@leninmehedy / @matteriben , perhaps you have some ideas around this also?

JeffreyDallas commented 2 days ago

There should be a flag to enable uploading a single zip file, which is then uncompressed.

The zip file can be obtained by packing everything under the data/saved directory:

(screenshot attached)
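A minimal sketch of that pack/unpack round trip. The thread proposes a zip; tar.gz is shown here only because it is universally available, and NODE_DIR is a placeholder for the node's actual working directory inside the container, not a real solo path:

```shell
NODE_DIR=${NODE_DIR:-.}          # placeholder for the node's working directory
mkdir -p "$NODE_DIR/data/saved"  # ensure the tree exists so this sketch runs standalone

# Pack everything under data/saved into a single archive.
tar -czf node0-state.tar.gz -C "$NODE_DIR" data/saved

# Receiving side: unpack into the same relative location before node start,
# so the state lands where the node expects to find it.
tar -xzf node0-state.tar.gz -C "$NODE_DIR"
```

Using -C keeps the archive paths relative (data/saved/...), so the same archive extracts cleanly no matter where the node directory lives on the target.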

tomzhenghedera commented 2 days ago

Today the saved states are stored in GCS buckets and copied to the nodes with gsutil at loading time. The speed is reasonable because the nodes are inside GCP. For non-GCP nodes the states are pre-loaded, for speed and to reduce egress cost. On Latitude, @nathanklick baked the state into a base image, the idea being that it gets cached on each host. A new image needs to be baked for each release, but that is infrequent given our monthly release cycle.

jeromy-cannon commented 1 day ago

@tomzhenghedera, yes, @nathanklick mentioned this during our team meeting. It sounds like the 200 GB use case is outside our scope as long as we can load a unique container image, which I believe we can do, or could do with only a small amount of work. Might be good to test this soon, though.