jeromy-cannon opened 1 week ago
A `node states` command will be added, similar to `node logs`, to download the state files as a zip file. Then a new flag will be added to `node start` to upload a state file before starting the network.
So it sounds like you are suggesting a new subcommand for `node`:

```sh
solo node state --namespace solo-e2e --node-alias node0 --save-dir ~/Downloads
```

(I put `state` instead of `states`.) (Also, I think `~` for the home directory is not always compatible with some library functions.) Do we need anything besides the state directory? It might be good to check with @tomzhenghedera and Alex.
Then add a new flag to `solo node setup` (I think `setup` is more appropriate than `solo node start`), perhaps `--saved-state`, which is a path to a saved state file, perhaps a zip file? A sketch of how the two commands might compose follows.
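For illustration only, assuming the flag lands on `node setup`, the end-to-end flow might look like this (the flag name, node alias, and file name are placeholders, not a final design):

```sh
# Hypothetical flow; none of these names are final.
solo node state --namespace solo-e2e --node-alias node0 --save-dir "$HOME/Downloads"
solo node setup --namespace solo-e2e --saved-state "$HOME/Downloads/node0-state.zip"
```

Using `$HOME` rather than `~` would sidestep the tilde-expansion concern mentioned above.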
Also, just be aware: for `solo node add` we have code that copies state from one node to another to initialize it, which is a requirement for adding a new node.
The perf state is large (200G+); it could be compressed into a single downloadable file, or it could be copied in its original directory form if done locally. Also, it's not clear to me, once we move to block nodes, what "a saved state" will look like.
Another unknown is whether we could actually start up from a saved state on a new deployment. We have never done so in the perf environment, given the way today's deployment pipelines work. I suspect it may not work, since the on-disk runtime directory hierarchy may not be completely available until genesis startup on a fresh deployment, which is what solo does.
With 200GB, we would probably need to further enhance our `CopyTo` function, as it may cause problems and become unreliable. We may need to break the file up into smaller chunks like torrents do, send them, then reassemble them, and possibly trigger a retry on just the single chunk that failed to transfer. A rough sketch of the idea is below.
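A minimal sketch of that chunk-and-retry idea, expressed as shell for clarity (chunk size, namespace, pod name, and paths are all assumptions; the real change would live inside `CopyTo`, and per-chunk checksums would be worth adding):

```sh
# Hypothetical sketch: split the archive, copy each chunk with per-chunk retry, reassemble in the pod.
split -b 1G state.zip state.part.                  # break the archive into 1GB chunks
for part in state.part.*; do
  until kubectl cp "$part" "solo-e2e/node0-0:/tmp/$part"; do
    echo "retrying $part" >&2                      # retry only the chunk that failed
    sleep 5
  done
done
kubectl exec -n solo-e2e node0-0 -- \
  sh -c 'cat /tmp/state.part.* > /tmp/state.zip && rm -f /tmp/state.part.*'
```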
Also, would there be a way to minimize traffic? If the nodes can't share a single drive or directory, then perhaps we could upload the file to one pod within the cluster, then copy it from one pod to the other (keeping it inside the cluster) to decrease traffic going in and out of Latitude. A sketch of that staging-pod idea follows.
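One way that could work, assuming a throwaway staging pod (all names here are made up):

```sh
# Hypothetical: upload once into the cluster, then fan out pod-to-pod so the file
# only crosses the cluster boundary a single time.
kubectl cp state.zip solo-e2e/state-staging:/tmp/state.zip
kubectl exec -n solo-e2e state-staging -- cat /tmp/state.zip \
  | kubectl exec -i -n solo-e2e node1-0 -- sh -c 'cat > /tmp/state.zip'
```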
@leninmehedy / @matteriben , perhaps you have some ideas around this also?
There should be a flag to enable uploading a single zip file, then uncompressing it. The zip file can be obtained by packing everything under the `data/saved` directory; a sketch of the packaging step follows.
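Something along these lines, assuming the state lives under the usual Hedera node working directory (the path is an assumption, not confirmed for this deployment):

```sh
# Hypothetical packaging step; the working directory is assumed, not confirmed.
cd /opt/hgcapp/services-hedera/HapiApp2.0
zip -r state.zip data/saved
```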
Today the saved states are stored in GCS buckets and copied to the nodes with gsutil at loading time. The speed is reasonable because the nodes are inside GCP. For non-GCP nodes the states are pre-loaded, both for speed and to reduce egress cost. On Latitude, @nathanklick baked the state into a base image, the idea being that it gets cached on each host. A new image needs to be baked for each release, but that is infrequent given our monthly release cycle.
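For context, the existing GCP-side flow presumably amounts to something like this (the bucket name and paths are invented for illustration):

```sh
# Hypothetical version of the current gsutil load; bucket name and paths are made up.
gsutil -m cp -r gs://example-saved-states/release-0.x/node0/data/saved \
  /opt/hgcapp/services-hedera/HapiApp2.0/data/
```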
@tomzhenghedera, yes, @nathanklick mentioned this during our team meeting. It sounds like the 200GB use case is outside our scope as long as we can load a unique container image, which I believe we can, or could with only a small amount of work. Might be good to test this soon, though.
It has been requested by Tom, our performance tester, that we be able to launch the network from a saved state. This might need more design. I'm assuming that if we have the state files, we could load them into the correct directory as part of `solo node setup`, with maybe a new flag pointing to a zip file; then upload it and extract it into the correct location (roughly as sketched below). I'm not sure what else would be needed. Perhaps @JeffreyDallas or @nathanklick could chime in?
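The upload-and-extract step could be as simple as the following, assuming the zip has already been copied to the pod and the target is the node's working directory (both assumptions, and this also assumes `unzip` exists in the image):

```sh
# Hypothetical extract step; pod name and target path are assumptions.
kubectl exec -n solo-e2e node0-0 -- \
  unzip -o /tmp/state.zip -d /opt/hgcapp/services-hedera/HapiApp2.0
```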