Design note:
I was thinking about how the Watch part of this resource should work. Here's one idea:
Watch should open up a connection to our local etcd store on the membership list. When we remote to a new host and "bootstrap" it, it will join the cluster. Once that's done, we'll get an event :) If it ever disconnects from the cluster, we'll get a "remove" event, and we can run CheckApply again.
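For illustration, here's a minimal sketch of what that Watch loop could look like using the etcd v3 client. The /mgmt/members/ key prefix and the event channel are hypothetical placeholders, not mgmt's actual internals:

```go
// Sketch: watch a hypothetical membership prefix in etcd and convert
// key add/remove events into resource events. The prefix name and the
// events channel are illustrative placeholders only.
package example

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func watchMembers(ctx context.Context, events chan<- string) error {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return err
	}
	defer cli.Close()

	// Watch every key under the (hypothetical) membership prefix.
	wch := cli.Watch(ctx, "/mgmt/members/", clientv3.WithPrefix())
	for resp := range wch {
		if err := resp.Err(); err != nil {
			return err
		}
		for _, ev := range resp.Events {
			switch ev.Type {
			case clientv3.EventTypePut: // host joined the cluster
				events <- "add: " + string(ev.Kv.Key)
			case clientv3.EventTypeDelete: // host disconnected
				events <- "remove: " + string(ev.Kv.Key)
			}
		}
	}
	return nil
}
```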
We might need a way to detect if a machine is happily running an mgmt instance or not, in particular so that the CheckApply portion has something easy to check. It could (1) check just by looking at etcd membership, or (2) have an option to check by logging in and running ps or something similar. There could be many ways to check.
(I'm working on this)
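As a rough sketch of option (2), here's what logging in and checking the process table over SSH might look like; the helper name and the pgrep check are illustrative assumptions:

```go
// Sketch: option (2) from above -- SSH in and check whether an mgmt
// process is running. The helper name and pgrep check are illustrative.
package example

import (
	"golang.org/x/crypto/ssh"
)

// isMgmtRunning returns true if a process named "mgmt" is found on addr.
func isMgmtRunning(addr string, config *ssh.ClientConfig) (bool, error) {
	client, err := ssh.Dial("tcp", addr, config)
	if err != nil {
		return false, err
	}
	defer client.Close()

	session, err := client.NewSession()
	if err != nil {
		return false, err
	}
	defer session.Close()

	// pgrep exits 0 if a matching process exists, 1 otherwise.
	if err := session.Run("pgrep -x mgmt"); err != nil {
		if _, ok := err.(*ssh.ExitError); ok {
			return false, nil // command ran, no process found
		}
		return false, err // connection or protocol error
	}
	return true, nil
}
```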
More notes:
Since autogrouping isn't necessary, the code can be simplified to run only one SSH connection per resource. There's no need for the built-in multiple connection support.
The semaphore feature can be removed, since we can use the mgmt core semaphore feature instead if we have multiple remote resources running in parallel.
We could either have the resource bootstrap mgmt, or have it run an alternate command.
We don't need to pass through code, because that happens automatically with the deploys feature.
Watch will have to establish the base connection, and optional tunnels for etcd if it can't connect directly. That could be an option (tunnel etcd or direct connect); see the tunnel sketch after these notes.
We must not have the CheckApply portion run indefinitely, since that would block resource execution. As a result, we need the running mgmt to daemonize the launched process somehow. A cluster-wide converged timeout can be an optional feature that happens through the Watch() connection.
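Here's a minimal sketch of the optional etcd tunnel mentioned above, built as a plain TCP forward over an existing SSH connection; addresses and the function name are illustrative:

```go
// Sketch: forward a local port to the remote host's etcd client port
// over an established SSH connection, for the case where Watch cannot
// reach etcd directly. Addresses are illustrative.
package example

import (
	"io"
	"net"

	"golang.org/x/crypto/ssh"
)

// tunnelEtcd listens on localAddr and forwards each connection to
// remoteAddr (e.g. "127.0.0.1:2379") through the SSH client.
func tunnelEtcd(client *ssh.Client, localAddr, remoteAddr string) error {
	ln, err := net.Listen("tcp", localAddr)
	if err != nil {
		return err
	}
	defer ln.Close()

	for {
		local, err := ln.Accept()
		if err != nil {
			return err
		}
		remote, err := client.Dial("tcp", remoteAddr) // dialed from the remote side
		if err != nil {
			local.Close()
			return err
		}
		go func() { // copy both directions until either side closes
			defer local.Close()
			defer remote.Close()
			go io.Copy(remote, local)
			io.Copy(local, remote)
		}()
	}
}
```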
This sounds like an interesting feature for my use case. What is the current status of this issue, and are there any ways I could contribute?
@JefMasereel Want to tell us more about your use case?
I have a bunch of this code written, and partially ported to the Res interface; however, it is missing some internal APIs for determining IP addresses, which is why I never pushed it out yet.
I don't recommend this as an early patch as it's very complex in many ways, and in particular plumbing the internal IP fixes might be challenging, so I'd suggest getting something different under your belt first before tackling this. When it is the right time, I'll try and push out a WIP branch of that code if it's not already mergeable.
Cool?
Thanks for the update. I agree that I don't have the experience yet to take over this resource, but I look forward to when I do :p
My use case is basically bootstrapping a bunch of server instances across multiple cloud providers. If we can get that to work with mgmt, that could be a great way to host cloud services without being overly dependent on one specific provider, and/or having a wider range of geographical locations to choose from.
We will help you get there soon =D Put in the work and you'll be the resource king. \o/
Background
In mgmt I implemented a core "remote execution" feature. A blog post about it is available at https://ttboj.wordpress.com/2016/10/07/remote-execution-in-mgmt/. One of my goals with this feature was to show how bootstrapping new clusters could be achieved (one of the valid uses for central orchestration, IMHO), and to show how a distributed system that operates like mgmt can have remote execution as a special case.
Problem
There are a few problems with the current implementation:
0) Some of the code is in need of a cleanup. I should have gone with a "populate struct" + Init() error model instead of the giant New() constructor which I implemented. (A minimal sketch of that pattern follows this list.)
1) It has rotted a bit since golang (which doesn't break API) broke API, and it now fails with:
remote.go:1021: Remote: Error: Remote: SSH errored with: Can't dial: ssh: must specify HostKeyCallback
This is because host key checking is now mandatory and it wasn't previously implemented. See: https://godoc.org/golang.org/x/crypto/ssh#HostKeyCallback
Patches are welcome. (A sketch of the fix also follows this list.)
2) The remote execution code is run from the lib/main.go entry point, which complicates the logic.
This can be improved. See below for a solution.
3) Dynamic, code-controlled remote execution could be more useful and powerful.
Fitting the current implementation into the language model is difficult.
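For point 0, here's a minimal sketch of the "populate struct" + Init() error pattern; the type and fields are invented for illustration:

```go
// Sketch of the "populate struct" + Init() error pattern from point 0,
// as opposed to a giant New(...) constructor. Fields are invented.
package example

import "fmt"

type Remote struct {
	Hostname string // public fields are populated directly...
	User     string
	Port     int

	client interface{} // ...private state is built up in Init()
}

// Init validates the populated struct and sets up internal state.
func (obj *Remote) Init() error {
	if obj.Hostname == "" {
		return fmt.Errorf("hostname must be specified")
	}
	if obj.Port == 0 {
		obj.Port = 22 // sensible default
	}
	// ... open connections, allocate channels, etc.
	return nil
}
```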
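For point 1, the fix is to set HostKeyCallback explicitly when building the ssh.ClientConfig. Here's a sketch using the standard known_hosts file (paths and auth method are assumptions):

```go
// Sketch of the fix for point 1: newer x/crypto/ssh requires an
// explicit HostKeyCallback. This builds one from the user's
// known_hosts file; paths and auth method are illustrative.
package example

import (
	"os"
	"path/filepath"

	"golang.org/x/crypto/ssh"
	"golang.org/x/crypto/ssh/knownhosts"
)

func clientConfig(user, password string) (*ssh.ClientConfig, error) {
	home, err := os.UserHomeDir()
	if err != nil {
		return nil, err
	}
	// Build a HostKeyCallback from the standard known_hosts file.
	cb, err := knownhosts.New(filepath.Join(home, ".ssh", "known_hosts"))
	if err != nil {
		return nil, err
	}
	return &ssh.ClientConfig{
		User:            user,
		Auth:            []ssh.AuthMethod{ssh.Password(password)},
		HostKeyCallback: cb, // this was the missing piece
	}, nil
}
```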
Solution
I realized that this could be re-implemented as a remote resource: https://github.com/purpleidea/mgmt/blob/master/docs/resource-guide.md -- The resource could take arguments as we currently do, and kick off connections when needed. This would pull it out of the main body and into a resource, which is cleaner. It would also allow all the power of the language in order to make this more dynamic.
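To make the shape concrete, here is a loose skeleton of what such a resource might look like. The method set is paraphrased from the resource guide and the notes above; the exact signatures in the current engine API will differ, and the fields are invented for illustration:

```go
// Loosely paraphrased skeleton of a hypothetical remote resource.
// Not the actual engine interface; signatures and fields will differ.
package example

type RemoteRes struct {
	Hostname string // machine to SSH into
	User     string
	Password string
	// ... bootstrap command, etcd tunnel options, etc.
}

// Watch holds the etcd membership connection open and sends an event
// whenever the remote host joins or leaves the cluster.
func (obj *RemoteRes) Watch() error {
	// connect, tunnel etcd if necessary, then block, converting
	// membership changes into events (see the etcd sketch above)
	return nil
}

// CheckApply tests whether the remote host is already bootstrapped and
// running mgmt, and if not (and apply is true) bootstraps it.
func (obj *RemoteRes) CheckApply(apply bool) (bool, error) {
	// 1. check state (etcd membership, or pgrep over SSH)
	// 2. if !apply, just report; otherwise connect, bootstrap, and
	//    daemonize the launched process so we don't block here
	return true, nil
}
```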
Bonus
As a bonus to this change, we could even implement a small GAPI frontend which creates a small graph with remote resources to emulate the current CLI mode of operation.
Conclusion
This is a bit of a tricky issue, but good for a medium golang hacker, preferably with at least medium experience with the golang API. We're happy to mentor you if you'd like to write this!
Happy Hacking!