Open mlanting opened 3 years ago
This pull request has been mentioned on ROS Discourse. There might be relevant details there:
https://discourse.ros.org/t/ros2-tooling-wg-next-meeting/12545/36
Some high-level comments: while some aspects of the document are specific to remote execution (`ExecuteRemote`, `RemoteSubstitution`), several other parts are applicable to a local launch invocation too: `ProcessStatus`, `QueryStatus`, `Start/StopProcess`, `Shutdown`. It might be good to separate these to make each individual part easier to design, implement, and review.
Noting that we should call out explicitly the expectation that the LaunchServiceNode must be accessible within the current discoverable DDS network. The machine may be WAN-accessible via SSH but that case won't work because we use topics/services for all controls.
When talking about multi-machine message passing, how does time sync come into play?
"Out of Scope" section in design is useful.
How do we support Mac + Windows? Do we think the SSH process will work out of the box for those platforms?
This pull request has been mentioned on ROS Discourse. There might be relevant details there:
https://discourse.ros.org/t/ros-2-tsc-meeting-agenda-2020-09-17/16464/1
Hi everyone! As a ROS 2 developer, I really miss this feature. Are there any timelines for it? I see that it's only a design doc, so it's hard to say, but maybe it was included in some roadmap?
So far there is not a roadmap that I know of. We discussed the design in one of the Tooling Working Group meetings, which is where my above comments came from, but I think the next step is just an updated split design - we identified that this is really 2 separate features that can be independently designed and implemented.
@mlanting should be able to clarify if any proof-of-concept code exists already, I can't remember.
@AlexKaravaev, @pjreed made a simpler, temporary implementation of multi-machine launching that you can find here: https://github.com/pjreed/launch
I will try to get the next version of the design document out in the next week or so. @roger-strain is currently working on a refactor of the launch system. Once that is complete and this design is approved, we can begin implementation. I expect it will be a few months before we have anything implemented, but pj's implementation should cover basic use cases.
Just FYI, my changes in the `launch` repository (on the `multi-machine-launch` branch) basically just add in some changes necessary to abstract out the launching mechanism; I have another repository at https://github.com/pjreed/ssh_machine which adds an SSH-based launcher that can be used to launch nodes on remote machines via SSH. The functionality is similar to how ROS1's remote launching mechanism works, but there are a few limitations to it; there's some more documentation and an example in that `ssh_machine` repo.
@mlanting @pjreed wow, thanks! I will definitely check it out, because I was thinking of implementing a similar SSH-based launcher. I understand the limitations, and there is certainly a need for native multi-machine launch support in ROS 2, but this repo can still solve some current problems.
@mlanting et al.: this could have been discussed in the previous PR, so if it has, please just RTFM me, but has there been any thought about using existing orchestration frameworks for deployment, configuration and orchestration of multi-machine ROS applications?
I'm not immediately thinking of something like k8s, but perhaps something similar might exist, which is a little less 'heavy' but supports like operations ("infrastructure as code" and all that).
Ian Sherman in his keynote at ROSCon18 posited that "backend distributed systems and IoT" (as he phrased it) might have solutions to challenges we still see as problems. Deploying, starting and coordinating a multi-machine distributed application seems like it could be something they might have solutions or best practices for. It does seem like they might have similar requirements (ie: liveness tracking, monitoring, pushing updates, distributed configuration, failover, etc).
@AlexKaravaev: I haven't polished up my SSH-based launcher mostly because it was intended to be a stopgap solution that just implemented what we needed until a more robust solution is available, and I feel like some of the limitations (most significantly, the executable paths to nodes being resolved on the local computer before the command is sent to the remote computer) make it inappropriate for merging into the official ROS2 ecosystem at this time. Still, hopefully it works for you; let me know if you have any issues, and if it doesn't take too much work to make it more robust, it might make sense to get it into ROS2 proper until we've got a better system.
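To make the path-resolution limitation concrete, here is a minimal sketch and not the actual `ssh_machine` code. The function names and the `/opt/ros/foxy` prefix are illustrative assumptions; the only real interfaces relied on are the usual `<prefix>/lib/<package>/<executable>` install layout and the `ros2 run` CLI. It also assumes the remote shell already has a ROS 2 environment sourced.

```python
from typing import List


def command_resolved_locally(local_prefix: str, package: str, executable: str) -> List[str]:
    """Build the command from a path computed on the local machine.

    The resulting absolute path is only valid on the remote machine if it
    happens to have the same install layout as the local one.
    """
    return [f"{local_prefix}/lib/{package}/{executable}"]


def command_resolved_remotely(package: str, executable: str) -> List[str]:
    """Send only the package and executable names.

    The remote machine's own environment then locates the executable.
    """
    return ["ros2", "run", package, executable]


if __name__ == "__main__":
    # Hypothetical local install prefix, used purely for illustration.
    print(command_resolved_locally("/opt/ros/foxy", "demo_nodes_cpp", "talker"))
    print(command_resolved_remotely("demo_nodes_cpp", "talker"))
```

The first form breaks as soon as the two machines install ROS 2 under different prefixes; the second defers that lookup to the machine that will actually run the node.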
@gavanderhoorn: When I made my `ssh_machine` launcher, actually, one of the considerations I had in mind is that it would be good if the remote launching mechanism wasn't tied to any particular implementation; in theory you could also write, say, a `K8sMachine` class that extends the `launch.machine.Machine` class to use k8s for launching instead of SSH.
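As a rough illustration of that idea, the sketch below uses a stand-in `Machine` base class. It is hypothetical and does not reproduce the interface of `launch.machine.Machine` on the `multi-machine-launch` branch; the `wrap_command` method, the constructor arguments, and the choice of `kubectl exec` into an already-running pod are all assumptions made for the example.

```python
import shlex
from abc import ABC, abstractmethod
from typing import List


class Machine(ABC):
    """Hypothetical launch target; not the real launch.machine.Machine API."""

    @abstractmethod
    def wrap_command(self, cmd: List[str]) -> List[str]:
        """Return the argv that runs `cmd` on this machine."""


class SshMachine(Machine):
    """Runs the command on a remote host through the ssh client."""

    def __init__(self, hostname: str, user: str) -> None:
        self.hostname = hostname
        self.user = user

    def wrap_command(self, cmd: List[str]) -> List[str]:
        # ssh passes a single string to the remote shell, so quote each part.
        return ["ssh", f"{self.user}@{self.hostname}", shlex.join(cmd)]


class K8sMachine(Machine):
    """Runs the command inside an existing pod (an assumed, simplistic strategy)."""

    def __init__(self, pod: str, namespace: str = "default") -> None:
        self.pod = pod
        self.namespace = namespace

    def wrap_command(self, cmd: List[str]) -> List[str]:
        return ["kubectl", "exec", "-n", self.namespace, self.pod, "--", *cmd]


if __name__ == "__main__":
    node_cmd = ["ros2", "run", "demo_nodes_cpp", "talker"]
    for machine in (SshMachine("robot1", "ros"), K8sMachine("talker-pod")):
        print(machine.wrap_command(node_cmd))
```

The launch description only ever talks to `Machine`, so swapping SSH for another transport would not change how nodes are declared.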
@pjreed wrote:
When I made my `ssh_machine` launcher, actually, one of the considerations I had in mind is that it would be good if the remote launching mechanism wasn't tied to any particular implementation; in theory you could also write, say, a `K8sMachine` class that extends the `launch.machine.Machine` class to use k8s for launching instead of SSH.
the abstraction is nice, but wouldn't that still mean that `launch` is doing the orchestration? It would be delegating some parts of it to k8s, but that would be it.
My assumption is that it's going to take "the ROS community" quite some time to get to the level of functionality which these existing solutions have already reached. Besides that, it would also seem to be duplication of effort, which is never very nice.
I think overall this looks straightforward, but it still has remnants of the process management node, which shouldn't be necessary for the simple version of this design, right?
Yeah, I think you're right. I've removed those references and fixed the errors you caught.
The document has changed so much, and it's been so long since I last submitted a PR for multi-machine launch changes, that I figured it would be more appropriate to just create a new one.