mandel59 commented 5 years ago

Pure computations always run on the local worker with the current Ssh.do implementation. How can the whole data flow be computed on the remote host?

For instance, let's think about the following code:

basedir = "/etc"
dirs <- Ssh.do { user: "user", host: "example.com" } (
    files <- File.listStatus basedir,
    dirs = (file <- files, [file.name | file.isDirectory]),
    Task.of dirs
)

At first glance, it seems that the files are listed and the directories among them are selected on the remote server, and then the directory names are returned to the local machine, like the following shell script line:

echo /etc | ssh user@example.com 'xargs -L1 ls -lA | grep ^d | sed -E "s/([^ ]+ +){8}//"'

In fact, it doesn't work that way. Only actions are sent to and run on the remote host. Pure computations and the pipeline run locally, like this:

echo /etc | ssh user@example.com 'xargs -L1 ls -lA' | grep ^d | sed -E "s/([^ ]+ +){8}//"
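To make the difference concrete, here is a minimal Python sketch that simulates both placements of the pure filtering step. Everything in it is hypothetical: `REMOTE_ETC` stands in for the remote filesystem, and the returned count stands in for the number of entries that would cross the SSH connection. It is not TopShell code and does not reflect how Ssh.do is implemented.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FileStatus:
    name: str
    is_directory: bool


# Made-up stand-in for the contents of /etc on the remote host.
REMOTE_ETC = [
    FileStatus("ssh", True),
    FileStatus("hosts", False),
    FileStatus("ssl", True),
]


def list_status(basedir: str) -> List[FileStatus]:
    # The action: this part has to run where the files are.
    return list(REMOTE_ETC)


def directory_names(files: List[FileStatus]) -> List[str]:
    # The pure computation: it can run on either side.
    return [f.name for f in files if f.is_directory]


def actions_only_remote(basedir: str) -> Tuple[List[str], int]:
    # Behaviour described above: every entry is shipped, then filtered locally.
    shipped = list_status(basedir)
    return directory_names(shipped), len(shipped)


def whole_flow_remote(basedir: str) -> Tuple[List[str], int]:
    # Desired behaviour: filter on the remote side, ship only the result.
    shipped = directory_names(list_status(basedir))
    return shipped, len(shipped)
```

Both functions return the same directory names; only the amount of data shipped differs (three entries versus two in this toy data).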

Ahnfelt commented 5 years ago

Indeed. I see two approaches here:

Bring the data to the program (TopShell, second Bash example).
Bring the program to the data (first Bash example, and also MapReduce).

Shell scripts have a primitive but effective way to transmit and run a program remotely: Pack it into a string and ship it off to the target machine in the hope that it has the same flavor of shell preinstalled.

Doing the same thing for TopShell wouldn't be hard, but requiring the target machine to have TopShell preinstalled is rather inconvenient.

Another thing that counts against this approach is that you'd probably want to ship closures rather than just program strings.

Possible solution

One thing I've considered is this:

basedir = "/etc"
dirs <- Ssh.doRemotely { user: "user", host: "example.com" } #(
    files <- File.listStatus basedir,
    dirs = (file <- files, [file.name | file.isDirectory]),
    Task.of dirs
)

Here, #e is syntax for serializing the expression e into something that can be transmitted, deserialized, and executed on the target machine. It would need to be combined with a typing rule such as "if e : t then #e : Program t", some mechanism to check that e can be serialized at all, and, last but not least, a way to actually run the resulting program on the target machine in the absence of a preinstalled TopShell.
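As a rough illustration of the serialize, transmit, deserialize, execute cycle, the Python sketch below represents a quoted program as a small expression tree encoded as JSON. The node names (`lit`, `list_status`, `dir_names`) and the functions `quote_list_dirs`, `serialize`, and `run` are all invented for this sketch; nothing here is an existing TopShell feature. Note that the captured `basedir` value is embedded in the tree as a literal, which is one simple way a closure could travel with the program.

```python
import json
from typing import Any, Callable, Dict, List

Program = Dict[str, Any]  # hypothetical wire format: a nested JSON object


def quote_list_dirs(basedir: str) -> Program:
    # Plays the role of #( ... ): builds a tree instead of running anything.
    return {
        "op": "dir_names",
        "arg": {"op": "list_status", "arg": {"op": "lit", "value": basedir}},
    }


def serialize(program: Program) -> str:
    # json.dumps raises TypeError for values it cannot encode, which acts as
    # a crude "can this be serialized at all?" check.
    return json.dumps(program)


def run(program: Program, list_status: Callable[[str], List[dict]]) -> Any:
    # Tiny interpreter that would live on the target machine.
    op = program["op"]
    if op == "lit":
        return program["value"]
    if op == "list_status":
        return list_status(run(program["arg"], list_status))
    if op == "dir_names":
        files = run(program["arg"], list_status)
        return [f["name"] for f in files if f["isDirectory"]]
    raise ValueError(f"unknown op: {op}")


def fake_list_status(basedir: str) -> List[dict]:
    # Made-up remote directory contents, for demonstration only.
    return [
        {"name": "ssh", "isDirectory": True},
        {"name": "hosts", "isDirectory": False},
        {"name": "ssl", "isDirectory": True},
    ]


wire = serialize(quote_list_dirs("/etc"))            # "local" side
result = run(json.loads(wire), fake_list_status)     # "remote" side
```

The interpreter in this sketch is what the target machine would still need in some form, which is exactly the open question about running without a preinstalled TopShell.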

Of course, this is still just a vague idea and would require a non-trivial amount of work to implement.

What are your thoughts on this? Do you have a different solution in mind?

mandel59 commented 5 years ago

If expressions themselves are serializable, like C#'s expression trees for lambda expressions, the special quasi-quoting syntax #e might not be needed. Moreover, it might let Program be a monad: given ShellScriptProgram.of and ShellScriptProgram.flatMap as implementations of a shell script builder, the binding syntax x <- e1, e2 would be available. Other code builder implementations, like JavaScriptProgram or SqlProgram, could also be introduced.
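A minimal sketch of what such a builder could look like, written in Python rather than TopShell. Only the names ShellScriptProgram, of, and flatMap (here `flat_map`) come from the idea above; the `command` and `render` helpers, the `v0`, `v1`, ... variable naming, and the internal representation are assumptions made for illustration. The key trick is that the function passed to `flat_map` receives a reference to a shell variable rather than a runtime value, so binding generates script text instead of executing anything.

```python
import shlex
from typing import Callable, List, Tuple

# A build step takes the next free variable index and returns
# (script lines so far, shell expression for the result, next free index).
Build = Callable[[int], Tuple[List[str], str, int]]


class ShellScriptProgram:
    def __init__(self, build: Build) -> None:
        self._build = build

    @staticmethod
    def of(value: str) -> "ShellScriptProgram":
        # Lift a plain string into a program that yields it as a quoted literal.
        return ShellScriptProgram(lambda n: ([], shlex.quote(value), n))

    @staticmethod
    def command(cmd: str) -> "ShellScriptProgram":
        # Hypothetical helper: the result is the command's standard output.
        return ShellScriptProgram(lambda n: ([], f'"$({cmd})"', n))

    def flat_map(
        self, f: Callable[[str], "ShellScriptProgram"]
    ) -> "ShellScriptProgram":
        def build(n: int) -> Tuple[List[str], str, int]:
            lines, expr, n1 = self._build(n)
            var = f"v{n1}"
            # f sees a reference such as "$v0", not the value itself.
            later_lines, result, n2 = f(f'"${var}"')._build(n1 + 1)
            return lines + [f"{var}={expr}"] + later_lines, result, n2

        return ShellScriptProgram(build)

    def render(self) -> str:
        # Emit the assignments, then print the final result expression.
        lines, expr, _ = self._build(0)
        return "\n".join(lines + [f"printf '%s\\n' {expr}"])


# Roughly: basedir <- of "/etc", command "ls -lA basedir | grep ^d"
program = ShellScriptProgram.of("/etc").flat_map(
    lambda basedir: ShellScriptProgram.command(f"ls -lA {basedir} | grep ^d")
)
script = program.render()
```

Here `script` is a two-line shell script: an assignment `v0=/etc` followed by a `printf` of the command substitution that uses `"$v0"`. A JavaScriptProgram or SqlProgram builder would presumably follow the same shape with a different `render`.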

topshell-language / topshell

Ssh.do: How to run pure computations on remote? #4
