gabriel-v closed this issue 4 weeks ago
Thanks @gabriel-v for sharing these ideas. I'll need some time to think through some of the implications for backwards compatibility.
One thing to keep in mind is that a common use of `script()` is to run scripts within a Docker image on a batch cluster, such as AWS Batch. In such a setting, you will not be able to stream stdin/stdout/stderr in real time from the running script. That is one reason for the copying of output and errors to files (so we can unstage them back to the scheduler running outside the cluster).
I do think it's very doable to have `script()` accept an argv list as a first argument, in addition to allowing a shell string as a first argument. We would then just do the `shlex.join(argv)` for the user, which would safely handle shell quoting. Let me know what you think.
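For reference, a minimal standalone sketch of that conversion (just the standard library, outside of redun):

```python
import shlex

# An argv list with arguments that would break naive string concatenation.
argv = ["grep", "-n", "hello world", "it's a file.txt"]

# shlex.join (Python 3.8+) quotes each argument so that a POSIX shell
# parses the resulting string back into exactly the same argv list.
cmd = shlex.join(argv)

# Round trip: splitting the joined command recovers the original list.
assert shlex.split(cmd) == argv
```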
Ah, forgot about `shlex`. Thanks for the suggestion!
> That is one reason for the copying of output and errors to files (so we can unstage them back to the scheduler running outside the cluster).
Ah, so that's why it doesn't make sense to support `cwd`/`pwd` options: they don't translate well between hosts that use different temp dirs for storing these files between runs.
As commented on the other issues, there isn't a problem that can't be solved by staging more files and improvising scripts with `export A=b ; cd /c; <script>`.
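As a sketch of that workaround (a hypothetical helper, not part of redun), the env/cwd prelude can be generated safely with `shlex.quote`:

```python
import shlex

def with_env_and_cwd(script: str, env: dict = None, cwd: str = None) -> str:
    """Hypothetical helper: prepend export/cd lines so that env vars
    and the working directory travel as part of the staged script itself."""
    lines = []
    for key, value in (env or {}).items():
        # Quote values so spaces and shell metacharacters survive.
        lines.append(f"export {key}={shlex.quote(value)}")
    if cwd:
        lines.append(f"cd {shlex.quote(cwd)}")
    lines.append(script)
    return "\n".join(lines)
```

For example, `with_env_and_cwd("make test", env={"A": "b"}, cwd="/c")` produces the `export A=b` / `cd /c` prelude followed by the script.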
Let me know what you think.
I'm looking to port a codebase to redun that handles its own data loading on workers and uses `subprocess` and `tempfile` to manage all the shelling out - so I won't need any staging/unstaging functionality at all. Because of that, I don't need to use `script()` for this - thanks for the clarification! I'll just have my redun tasks sit above my calls to `subprocess`.
Also, the valid-UTF-8 problem is a very specific one - it makes sense to keep handling it internally in my own codebase until more people complain about it (if ever).
`script()` now accepts an argv list as a first argument.
The `script()` task accepts a single string with the script. It writes that script to a file and uses `subprocess.run(cmd, shell=True)` to run that file. This is a problem when your arguments need to be escaped for bash - think filenames that contain spaces, quotes, double quotes, or any other special bash character. That means the caller needs to escape all arguments using bash syntax, which is not a simple thing to do.
`subprocess`, on the other hand, allows calling processes by argument list, and sends those arguments directly to `execv` and friends - no escaping needed. A second limitation of the current system is that it assumes the command is valid UTF-8, which is not always the case - see https://github.com/insitro/redun/issues/79
I propose we add a check in `script(cmd)` for whether `cmd` is a list. If it is, pass it as a list, untouched, to `subprocess.run(cmd, shell=False)`. If it's not a list, keep doing what it's been doing. There will be a complication with the intermediary script: could we avoid saving the intermediary script to disk entirely and just run subprocess directly?
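A sketch of that dispatch (hypothetical helper name; redun's actual implementation may differ):

```python
import subprocess
from typing import List, Union

def run_command(cmd: Union[str, List[str]]) -> subprocess.CompletedProcess:
    """Hypothetical sketch of the proposed check in script()."""
    if isinstance(cmd, list):
        # List form: passed untouched to execv -- no shell, no quoting.
        return subprocess.run(cmd, shell=False)
    # String form: keep today's behavior, interpreting cmd via the shell.
    return subprocess.run(cmd, shell=True)
```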
Thinking some more about it, I feel like this `script()` should be a thin wrapper over the entire `subprocess.run`/`Popen` API - so instead of staging a bash file that we dump the script into, we stage a pickle of `(args, kwargs)`. Instead of running bash, we run something like `python -c "import subprocess, pickle, sys; args, kwargs = pickle.load(sys.stdin.buffer); sys.exit(subprocess.run(*args, **kwargs, stdout=sys.stdout, stderr=sys.stderr).returncode)"`.
That way, existing scripts will be unchanged (it's still `subprocess.run` on some text with `shell=True`) - but now you suddenly support all the other `subprocess.run` arguments, like `env`, `cwd`, and `timeout`, that some other issues were asking for (https://github.com/insitro/redun/issues/67).

We'd need to make sure this is really backwards compatible and doesn't break any existing `script()` workflows - I think all we need to do in the pickle packer proposed above is default `shell` to True instead of False. For safety, we would also want to prevent users from setting:

- `stderr` or `stdout` to anything - we forward those
- `stdin` to anything - it's probably some PIPE that will not survive execution; users should use the `input` argument instead, with a string or bytestring
- `capture_output` - same as above

Anything else should be fair game, though.
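A minimal sketch of that packer and its safety check (hypothetical names, written against the restrictions above; not a definitive implementation):

```python
import pickle
import subprocess

# Options that cannot survive the trip to a remote worker.
FORBIDDEN = {"stdin", "stdout", "stderr", "capture_output"}

def pack_call(*args, **kwargs) -> bytes:
    """Hypothetical packer: serialize a subprocess.run() call for a worker."""
    bad = FORBIDDEN & kwargs.keys()
    if bad:
        raise ValueError(f"options not supported remotely: {sorted(bad)}")
    # Default shell=True so existing string-based script() calls behave
    # exactly as before.
    kwargs.setdefault("shell", True)
    return pickle.dumps((args, kwargs))

def run_packed(payload: bytes) -> int:
    """What the remote `python -c` stub would do with the payload."""
    args, kwargs = pickle.loads(payload)
    return subprocess.run(*args, **kwargs).returncode
```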
Or, we just document these limitations and invite users to call `subprocess` themselves - that's perfectly fine.