Refuse to act if argv[1] contains spaces

Some notes on the platform-dependent behavior:

Linux does not split on spaces: see fs/binfmt_script.c
xnu (macOS) does split on spaces: see exec_shell_imgact in bsd/kern/kern_exec.c
FreeBSD used to split on spaces: see exec_shell_imgact in sys/kern/imgact_shell.c. A comment there explains why they stopped:

/*
 * HISTORICAL NOTE: From 1993 to mid-2005, FreeBSD parsed out the tokens as
 * found on the first line of the script, and setup each token as a separate
 * value in arg[].  This extra processing did not match the behavior of other
 * OS's, and caused a few subtle problems.  For one, it meant the kernel was
 * deciding how those values should be parsed (wrt characters for quoting or
 * comments, etc), while the interpreter might have other rules for parsing.
 * It also meant the interpreter had no way of knowing which arguments came
 * from the first line of the shell script, and which arguments were specified
 * by the user on the command line.  That extra processing was dropped in the
 * 6.x branch on May 28, 2005 (matching __FreeBSD_version 600029).
 */

Indeed, on xnu you can see the following behavior:

neko-shogun:tmp geofft$ cat interp.c
#include <stdio.h>

int main(int argc, char *argv[]) {
  printf("argc = %d\n", argc);
  int i;
  for (i = 0; i < argc; i++) {
    printf("argv[%d] = '%s'\n", i, argv[i]);
  }
}
neko-shogun:tmp geofft$ cat script
#!/tmp/interp foo bar

hello
neko-shogun:tmp geofft$ ./script baz
argc = 5
argv[0] = '/tmp/interp'
argv[1] = 'foo'
argv[2] = 'bar'
argv[3] = './script'
argv[4] = 'baz'

There's no obvious way to tell which of the arguments came from where.

This is mostly a problem because of the first point in that comment: the kernel's parsing isn't the same as the interpreter's. You could ask, "Why not just rejoin the arguments on spaces and then split them with a proper shell parser?" But you don't know how many arguments to rejoin.

If you were happy with splitting on spaces alone, then the xnu behavior would be fine. On xnu, #!/usr/bin/env X=Y sh works as you'd hope: it gets parsed as env X=Y sh ./script. So does #!/usr/bin/env sh -x, which gets parsed as env sh -x ./script. (On Linux, the first one gets parsed as env 'X=Y sh' ./script, which sets X to "Y sh" and re-executes the script, so it goes into an infinite loop. The second one gets parsed as env 'sh -x' ./script, which fails because there is no program named "sh -x".)
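A minimal sketch of the check proposed in the title, for the Linux case where everything after the interpreter path arrives as one argument. The name should_refuse is hypothetical, and this only looks for a literal space, which is an assumption about what "contains spaces" should mean here:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: returns true when argv[1] contains a space, which
 * suggests the kernel handed us an unsplit shebang argument string such as
 * "X=Y sh" or "sh -x". Only the space character is checked (an assumption).
 */
bool should_refuse(int argc, char *const argv[]) {
    return argc > 1 && strchr(argv[1], ' ') != NULL;
}
```

With argv = {"env", "X=Y sh", "./script"} this reports true; with argv = {"env", "sh", "./script"} it reports false.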

The Linux code points out in passing that, if the interpreter wants, it can just look at the file being interpreted and parse the shebang on its own:

         * We do not want to exec a truncated interpreter path, so either
         * we find a newline (which indicates nothing is truncated), or
         * we find a space/tab/NUL after the interpreter path (which
         * itself may be preceded by spaces/tabs). Truncating the
         * arguments is fine: the interpreter can re-read the script to
         * parse them on its own.

That gives us a way to determine which arguments came from where: we can just read the shebang and split it in the same way the kernel would, and then discard those arguments from argv. Then parse the shebang properly and append the remainder of argv.
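Here is a rough sketch of the "split it the same way the kernel would" step. It assumes the kernel splits on spaces and tabs only, with no quoting, and that the caller has already read the first line of the script into a string. The name count_shebang_tokens is hypothetical, and how the interpreter locates the script file in the first place is left out:

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: count the tokens a space-splitting kernel would
 * produce from a shebang line, i.e. the interpreter path plus each
 * whitespace-separated argument. Returns 0 if the line does not start
 * with "#!". Assumes separators are space and tab only, with no quoting.
 */
size_t count_shebang_tokens(const char *line) {
    if (strncmp(line, "#!", 2) != 0)
        return 0;

    const char *p = line + 2;
    size_t count = 0;
    for (;;) {
        /* Skip separators before the next token. */
        p += strspn(p, " \t");
        if (*p == '\0' || *p == '\n')
            break;
        count++;
        /* Advance to the end of this token. */
        p += strcspn(p, " \t\n");
    }
    return count;
}
```

For the script above, the line "#!/tmp/interp foo bar" yields three tokens. Since argv[0] is the interpreter itself, the script path then sits at argv[3] and the user's arguments start at argv[4], which matches the xnu transcript.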
