twosigma / relexec

A program to enable relative shebangs in scripts
Apache License 2.0

Refuse to act if argv[1] contains spaces #4

Open lordmauve opened 3 years ago

lordmauve commented 3 years ago

A possibility for future extension is to tokenise argv[1] as a command line ourselves. This matters because, on Linux, only one argument is passed from the shebang line: effectively, the line is split once at the first space after the interpreter path, and everything after that arrives as a single argument. Relexec requires an argument and thus "consumes" the only available one.

This behaviour is platform dependent, and the exact semantics of this are debatable, but to preserve the flexibility to pursue this direction, we could simply error out if the relative path argument (argv[1]) contains spaces. This would prevent users depending on a specific interpretation of spaces.
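As a rough illustration of the check proposed above, something like the following could run before relexec does anything else. The helper name arg_has_whitespace is hypothetical, and treating tabs the same as spaces is an assumption on my part, not something decided in this issue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of relexec): returns true if the
 * relative-path argument contains a space or a tab, in which case the
 * caller would refuse to act. Rejecting tabs as well is an assumption.
 */
static bool arg_has_whitespace(const char *arg) {
    assert(arg != NULL);
    /* strpbrk returns a pointer to the first space or tab, or NULL if none. */
    return strpbrk(arg, " \t") != NULL;
}
```

In main, this would presumably be called as arg_has_whitespace(argv[1]) after checking argc >= 2, printing an error and exiting non-zero when it returns true.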

geofft commented 3 years ago

Some notes on the platform-dependent behavior:

/*
 * HISTORICAL NOTE: From 1993 to mid-2005, FreeBSD parsed out the tokens as
 * found on the first line of the script, and setup each token as a separate
 * value in arg[].  This extra processing did not match the behavior of other
 * OS's, and caused a few subtle problems.  For one, it meant the kernel was
 * deciding how those values should be parsed (wrt characters for quoting or
 * comments, etc), while the interpreter might have other rules for parsing.
 * It also meant the interpreter had no way of knowing which arguments came
 * from the first line of the shell script, and which arguments were specified
 * by the user on the command line.  That extra processing was dropped in the
 * 6.x branch on May 28, 2005 (matching __FreeBSD_version 600029).
 */

Indeed, on xnu you can see the following behavior:

neko-shogun:tmp geofft$ cat interp.c
#include <stdio.h>

int main(int argc, char *argv[]) {
  printf("argc = %d\n", argc);
  int i;
  for (i = 0; i < argc; i++) {
    printf("argv[%d] = '%s'\n", i, argv[i]);
  }
}
neko-shogun:tmp geofft$ cat script
#!/tmp/interp foo bar

hello
neko-shogun:tmp geofft$ ./script baz
argc = 5
argv[0] = '/tmp/interp'
argv[1] = 'foo'
argv[2] = 'bar'
argv[3] = './script'
argv[4] = 'baz'

There's no obvious way to tell which of the arguments came from where.

This is mostly a problem because the kernel's splitting of the shebang line isn't the same as a shell's parsing. You could say, "Why not just rejoin the arguments on spaces and then split them with a proper shell parser?" But you don't know how many arguments to rejoin.

If you were happy with splitting on spaces alone, then the xnu behavior would be fine. On xnu, #!/usr/bin/env X=Y sh works as you'd hope: it gets parsed as env X=Y sh ./script. So does #!/usr/bin/env sh -x, which gets parsed as env sh -x ./script. (On Linux, the first one gets parsed as env 'X=Y sh' ./script, which goes into an infinite loop, and the second one gets parsed as env 'sh -x' ./script, which fails because there is no command named 'sh -x'.)
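To make the two behaviors concrete, here is a small model of both splitting styles. It is only a sketch and not real kernel code: split_shebang and MAX_TOKENS are made-up names, the input is assumed to be the text after "#!" with the newline already removed, and kernel details such as length limits are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOKENS 16

/*
 * Hypothetical model of shebang splitting. `buf` is modified in place and
 * `out` receives pointers into it.
 *   split_all == false: Linux-style, interpreter plus at most one argument
 *   split_all == true:  xnu-style, split on every run of spaces or tabs
 * Returns the number of tokens stored in `out`.
 */
static size_t split_shebang(char *buf, bool split_all, char *out[], size_t max) {
    assert(buf != NULL && out != NULL);
    size_t n = 0;
    char *p = buf;
    while (n < max) {
        while (*p == ' ' || *p == '\t') p++;      /* skip leading blanks */
        if (*p == '\0') break;
        out[n++] = p;
        if (!split_all && n == 2) {
            /* Linux-style: the rest of the line is one argument.
             * Only trailing blanks are trimmed. */
            size_t len = strlen(p);
            while (len > 0 && (p[len - 1] == ' ' || p[len - 1] == '\t'))
                p[--len] = '\0';
            break;
        }
        while (*p != '\0' && *p != ' ' && *p != '\t') p++;  /* end of token */
        if (*p == '\0') break;
        *p++ = '\0';                               /* terminate this token */
    }
    return n;
}
```

With the input "/usr/bin/env X=Y sh", the Linux-style mode yields two tokens, the second being "X=Y sh", while the xnu-style mode yields three.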

The Linux code points out in passing that, if the interpreter wants, it can just look at the file being interpreted and parse the shebang on its own:

         * We do not want to exec a truncated interpreter path, so either
         * we find a newline (which indicates nothing is truncated), or
         * we find a space/tab/NUL after the interpreter path (which
         * itself may be preceded by spaces/tabs). Truncating the
         * arguments is fine: the interpreter can re-read the script to
         * parse them on its own.

That gives us a way to determine which arguments came from where: we can just read the shebang and split it in the same way the kernel would, and then discard those arguments from argv. Then parse the shebang properly and append the remainder of argv.
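A minimal sketch of the counting step, assuming the first line of the script has already been read into a string (for example with fgets). The names count_shebang_args and is_blank are hypothetical, and the split_all flag is just my way of modelling the xnu versus Linux difference described earlier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

static bool is_blank(char c) { return c == ' ' || c == '\t'; }

/*
 * Hypothetical helper: given the first line of the script (starting with
 * "#!"), return how many argv entries the kernel would have inserted
 * between argv[0] and the script path.
 *   split_all == true:  xnu-style, one entry per whitespace-separated token
 *   split_all == false: Linux-style, at most one entry
 * Returns -1 if the line is not a shebang or names no interpreter.
 */
static int count_shebang_args(const char *line, bool split_all) {
    assert(line != NULL);
    if (strncmp(line, "#!", 2) != 0) return -1;
    const char *p = line + 2;
    int tokens = 0;
    for (;;) {
        while (is_blank(*p)) p++;                  /* skip blanks */
        if (*p == '\0' || *p == '\n') break;       /* end of shebang line */
        tokens++;
        while (*p != '\0' && *p != '\n' && !is_blank(*p)) p++;
    }
    if (tokens == 0) return -1;                    /* no interpreter path */
    int args = tokens - 1;                         /* exclude the interpreter */
    if (!split_all && args > 1) args = 1;          /* Linux passes one argument */
    return args;
}
```

If the result is k, the script path would sit at argv[k + 1] and the user's own arguments would start at argv[k + 2]. For the xnu transcript above, k is 2, which matches './script' at argv[3] and 'baz' at argv[4].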