grncdr / js-shell-parse

parse bash, with javascript (UNMAINTAINED)
MIT License

Shell syntax and terminology (substitution, subshells, etc.) #1

Open geoff-nixon opened 10 years ago

geoff-nixon commented 10 years ago

Couple of things:

I bring this up because: [Edit: note the discussion of the use of the term subshell in the comments that follow.]

I think it might be worth asking (given the name and the tagline you've given the project): is it your intention to parse out shell or bash? I'm not super hardline anti-bash, but there are some very challenging extensions in bash (things like [[ $* =~ (.*) ]] && BASH_REMATCH[1]) which you might be able to avoid, at least for the time being, if you were to limit yourself to POSIX shell rather than bash, per se.

grncdr commented 10 years ago

Hey sorry I've been driving cross-country for the last couple of days, but I agree with all your suggestions/concerns and will address them when I get back to work on this.

On Jan 3, 2014 4:54 PM, "Geoff Nixon" notifications@github.com wrote:

Couple of things:

  • <() and >() are process substitution, not command substitution.
  • $() and backticks are both command substitution; the first being the preferred syntax because nesting backticks is heinous. But they are 100% equivalent.
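
As a small illustration of the two command substitution forms, here is a POSIX sh sketch. The path /usr/local/bin/tool is a made-up example; nothing needs to exist on disk, since dirname and basename only operate on the string.

```sh
# Both assignments run `dirname` first, then feed its output to `basename`.
# $() nests without any escaping.
modern=$(basename "$(dirname /usr/local/bin/tool)")

# Backticks need the inner pair escaped with backslashes in order to nest.
legacy=`basename \`dirname /usr/local/bin/tool\``

printf '%s\n' "$modern" "$legacy"
```

Both variables should end up holding the same string, which is the sense in which the two forms are equivalent; the escaping in the second form is the part that gets unpleasant as nesting deepens.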

I bring this up because:

  • Command substitution does not necessarily invoke a subshell, and in most cases won't. The only thing it guarantees is out-of-order evaluation: it simply evaluates the expressions inside the substitution first, then evaluates the surrounding statement using the results of the substitution.
  • There aren't any builtins that explicitly mean 'subshell'; to do exactly that (without a compound action or backgrounding the task), it's just sh -c ....
  • Process substitution always invokes a subshell, since it is equivalent to invoking the expression inside the substitution and redirecting its output to an anonymous file descriptor in a background task, then invoking the outer expression with its input being the contents of that file descriptor. Might also be worth noting that the <() syntax is a pretty notorious bashism (http://mywiki.wooledge.org/Bashism), although it's also present in ksh93 and zsh.
    • zsh (...and very recent bash?) also has the related =(), meaning file substitution, taking the same form as process substitution but using a regular file rather than a file descriptor or a FIFO.
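
The equivalence described for process substitution can be approximated by hand in POSIX sh with a named pipe. This is only a sketch of roughly what something like `cat <(printf '%s\n' b a c | sort)` arranges in bash, not a description of any shell's actual implementation; the variable names and the temporary directory are assumptions made for the example.

```sh
# Hypothetical scratch location for the named pipe (mktemp -d is widely
# available, though not part of POSIX).
tmpdir=$(mktemp -d)
fifo=$tmpdir/inner-output
mkfifo "$fifo"

# "Inner" command: runs as a background task, writing into the pipe.
printf '%s\n' b a c | sort > "$fifo" &

# "Outer" command: reads the inner command's output through the pipe.
result=$(cat "$fifo")

wait             # reap the background task
rm -r "$tmpdir"  # clean up the pipe and its directory
printf '%s\n' "$result"
```

The outer command only ever sees a file name to read from, which is why process substitution is useful with programs that insist on file arguments rather than standard input.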

I think it might be worth asking (given the name you've given the project): is it your intention to parse out shell or bash? I'm not super hardline anti-bash, but there are some very challenging extensions in bash (things like [[ $* =~ (.*) ]] && BASH_REMATCH[1]) which you might be able to avoid, at least for the time being, if you were to limit yourself to POSIX shell rather than bash, per se.


grncdr commented 10 years ago

Also, regarding POSIX and/or bash: I have been thinking I will probably aim to support POSIX first and bash-isms eventually.

geoff-nixon commented 10 years ago

No worries! Have fun.

grncdr commented 10 years ago

I'm updating the grammar to name things correctly (processSubstitution and commandSubstitution) and referring to the POSIX shell specs, but they seem to contradict your point regarding subshells. I've included the relevant passages below to get your input. It's very possible I'm misreading them, or not far along enough on the journey to understand some subtlety of implementation that's going to bite me later.

First, this passage in the section describing command substitution (emphasis is mine)

The shell shall expand the command substitution by executing command in a subshell environment (see Shell Execution Environment) and replacing the command substitution (the text of command plus the enclosing "$()" or backquotes) with the standard output of the command, removing sequences of one or more <newline> characters at the end of the substitution.

Further reading of the Shell Execution Environment section implies that a subshell environment is only optional in the case of a pipeline: (again, emphasis is mine)

Command substitution, commands that are grouped with parentheses, and asynchronous lists shall be executed in a subshell environment. Additionally, each command of a multi-command pipeline is in a subshell environment; as an extension, however, any or all commands in a pipeline may be executed in the current environment. All other commands shall be executed in the current shell environment.
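
Whatever an implementation does with processes, the observable effect of these rules is that changes made inside a subshell environment do not leak back out. A minimal POSIX sh sketch, with variable names chosen purely for illustration:

```sh
x=outer

# Command substitution: the assignment happens in a subshell environment.
captured=$(x=changed; printf '%s' "$x")
after_substitution=$x

# Parenthesized group: also a subshell environment.
( x=changed )
after_parens=$x

# Brace group: runs in the current shell environment, so the change sticks.
{ x=changed; }
after_braces=$x

printf '%s\n' "$captured" "$after_substitution" "$after_parens" "$after_braces"
```

Here `captured` sees the inner value, the first two "after" variables still hold the outer value, and only the brace group modifies `x` for the rest of the script.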

geoff-nixon commented 10 years ago

I'm updating the grammar to name things correctly (processSubstitution and commandSubstitution) and referring to the POSIX shell specs, but they seem to contradict your point regarding subshells.

This is probably an abuse of the term subshell on my part, probably due to the bad influence of pages like this one. I was using the term roughly as it is "defined" at that link, that is, to mean when the shell forks another shell in an additional child subprocess. It's probably worth noting the page there is very confusing and self-contradictory, e.g.:

In general, an external command in a script forks off a subprocess, whereas a Bash builtin does not.

...a bizarre statement, since the invocation of any external utility (simply by virtue of its being an external utility) will necessarily, always fork a subprocess... but that's not even relevant. So I don't really recommend using that link as any type of reference. But to reiterate, the condition I was referring to is when the shell has forked a child subprocess of another instance of the same shell; and I believe this is (maybe unfortunately) a rather commonly used meaning.

As you've rightly pointed out, the specification also uses the term subshell (or rather, subshell environment). My understanding of the distinction is that while a new subshell environment necessitates a freshly initialized Shell Execution Environment, it doesn't necessarily mean that a new process needs to be forked. This is probably easiest to understand in a historical context: I believe that the original Bourne shell would, indeed, fork a new process for each command substitution, and the purpose of the mandated subshell environment is to ensure that all implementations of command substitution continue to return results consistent with that behavior. So conceivably, one could still simply always fork in any modern shell, as naturally it will return conformant results. However (as one can imagine), doing this is expensive and inefficient, so in actuality modern shells do not fork new shell processes unless they have to. Instead, they reinitialize the environment, perform the commands within the substitution, then return to the original environment, all within the same process.
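
One detail that may help when comparing a subshell environment with a separately launched shell such as sh -c: POSIX describes the subshell environment as starting out as a duplicate of the current one (traps aside), so unexported variables are visible inside it, whereas a new shell process only inherits what was exported. A small sketch, assuming a variable name (`demo_setting`) that is not already present in the environment:

```sh
demo_setting=visible   # deliberately NOT exported

# Subshell environment: starts as a copy of the current shell's state.
in_subshell=$(printf '%s' "${demo_setting:-unset}")

# Separate shell process: only exported variables are passed along.
in_new_shell=$(sh -c 'printf "%s" "${demo_setting:-unset}"')

printf '%s\n' "$in_subshell" "$in_new_shell"
```

This difference is observable regardless of whether the shell forks for the substitution, which is one reason the specification talks about environments rather than processes.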

Does that make sense?

grncdr commented 10 years ago

yes, that's been my reading of it as well (and I agree about the unfortunate and confusing use of the term "subshell").

Implementing these semantics is a problem for the yet-to-be-implemented interpreter, but I definitely appreciate you taking the time to help me think through them early on. Due to some of the design constraints I have in mind for the interpreter, implementing subshells within the same process will probably be easier than spawning a new process anyway, so there's not much risk of ending up with the super slow approach.

geoff-nixon commented 10 years ago

Right. I think (especially in light of the preceding confusion) that perhaps the term subshell should be avoided altogether, except as necessary when referring to external documents. I think you've already eliminated all uses of it thus far with your latest commit. From a lexicographical standpoint, I don't think it has much value, since it's essentially an implementation-specific detail. One could simply note whether or not the environment needs initializing.

grncdr commented 10 years ago

yes, the new grammar (and resulting AST) makes no reference to subshells anymore; it's just "command substitution" as far as the parser is concerned.

geoff-nixon commented 10 years ago

Regarding a JS-native shell interpreter: I've been meaning to try my hand at seeing how dash and mksh compile with emscripten for some time now, actually. Since both are POSIX-compliant, decently portable, and liberally licensed, the results might be of some interest, either as a starting point for writing an interpreter, or at least to compare against for accuracy. You think?

In any case, I think I'll retitle this issue to be more descriptive. Would you prefer it closed, or use it for further discussion? In my own opinion, issue #1 is typically as good a place as any for discussing a young project, but I don't want to impose on your 'flow', so if you'd prefer an alternate venue...?