gsethi / addama

Automatically exported from code.google.com/p/addama

Improved behavior of script-execution-svc for command line parameters #30

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
ScriptExecutionSvc's argument parsing is blind to quotes, and the quotes it 
adds tend to damage the results. Any "&" or "=" in the input would need to be 
double-escaped to be distinguished from the ones implicit to the query string, 
and then decoded by the target script to avoid misbehavior. The parser needs 
to be rewritten to avoid doing this damage.

The behavior ScriptExecutionSvc currently engages in is, therefore, 
impractical. This isn't a matter of "working as designed"; perhaps it is, but 
an oddly-rearranged already-decoded HTTP query string is not the standard way 
of passing input to a Linux command-line tool.

We expect these scripts to run under a Linux environment, so it seems 
reasonable to tune our interface to Linux expectations. That said, different 
applications may have very divergent command-line needs (some may require input 
to be in the form of configuration files instead of on the command line at 
all), so the system should be flexible to allow for versatile output.
Furthermore, some frontends may want to pass data in the form of JSON, while 
other applications may want key-value pairs in the URL. Accommodating both of 
these is not very difficult and would significantly simplify use of the service.

A script as exposed by Addama is, generally, a single command-line application. 
It may be accessed by any number of frontends with a set of standardized HTTP 
queries. While most applications will have a "standard" user interface, 
applications that get hooked by other automated tools may not be so stapled to 
that interface.

As such, the server should decide how to interpret arguments, but the client 
should decide how to present them. Consequently, the Java/getopt decision 
should be part of the service configuration, but the service should be ready to 
run off of either JSON-formatted data or URL-form-encoded data.

(There is one more option for input formats, and that is "legacy": 
specifically, the exact behavior Addama currently provides, intended only for 
backwards compatibility as it is strictly less expressive than either of the 
other two formats, and thoroughly nonstandard. A fourth format, "squished", 
can be considered: roughly, "squished" is what Addama's current behavior was 
intended to be, in which the de-escaped HTTP query string is presented as the 
first argument. Addama's behavior was never equivalent to the "squished" 
format, so there is no legacy code it can possibly support, and "squished" is 
impractical compared to the other formats, so there is no sane reason to 
implement this mode. Features start out at minus 100 points.)

I propose the following:

    * A new per-script field in script-execution-svc.config, "argformat"
    * ScriptExecutionSvc accepts either form-encoded URLs (with the appropriate Content Type), or JSON data containing only strings and nulls in the POST body (with content type text/plain)

The two considerations are largely independent, so the correct architecture is 
an input processor converting the request into an intermediate "ScriptParams" 
object (a String-to-String hash table and an array of String), and a second 
component converting the ScriptParams and the script info into the String[] to 
be passed to Runtime.exec. 
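
A minimal Java sketch of the intermediate object described above. The names 
ScriptParamsSketch, putOption, putFlag and addPositional are hypothetical and 
do not exist in Addama; a null value in the map is assumed to stand for a 
key-null pair (a flag).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical intermediate object; not part of the current Addama code base.
final class ScriptParamsSketch {
    // Option name -> value. A null value marks a key-null pair (a flag).
    private final Map<String, String> options = new LinkedHashMap<>();
    // Positional parameters, in the order they go on the command line.
    private final List<String> positional = new ArrayList<>();

    void putOption(String name, String value) { options.put(name, value); }

    void putFlag(String name) { options.put(name, null); }

    void addPositional(String value) { positional.add(value); }

    boolean isFlag(String name) {
        return options.containsKey(name) && options.get(name) == null;
    }

    Map<String, String> options() { return Collections.unmodifiableMap(options); }

    List<String> positional() { return Collections.unmodifiableList(positional); }
}
```

LinkedHashMap is used here only so that the generated command line is 
deterministic; the pairs themselves are unordered, so any map that permits 
null values would do.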

Both new output formats have two distinct parts:

    * A collection of unordered key-value pairs (including key-null pairs, which get special treatment)
    * An ordered list of positional parameters

This is exactly what both GNU getopt and Java's -D formats rely on. (Edited 
after ten minutes of research: -D is not widely used and we probably shouldn't 
bother supporting it, either; we should throw it all in behind getopt.) Both 
formats have a section of key-value pairs mixed with flags, followed by some 
number of positional parameters. So the inputs are the same; the expression is 
just mildly different. Getopt uses --name=value, while Java uses -Dname=value, 
and each is followed by its positional parameters. Flag options are --name or 
-Dname=true.
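
The getopt rendering could look roughly like this in Java. It is a sketch 
only: GetoptRendererSketch and render are made-up names, and the options map 
is assumed to use a null value for a flag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical renderer; illustrates the getopt-style output only.
final class GetoptRendererSketch {
    // Builds the argument vector: script path, then options and flags,
    // then the positional parameters in their given order.
    static String[] render(String scriptPath,
                           Map<String, String> options,
                           List<String> positional) {
        List<String> argv = new ArrayList<>();
        argv.add(scriptPath);
        for (Map.Entry<String, String> e : options.entrySet()) {
            if (e.getValue() == null) {
                argv.add("--" + e.getKey());                      // flag
            } else {
                argv.add("--" + e.getKey() + "=" + e.getValue()); // key-value pair
            }
        }
        argv.addAll(positional); // unmodified, one array element each
        return argv.toArray(new String[0]);
    }
}
```

Each value stays a single array element, so a value containing spaces is 
never split or quoted by this step.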

Once we have interpreted our query as a set of key-value pairs and a list of 
parameters, figuring out the output to the script is reasonably easy. (We don't 
even have to solve quoting problems: JSON standards and %-escaping let us 
simply trust the query string. Since exec does not parse the command line and 
just passes the args through when called with a String[] as the first 
parameter, we don't have to escape anything ourselves.) Now we need to 
represent the key-value (or key-null) pairs, plus the params list, twice: once 
in JSON and once in a form-encoded query.

As a matter of convention, because it is a highly reserved symbol to bash, 
parameter names almost never start with $ in Linux software. We can therefore 
use that symbol to mean almost exactly the same thing it means in bash while 
running little risk of breaking compatibility with the command-line format of 
software. (In any event, this comes out orders of magnitude closer to "natural" 
command lines than does the current format of ScriptExecutionSvc, so we can't 
lose.)

With these considerations in mind, the JSON format is easier to describe.
One JSON object containing:

    * Any number of arbitrary primitive elements, using any legal names other than $ARGS
    * Any number of arbitrary null elements, using any legal names other than $ARGS
    * An array of String elements named exactly $ARGS, which may not have null elements, but may be null, and may be missing

(Null elements are a special type of primitive, but calling them out seems 
appropriate.)
Key-value pairs are created from any primitive other than a Boolean that 
contains a value, including the empty string. Key-null pairs are created from 
null values, and are interpreted simply as flags. Key-null pairs are also 
created from Boolean values containing true. Boolean values containing false 
are discarded as though they were not present; they represent flags that should 
not be set. (While the efficient thing to do is to omit them from the request, 
it is very likely to be easier to write an interface that simply permits them 
to be boolean false.) Be aware that this gives the Boolean value true and the 
String value "true" extremely different behaviors, and similarly for false and 
"false".

Positional parameters are generated from $ARGS. Every element will be cast to a 
String, and nulls will be skipped. This is overly tolerant, because the spec 
defines that $ARGS must contain only Strings and no nulls, but this lenience 
will make Javascript development more practical. Elements are in the order they 
appear in $ARGS, and the only exception is the deletion of null values and 
conversions to string. They will be placed, unmodified, in order, on the 
command line. (Note that this allows for full functionality with programs that 
are not shaped like getopt- the entire interface of the program can be written 
in an ordered way in $ARGS. This is entirely by design.)
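
The JSON rules above (options, flags, and $ARGS) can be sketched in Java as 
follows. This assumes some JSON library has already parsed the POST body into 
a Map<String, Object> whose values are String, Number, Boolean, null, or List; 
the class and method names are hypothetical. Rejecting nested objects is an 
assumption of the sketch, not something the proposal spells out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interpreter for an already-parsed JSON object.
final class JsonParamsSketch {
    // Options and flags; a null value in the result marks a flag.
    static Map<String, String> toOptions(Map<String, Object> json) {
        Map<String, String> options = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : json.entrySet()) {
            String name = e.getKey();
            Object value = e.getValue();
            if ("$ARGS".equals(name)) {
                continue;                        // handled by toPositional
            }
            if (value == null || Boolean.TRUE.equals(value)) {
                options.put(name, null);         // key-null pair: flag
            } else if (Boolean.FALSE.equals(value)) {
                // flag that should not be set: dropped
            } else if (value instanceof Map || value instanceof List) {
                throw new IllegalArgumentException("not a primitive: " + name);
            } else {
                // Strings (including "" and "true") and numbers become values.
                options.put(name, String.valueOf(value));
            }
        }
        return options;
    }

    // Positional parameters from $ARGS: nulls skipped, everything else
    // converted to String, order preserved.
    static List<String> toPositional(Map<String, Object> json) {
        List<String> positional = new ArrayList<>();
        Object args = json.get("$ARGS");
        if (args == null) {
            return positional;                   // missing or null $ARGS
        }
        if (!(args instanceof List)) {
            throw new IllegalArgumentException("$ARGS must be an array");
        }
        for (Object element : (List<?>) args) {
            if (element != null) {
                positional.add(String.valueOf(element));
            }
        }
        return positional;
    }
}
```

How a number is rendered depends on the type the JSON library chose for it 
(an Integer 2 becomes "2", a Double 2.0 becomes "2.0"), so clients that care 
about the exact text should send strings.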

The form-encoded query interface is quite different. While JSON has data types, 
a form has only strings. Each parameter of the query is in the form 
name=percent-encoded string, and parameters are separated by "&". Java's 
libraries have built-in parsing for this, so we will largely ignore the details 
of the string: we have parameters and values, and that is all we need.

Parameters that do not have a name starting with "$" are treated as options or 
flags. Their key-value pair is created from the name and the unescaped form of 
their parameter. If the parameter is empty (the = is immediately followed by an 
&, which is legal in HTML forms to represent the blank string), it is a 
key-null pair treated as a flag rather than an argument. (The URL-encoded form 
is, therefore, less expressive than the JSON form, because JSON can 
specifically represent an empty string as separate from a null.) "true" and 
"false" have no special meaning and are always treated as strings with those 
values.
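
A rough Java sketch of the option and flag rule for form-encoded input. In the 
service itself the servlet container would normally supply the decoded 
parameters; the manual split below only keeps the example self-contained. 
FormOptionsSketch and parseOptions are hypothetical names, the Charset 
overload of URLDecoder.decode needs Java 10 or later, and treating a name with 
no "=" at all as a flag is an assumption.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for the non-$ parameters of a form-encoded query.
final class FormOptionsSketch {
    // Returns option name -> value; a null value marks a flag.
    static Map<String, String> parseOptions(String rawQuery) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String pair : rawQuery.split("&")) {
            if (pair.isEmpty()) {
                continue;                        // tolerate "a=1&&b=2"
            }
            // Split on the first raw "="; escaped ones (%3D) stay in the value.
            int eq = pair.indexOf('=');
            String name = decode(eq < 0 ? pair : pair.substring(0, eq));
            String value = eq < 0 ? "" : decode(pair.substring(eq + 1));
            if (name.startsWith("$")) {
                continue;                        // $-names follow a separate rule
            }
            // Empty value: key-null pair (flag). Otherwise a plain string,
            // with no special meaning for "true" or "false".
            options.put(name, value.isEmpty() ? null : value);
        }
        return options;
    }

    private static String decode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }
}
```

Because the split happens before decoding, an escaped "&" or "=" inside a 
value survives as data and is never mistaken for a separator.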

Parameters that start with $ are checked to see if they match the pattern of 
$n, where n is a non-negative integer of any magnitude. If the name is 
something other than $n, it is interpreted literally as a named parameter. If 
it is $n, however, it is treated as the positional parameter at index n 
(counting from $0), and the name $n therefore will not exist in the output. If 
one really, really needs to use a flag named "$0" or something, use the first 
positional parameter (or first few positional parameters, more likely) to 
explicitly write out the getopt flag syntax. In general, flags with names 
starting with $ should probably be avoided for reasons of "that's what an 
environment variable looks like" anyway.

Positional parameters do not need to be consecutive or listed in order. 
Form-encoded parameters are represented in a hash table, so order isn't 
preserved anyway. Requiring every parameter from $0 onwards to be specified is 
more strict than we need to be, and could actually involve quite a bit more 
bookkeeping than may be desired in a simple Javascript application in some 
situations. Most uses will go from $0 onwards, but once again, there is little 
harm here in being "more compatible than expected".
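
One way to collect the $n parameters, sketched in Java under a few 
assumptions: the input is a map of already-decoded request parameters, gaps in 
the numbering are simply closed up, and PositionalSketch and collect are 
made-up names. BigInteger is used because n may be of any magnitude.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical collector for $n positional parameters.
final class PositionalSketch {
    // "$" followed by one or more decimal digits, and nothing else.
    private static final Pattern INDEXED = Pattern.compile("\\$[0-9]+");

    static List<String> collect(Map<String, String> decodedParams) {
        // TreeMap keeps the entries sorted by numeric index, so the
        // iteration order of the incoming map does not matter.
        TreeMap<BigInteger, String> byIndex = new TreeMap<>();
        for (Map.Entry<String, String> e : decodedParams.entrySet()) {
            String name = e.getKey();
            if (INDEXED.matcher(name).matches()) {
                byIndex.put(new BigInteger(name.substring(1)), e.getValue());
            }
            // Anything else, including "$x", is not positional and is ignored here.
        }
        return new ArrayList<>(byIndex.values());
    }
}
```

In this sketch "$1" and "$01" name the same index, and whichever the map 
yields last wins; a real implementation might prefer to reject such 
duplicates.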

Interactions with the magical "label" parameter are an exercise for the reader.

Original issue reported on code.google.com by anorberg...@gtempaccount.com on 10 Jan 2011 at 7:18

GoogleCodeExporter commented 8 years ago
Using an HTTP query string may not be standard for Linux, but it is a well 
known standard.  It is simple, documented and widely adopted.  

However, I do agree that we should better adapt to existing command-line 
interfaces.  I would focus on integrating the commonly used programming 
languages at ISB (Matlab, R, Perl, Python, Ruby) and would de-prioritize Java, 
as it is not widely used by computational biologists.

I'm not sure if I understand what you propose with parameters that start with 
$.  Do you mean HTTP request parameters, or shell parameters?

You should also consider that this service is providing a REST API that is 
accessible by HTML forms, Ajax requests and command-line tools.  All of those 
clients should be accommodated.

Nicely done, keep up the good work!

Original comment by hrovira.isb on 12 Jan 2011 at 5:48

GoogleCodeExporter commented 8 years ago

Original comment by hrovira.isb on 12 Jan 2011 at 5:48

GoogleCodeExporter commented 8 years ago
With regard to $-named parameters, that was referring specifically to 
parameters in the form-encoded variant of the web API. A request looking like

    ?breakfast=eggs&guests=2&$0=scrambled&$1=orange%20juice

would be run as

    script --breakfast=eggs --guests=2 scrambled orange juice

Be aware that "orange juice" is one parameter, not two. No quotes are added; 
that's just how we'd provide the parameters. The specially-named $0 and $1 
become positional parameters, not options.

Original comment by anorberg...@gtempaccount.com on 12 Jan 2011 at 10:15

GoogleCodeExporter commented 8 years ago

Original comment by hrov...@systemsbiology.org on 27 Dec 2011 at 7:43