Discussion: refactoring to simplify program flow

perlancar commented 5 years ago

I'm thinking of a refactoring as follow.

Pass through to GoL early if there is no optspec() object involved. This will solve all compatibility issues with GoL, at least when user is not using optspec().
Construct a big common structure (%optspecs, and @optspecs to record the order of keys) that we use for ourselves (GLM) as well as for GoL. In this step, we: a) check required linkages: die when a required linkage is not provided; b) wrap all linkages: whatever the link dest type is, form a wrapper coderef linkage to ease setting default and checking 'required' property; c) pass information from the optspec() objects to this structure.

At the end of this step, all linkages will have been made explicit and we have constructed a paired version of option specifications. A few examples to illustrate. Example 1:

GetOptions(
    'scalar=s' => \$scalar, 
    'array=s' => \@array, 
    'hash=s' => \%hash, 
    'code=s' => \&code, 
    '<>' => &process_args.
);

This is the simplest case, all option specifications have explicit linkages and the. @optspecs and %optspecs will become:

@optspecs = ('scalar=s', 'array=s', 'hash=s', 'code=s', '<>'); # basically, to record the ordering
%optspecs = (
    'scalar=s' => {
        name => 'scalar',
        linkage => sub { $linkage_accessed{'scalar=s'}++; $scalar = $_[1] },
        required => 0,
    },
    'array=s' => {
        name => 'array',
        linkage => sub { $linkage_accessed{'array=s'}++; push @array, $_[1] },
        required => 0,
    },
    'hash=s' => {
        name => 'hash',
        linkage => sub { $linkage_accessed{'hash=s'}++; $_[1] =~ /(.+?)=(.*)/; $hash{$1} = $2 },
        required => 0,
    },
    'code=s' => {
        name => 'code',
        linkage => sub { $linkage_accessed{'code=s'}++; &code(@_) },
        required => 0,
    },
    '<>' => { 
        # no 'name' indicate arguments
        linkage => sub { $linkage_accessed{'<>'}++; &process_args(@_) },
        required => 0,
    },
);

Example 2:

%opts = (
    'array' => \@array,
);
GetOptions(
    \%opts,
    'scalar=s',
    'array=s' => \@array,
);

This is hash storage mode. When a linkage is not specified explicitly, we construct a linkage for it to assign the value to the %opts hash.

@optspecs = ('scalar=s', 'array=s');
%optspecs = (
    'scalar=s' => {
        name => 'scalar',
        linkage => sub { $linkage_accessed{'scalar=s'}++; $opts{scalar} = $_[1] },
        required => 0,
    },
    'array=s' => {
        name => 'array',
        linkage => sub { $linkage_accessed{'array=s'}++; push @{$opts{array}}, $_[1] },
        required => 0,
    },
);

Example 3:

GetOptions(
    \%opts,
    '<>',
);

We die because a linkage is always required for '<>'.

Example 4:

GetOptions(
    'scalar=s',
);

This is classic mode. When a linkage is not specified explicitly, we construct a linkage for it to assign the value to the caller's $opt_NAME package variable.

@optspecs = ('scalar=s');
%optspecs = (
    'scalar=s' => {
        name => 'scalar',
        linkage => sub { $linkage_accessed{'scalar=s'}++; ${"$caller\::opt_scalar"} = $_[1] },
        required => 0,
    },
);

Examples 5 and later uses optspec(). Example 5:

GetOptions(
    'scalar=s' => optspec(handler => \$scalar, summary => "...", description => "...", completion => sub { ... }),
    'array=s' => optspec(handler => \@array, required=>1),
    'hash=s' => optspec(handler => \%hash, default=>"key=val"),
    'code=s' => optspec(handler => sub { ... }, default=>"x"),
);

This is the case where all handler (linkage) is specified in the optspec object. $orig_handler refers to the original handler.

@optspecs = ('scalar=s', 'array=s', 'hash=s', 'code=s');
%optspecs = (
    'scalar=s' => {
        name => 'scalar',
        linkage => sub { $linkage_accessed{'scalar=s'}++; ${orig_handler} = $_[1] },
        summary => "...",
        description => "...",
        completion => sub { ... },
        required => 0,
    },
    'array=s' => {
        name => 'array',
        linkage => sub { $linkage_accessed{'array=s'}++; push @{$orig_handler}, $_[1] },
        required => 1,
    },
    'hash=s' => {
        name => 'hash',
        linkage => sub { $linkage_accessed{'hash=s'}++; $_[1] =~ /(.+?)=(.*)/; {$orig_handler}{$1} = $2 },
        required => 0,
        default => "key=val",
    },
    'code=s' => {
        name => 'code',
        linkage => sub { $linkage_accessed{'code=s'}++; &code(@_) },
        required => 0,
        default => "x",
    },
);

Pass to GoL's GetOptionsFromArray(), simply using:

GetOptionsFromArray($ary, map { $optspecs{$_}{linkage} } @optspecs);
Assign default values. This is accomplished simply by:

for (@optspecs) {

setting an undef to 'default' property doesn't make much sense, so we use defined() here
```
if (defined $optspec{$_}{default} && !$linkage_accessed{$_}) {
    $optspec{$_}{linkage}->( $optspec{$_}{default} );
    $linkage_accessed{$_}++;
}
```
}
Check that required options have been specified by user. This is accomplished simply by:

for (@optspecs) { if (!$linkageaccessed{$}) { die (defined $optspec{$}{name} ? "Option $optspec{$}{name}" : "Arguments")." required but not supplied"; } }

Compared to the old code, the new code:

reimplement GoL's setting of caller's $opt_NAME in classic mode (con? but this code is simple and won't likely to change)
reimplement GoL's setting of $opts{NAME} in hash-storage mode (con? but this code is simple and won't likely to change)
no longer keeps separate @go_opts_spec and @opts_spec (pro)
can make completion work with the current Complete::Getopt::Long (CGL), since CGL doesn't handle hash-storage mode nor implicit linkage yet (pro)
simplify program flow, because we "canonicalize" everything and reduce the number of conditionals in the program (pro)

perlancar commented 5 years ago

I'm sure I'm missing a thing or two.

tabulon commented 5 years ago

Trying to understand, focusing on behaviour at this stage:

Is there an intention to require a "handler" to be passed to optspec() (i.e. no DWIM), whenever GLM needs to do something with it later on (e.g. for processing "default" and "required") ?

This is somewhat suggested by the overall text, as well as examples 5, 6, 7.

Otherwise, why would GLM still die in 6 & 7 ; especially now that it would it be "armed" with all that info and trickery (wrapping subs, name, $caller\::*, ...)

How about the same question in hash-storage mode? ( recent comments and TODO test-cases would suggest DWIM when handler missing, but we don't see any trace of that here )

Maybe I am missing something?

perlancar commented 5 years ago

Otherwise, why would GLM still die in 6 & 7 ; especially now that it would it be "armed" with all that info and trickery (wrapping subs, name, $caller\::*, ...)

The examples were wrong. Examples 6 and 7 shouldn't die, I've removed them. I forgot about "classic mode". Thus the only case we die (i.e. we require a linkage to be explicitly provided) would be in the case of '<>'. This makes it even simpler.

Sorry to confuse you. This was the result of my own confusion :-p

tabulon commented 5 years ago

Ok, sigh :-) It's completely understable... GoL is a twenty year old thingy with tons of accumulated stuff and quite scarce docs for the part it ha[sd] been playing...

tabulon commented 5 years ago

Still focusing on behaviour, I am not sure if I agree with the following remark at step (4) :

 # setting an undef to 'default' property doesn't make much sense, so we use defined() here

But that's an off-topic subject, that can be discussed elsewhere.

tabulon commented 5 years ago

Coming back to the main topic,

With all that would be happening, detecting the case for a true by-pass (step 1) seems to be quite hard to achieve.

It's probably not very desirable, either. Will expand on this later.

perlancar commented 5 years ago

Coming back to the main topic,

With all that would be happening, detecting the case for a true by-pass (step 1) seems to be quite hard to achieve.

Wouldn't something like this suffice?

sub GetOptionsFromArray {
    my $ary = shift;
    unless (grep { ref eq 'Getopt::Long::More::OptSpec' } @_) { 
        return Getopt::Long::GetOptionsFromArray($ary, @_;
    }

perlancar commented 5 years ago

Note: Getopt::Long allows an option to receive multiple values at once (e.g. 'foo=s{1,3}'), but that shouldn't be an issue because the coderef linkage will be called for each value.

perlancar commented 5 years ago

Side note: Wrapping all linkages will also help in implementing #12.

tabulon commented 5 years ago

OK, added a few more stuff; in the same vein.

BTW, I had some more in store, but somehow can't submit those at this moment, probably due to a glitch in github. Will retry submitting those later...

In a nutshell, with those unsubmitted stuff, we should be able to close #16 and even go a little beyond.

A side question, apart from GLD, we probably also need to visit some other modules at some point to see what other features GLM can borrow/steal; without necessarily targeting full "Feature parity" with all of those.

For example, MooX::Options has a few interesting features (like autosplit and autorange ) that are not specific to M[oose] or even OO, and could simply be useful in any Getopt::* context.

Support for JSON syntax is yet such an example, with current support given by MooX::Options as well as Perinci.

What say you (in terms of the general approach)?

...

tabulon commented 5 years ago

Before I dive back into the main topic of this thread, I have a couple suggestions:

Suggestion 1) Terminology

Could we please adopt the general term "destination" (or "dest" for shorthand), gradually ditching both "handler" and "linkage"?

I know I am to blame on this, at least partially in as much as the term "linkage" is concerned.

True, GoL source code is full of references to the term "linkage" (both in comments and names of things).

And I could have sworn that GoL documentation employed the same -imho- ugly term. (Perhaps it did at one point, I don't know.)

In any case, currently, GoL documentation just talks about "destination" to refer to the same thing, which is much more readily understood, imho.

The term "dest" / "destination" has the advantage of having been around at least since the 1970s; with the same semi-ambigous (but well established) conventional meaning of "where stuff ends up".

Apart from the problem of the divergence (from GoL), the term "handler" is probably not that suitable in any case, since it is usually understood to denote CODE.

Regardless, in this scenario, the "handler" parameter could easily remain as a non-advertised and deprecated (eternal?) synonym of "dest" / "destination".

tabulon commented 5 years ago

Suggestion 2) Timelines

2.1)

Could we postpone final verdics about refactoring & design for some time (possibly several months), as we create and discuss a few "more stuff"; in other words while we come up with a mini "roadmap".

I am not suggesting to suspend the design discussion, but rather to pace it together with the the other (more behavioural) one.

2.2)

Meanwhile, we could perhaps quicky prepare a mini-release that merely includes:

stuff already merged recently
possibly, a quick & dirty solution (at least for the hash-storage case) concerning the "handler should not be needed -> DWIM" issue, as reflected by the TODO tests recently added.
possibly suggetsion 1 above (if we agree)

Frankly, the above suggested pace/timeliness would also work better for my own schedule; and probably yours, too.

But even that weren't the case, it just looks like the ritght way to go for GLM, imho.

What say you, @perlancar ?

tabulon commented 5 years ago

Coming back to the main topic:

Wouldn't something like this suffice?

sub GetOptionsFromArray {
my $ary = shift;
unless (grep { ref eq 'Getopt::Long::More::OptSpec' } @_) { 
return Getopt::Long::GetOptionsFromArray($ary, @_;
}

When we take your proposed text literally, yes it certainly would. I know realize that I had misunderstood the initial meaning.

HOWEVER, that's probably only because, reading along, I was implicitely trying to "correct it" in my mind, (please excuse the expression), assuming we still be trying to cater to at least one of the points raised further down (if not both), as explained below.

Now, assuming GLM literally shorcircuits to calling GoL in the absence of OptSpec objects, yes, many GoL non-regression tests would pass without a hitch.

BUT, is that really Joe would want/expect ?
(please see below for more on Joe and his veretan script nautilus.pl)

The case of Joe & `nautilus.pl`

Joe maintains a veteran perl script from the early milenium, that he probably didn't even write. Joe has heard that his script (nautilus.pl) can get an immediate facelift by just typing "::More", and then even more, if he plays along.

We see two problems emerging for Joe, who has just typed the 6 characters (::More)

1) No reward in terms of GLM "appetizers" (HelpText, completion, ...), for the "sweat" he has just put in; 2) Joe's eventual "sigh", telling himself "at least it still works as before" after the switch "::More" is in reality based on a very false assumption; because things could still go bad when he adds the first optspec()in there, and that could be happening a many months later....

Though it's still quite important for adoption, point n°1 above is still something Joe could eventually live with.

Point n°2, on the other hand, probably carries much more significant ramifications...

That's why it's important, imho, to converge on a single code-path.

In a way this is analogous to the fail (i.e. die) early approach within code.

Only this time, a human agent (Joe) is involved in the process and the time-gap involved could easily be weeks, months, or even years; instead of seconds or minutes of a program execution....

perlancar commented 5 years ago

A side question, apart from GLD, we probably also need to visit some other modules at some point to see what other features GLM can borrow/steal; without necessarily targeting full "Feature parity" with all of those.

For example, MooX::Options has a few interesting features (like autosplit and autorange ) that are not specific to M[oose] or even OO, and could simply be useful in any Getopt::* context.

Support for JSON syntax is yet such an example, with current support given by MooX::Options as well as Perinci.

What say you (in terms of the general approach)?

Wholeheartedly agree, GLD is not (never is) the final target to achieve, it simply has a nice feature set. Although as a reminder the feature creep is always a threat :-)

perlancar commented 5 years ago

Could we please adopt the general term "destination" (or "dest" for shorthand), gradually ditching both "handler" and "linkage"?

Yes we can, I prefer "destination" over "dest" for the property name since we currently don't use abbreviation for the other property names (required instead of req, description instead of desc). The way to go with this I think is to first accept both "destination" as well as "handler", but to remove all mention of the "handler" property (as well as term) from the documentation. And later set a date (e.g. 1 year) before removing the support for "handler".

Another thing that can use some additional clearing up is the term "option specification". It currently refers both to things like "foo=s", as well as the optspec() object. I'm not deciding anything on this right now.

A glossary at the beginning of the documentation is not a bad idea either.

perlancar commented 5 years ago

Suggestion 2) Timelines

2.1)

Could we postpone final verdics about refactoring & design for some time (possibly several months), as we create and discuss a few "more stuff"; in other words while we come up with a mini "roadmap".

I am not suggesting to suspend the design discussion, but rather to pace it together with the the other (more behavioural) one.

2.2)

Meanwhile, we could perhaps quicky prepare a mini-release that merely includes:
* stuff already merged recently

* possibly, a quick & dirty solution (at least for the `hash-storage` case) concerning the "handler should not be needed -> DWIM" issue, as reflected by the TODO tests recently added.

* possibly suggetsion 1 above (if we agree)
Frankly, the above suggested pace/timeliness would also work better for my own schedule; and probably yours, too.

But even that weren't the case, it just looks like the ritght way to go for GLM, imho.

What say you, @perlancar ?

Fine with this. BTW, do you happen to read the email I sent you @tabulon ?

perlancar commented 5 years ago

We see two problems emerging for Joe, who has just typed the 6 characters (::More)
No reward in terms of GLM "appetizers" (HelpText, completion, ...), for the "sweat" he has just put in;
Joe's eventual "sigh", telling himself "at least it still works as before" after the switch "::More" is in reality based on a very false assumption; because things could still go bad when he adds the first optspec()in there, and that could be happening a many months later....

Though it's still quite important for adoption, point n°1 above is still something Joe could eventually live with.

Ah of course, I forgot entirely about completion. Scrap the "fallback early to GoL in absence of optspec()" then.

tabulon commented 5 years ago

Although as a reminder the feature creep is always a threat :-)

So true... And such an easy trap to fall into...

At one point, once the no-brainers are out of the way, I hope we'll switch to a more critical nay-sayer mode in this respect.

And hopefully, at that point, we would also have a simple set of conventions and libraries upon which people simple can build upon when a feature seems to be "missing"; instead of having to reinvent the wheel (and all the plumbing that goes with it) when all they needed (and probably wanted) was to add snow-chains.

tabulon commented 5 years ago

... I prefer "destination" over "dest" for the property name since we currently don't use abbreviation for the other property names (required instead of req, description instead of desc).

Agreed;

That said, I would ultimately prefer to also have an abbreviated synonym for most, if not all, of these attributes for the usual reasons (like less typing, page width, ...).

But I am not hung up on it; it's just a little nice to have.

And right now, it's best to remain coherent with the already existing stuff.

So, "destination", it is.

tabulon commented 5 years ago

Another thing that can use some additional clearing up is the term "option specification". It currently refers both to things like "foo=s", as well as the optspec() object. I'm not deciding anything on this right now.

Completely agree on the need for that clarification. Let's sit on it for while...

A glossary at the beginning of the documentation is not a bad idea either.

Yupp, that would be very useful. It will probably end up at the end of the document, though; as glosseries tend to grow... We can always have few very important items defined at the top; and then a link to the complete glossary.

tabulon commented 5 years ago

Fine with this. BTW, do you happen to read the email I sent you @tabulon ?

No, I haven't. Perhaps it fell into spam. Trying to find it.

tabulon commented 5 years ago

Just found it. It had indeed wend into the spam folder. Will respond over email.

tabulon commented 5 years ago

Going thru this thread again, I just realized that, so far, I have only raised my concerns without mentioning the things I like about this proposal. Here we go:

Some potential nitpicks aside, I really like the dual data structures themselves, as well as the fact that they do not actually contain the reallly dirty state (i.e. %linkage_accessed).
I like the resulting simplification of code; especially concerning the "validation" stage, which could really benefit from such a simplification, imho.
I like the extra capabilities brought/promised, such as those that would enable more validation constraints (à la #12); but I am not sure the proposed technique (systematic wrapping) is the way to achieve those (please see separate comment below).

tabulon commented 5 years ago

In particular, I am not convinced that wrapping is really needed in all cases (CODE destinations apart), and more importantly, that it would be sufficient for implementing some of the more sophisticated validation constraints later on (patterns, ranges, closed lists, ....)

In fact, most validation constraints would need to access the final effective value of a given option; regardless of how that value came to be, which could be the result of any of the pathways below:

a) the option value might be given on the command line, which would resullt in $linkage_accessed->{...} being true
b) Or, it might come from a default parameter, which can artificially be made (like in the proposal) to cause $linkage_accessed->{...} to evaluate to true
c) Or, it might have been given as a default in the spec (for integer values), like in "baz=4"
d) Or otherwise, it might have simply been sitting there (at the destination) even before the call to Getoptions*()

It appears that the last two possibilities (c & d) mentioned above are simply being overlooked by the proposed solution. It's true that, if GLM would now be parsing the spec, (c) could be assimilated into (b).

However, (d) could not easily be handled with this "wrapper" approach, even if we were willing to capture the actual value passed into each individual wrapper (in addition to just flagging the call via %linkage_accessed ).

By the way, I don't think we can easily sweep (d) away, by just considering it to be a rare case, either. It is (and has been) a perfectly valid (and perlish) way of setting a default value for an option, like so:


use Getopt::Long;

our %opts = (
  foo => "boo",
  bar => 3,
);

Getoptions(\%opts, 'foo=s', 'bar=i', "baz=s");

say "foo: '$opts{foo}'";   
# foo : 'boo'              # provided that '--foo' was not given on the command line.... :-)

BTW, the above consideration is not at all specfic to hash-storage mode. It would be the same story in classic mode as well.

tabulon commented 5 years ago

Naturally, the above consideration brings about a more abstract, yet still interesting question with quite practical imlplications, regarding the semantics of required and default...

What does it mean for an option to be required?

Does it mean :

a) that the option MUST absolutely appear on the command line?
b) or that it MUST just end up with a defined value at the end of the day? (however that value happened to have come about).

Both definitions above might have some merits. And perhaps it might even be worth supporting both features at some point (with different names, of course).

But which one would be more readily undrestood by the mere word required?

In fact, as expected, similar considerations would also seem to apply to the implementation of the default parameter, which would culminate in the choice of the operator to be used when assigning a default upon detecting that an option has not been passed on the command line:

a) =
b) //=

perlancar commented 5 years ago

Naturally, the above consideration brings about a more abstract, yet still interesting question with quite practical imlplications, regarding the semantics of required and default...

What does it mean for an option to be required?

Does it mean :
* **a)** that the option MUST absolutely appear on the command line?

* **b)** or that it MUST just end up with a **`defined` value** at the end of the day? (however that value happened to have come about).
Both definitions above might have some merits. And perhaps it might even be worth supporting both features at some point (with different names, of course).

But which one would be more readily undrestood by the mere word required?

In fact, as expected, similar considerations would also seem to apply to the implementation of the default parameter, which would culminate in the choice of the operator to be used when assigning a default upon detecting that an option has not been passed on the command line:
* **a)** `=`

* **b)** `//=`

Yes, required means the option has to be specified on the command-line. It is okay if the option ends up getting an undef value (e.g. if the destination is a coderef and the coderef assigns an undef value or does not assign anything anywhere, i.e. just: sub { }).

If one wants to make sure that the option value is defined, then this is the job of the validation or type system. The validator can alternatively also make sure that the option value is undef, if one so wishes.

To compare with Perinci::CmdLine, a required argument:

args => { foo => {schema=>"str", req=>1} }

means that user needs to specify --foo, or supply it via configuration file. Specifying --foo-json null or foo = !json null in the config file is valid. However if the schema specifies that the value must be defined:

args => { foo => {schema=>"str*", req=>1} }

then an undef value will be rejected.

tabulon commented 4 years ago

Have been doing some thinking about this question of refactoring.

The main themes appear to be:

Design improvements (for ease of implementing more stuff)
Code clean-up (for ease of maintenance and new development)

The above doesn't necessarily reflect scheduling order, but rather the order in which I will jot down some notes.

tabulon commented 4 years ago

1. DESIGN IMPROVEMENTS

Naturally, we are not talking about actually implementing any features here, but laying the needed ground work for those.

Below are the things considered so far (with regard to refactoring and design improvements):

Consider:

simplifying the flow (#11)
revisiting data structures (#11, #7)
systematic wrapping of handlers (#11, #7)
starting to support rudimentary forms of cmdspec (#18)
parsing the gol-spec (at least enough for extracting the canonical name)

Consider making provisions for :

subcommands (#27) and --args-from (#31) => prepocessing of @ARGV
validation and possibly interdependence (#12, #13)
hidden and decorative option specifiers (#14, #15

... all the while keeping the larger picture in mind.

tabulon commented 4 years ago

2. CODE CLEAN UP

The primary aim here is easing code readibility and maintainibility.

GLM

GetoptsFromArray:

Consider:

breaking apart various blocks (mapping, completion, validation, etc..)
rewriting the bulk of the code as a wrapper

sub GetoptsFromArray {
   my $cfg = { caller_pkg => ...,  }
   _wrap_GetoptsFromArray(&Getopt::Long::GetoptsFromArray, $cfg, @_)
}

sub _wrap_GetoptsFromArray {
  my $orig = shift;
  my $cfg  = shift;
  ...
  $orig-> (...)
  ...
}

GOL

Later on, I will also jot down some notes on GOL itself and its painpoints in terms of extensibility -- as the slightest modification of behaviour requires copying in tons of code.

I do have some ideas on how to overcome those; but it can wait.

tabulon commented 4 years ago

An Idea

I am thinking of handling both of the following together and in a unified fashion:

revisiting data structures (#11, #7)
starting to support rudimentary forms of cmdspec (#18)

As a reminder, just like optspec, cmdspec will just be a shortcut for creating a new lightweight object.

So why not have that object do some of the work (especially in terms of mapping / transforming opt specs)?

I will come back with a more concrete proposal and some example code later on.

perlancar / perl-Getopt-Long-More