Reporting indexes - Githubissues

bakkot commented 2 years ago

This came up here, but I wanted to split it out to track it explicitly.

I think that reporting the indexes of arguments within args would be a helpful addition, because it lets you build a bunch of more complicated logic on top of parseArgs without needing to do a full re-parse.

For example, this would make it straightforward to:

resolve conflicts between --foo and --no-foo as last-wins
throw on duplicate flags (check if the flag occurs at multiple indexes)
if someday defaults are added, check to see if an option was provided by a user or came from the default (https://github.com/pkgjs/parseargs/issues/70#issuecomment-1071056359)
enforce that value-taking arguments are all specified using --foo=bar rather than --foo bar (check each to see if args[index].includes('='))
enforce that two flags are specified in a specific order (--enable-experimental-options before --some-unstable-option, e.g.)

ljharb commented 2 years ago

enforce that value-taking arguments are all specified using --foo=bar rather than --foo bar (check each to see if args[index].includes('='))

i am very interested in enforcing this!

shadowspawn commented 2 years ago

Playing with this idea (and https://github.com/pkgjs/parseargs/issues/70#issuecomment-1073368566):

  parseDetails: {
    options: {
      foo: { argIndex: 1, argIndices: [1] },
    },
    positionals: [
       { argIndex: 3 }
    ]
    }

shadowspawn commented 2 years ago

I was thinking about how to implement and use this, and a couple of complications:

the returned index is into an array of args that might have been determined by parseArgs internally. The "parseDetails" could include a property for (say) "originalArgs".
short option groups mean multiple options could have the same index (which affects two ordering use cases of index: enforcing order of options, and last overlapping option wins)

bakkot commented 2 years ago

might have been determined by parseArgs internally

It's either the author-provided thing or process.mainArgs, right? As long as it's no more complicated than that, it seems like the script author is unlikely to have trouble figuring out the appropriate array to index. Though I also don't see a problem with returning originalArgs.

short option groups mean multiple options could have the same index (which affects two ordering use cases of index: enforcing order of options and last overlapping option wins)

Hm, yeah, the author would need to write slightly more complicated logic to handle these cases. It's especially annoying with the awful -xyzSTRING style, because then you have to figure out where the "flags" part stops and the "string value" part starts, and can't just use lastIndexOf.

If we want to make this case easy, in the case of short options there could be an additional shortOptionsGroupIndex property or something, which is an index within that string. So { index: 3, shortOptionsGroupIndex: 2 } would correspond to something like the f in --a --b --c -xf.

(I'm not sure if the shortOptionsGroupIndex should count from the - or the character after it, hm. Probably from the -? For the main use cases, of comparing locations, it doesn't end up mattering.)

shadowspawn commented 2 years ago

🔨 Moving towards MVP Milestone 1 (#87)

I propose post-MVP.

shadowspawn commented 2 years ago

Rather than returning a structure indexed by parsed options, I am thinking about returning an array for each thing the parser discovers. e.g. option, option-terminator, positional. The option elements might have information like whether the option was specified with a short or long name, whether the option values was embedded or in a separate argument, whether the option was auto-discovered or a known option.

@bcoe and @bakkot have previously mentioned AST (Abstract Syntax Tree) as concept. I don't have experience with AST so coming at it from first principles and CLI experience.

@darcyclarke is trying something similar in minargs: https://github.com/darcyclarke/minargs/commit/f7f4bd4310d216615528a128563d918997866e07

bakkot commented 2 years ago

I think it'll be hard to evaluate the difference between that (index -> option) and the originally proposed inverse of that (option -> indices) without writing them out in a little more detail and seeing what different use cases look like with each style.

shadowspawn commented 2 years ago

What got me thinking in the array direction was disliking needing two indices to tell the order of short options from a short option group. I am imagining a short option group would generate multiple (flat) elements in the array, and not a tree.

I can imagine the current parseArgs results being generated from the AST-ish array with a simple loop calling storeOption on each option, adding each positional to positionals. Eating our own dog food.

But only in my head. Not written any details. :-)

shadowspawn commented 2 years ago

Playing with a prototype. I'm not trying to make decisions, so feel free to ignore until after the upstream PR lands!

% node index.js -ab --foo=bar -- --pos 
{
  originalArgs: [ '-ab', '--foo=bar', '--', '--pos' ],
  ast: [
    {
      symbol: 'option',
      optionName: 'a',
      short: 'a',
      argIndex: 0,
      value: undefined,
      valueIndex: undefined
    },
    {
      symbol: 'option',
      optionName: 'b',
      short: 'b',
      argIndex: 0,
      value: undefined,
      valueIndex: undefined
    },
    {
      symbol: 'option',
      optionName: 'foo',
      short: null,
      argIndex: 1,
      value: 'bar',
      valueIndex: 1
    },
    { symbol: 'option-terminator', argIndex: 2 },
    { symbol: 'positional', value: '--pos', argIndex: 3 }
  ]
}

bakkot commented 2 years ago

Some thoughts on that prototype:

When aliases are defined, I think that the optionName should be the long option. That is, if you have foo: { alias: 'f' }, then -f should be represented as { optionName: 'foo' }. This is so a script author can do ast.findLastIndex(x => x.symbol === 'option' && x.optionName === 'foo'), without having to worry about handling aliases explicitly.
Why make short be the character itself instead of a boolean? You can look up the character from the config, if you need to.
For distinguishing --foo=bar from --foo bar (assuming a type: 'string' option), it may make more sense to have a boolean along the lines of inlineValue, rather than having a valueIndex which is always either the current index or the next index.
From your example it's not clear if --foo bar (assuming a type: 'string' option) would end up as two elements, or just one where valueIndex is different from argIndex. I'd hope just a single element.
The convention is to name the field which identifies the type of thing as type, not symbol.
Maybe name the field elements or parseElements? "AST" is kind of technical term, not something most users would know.

And here's what the use cases in the OP would look like, given those tweaks:

resolve conflicts between --foo and --no-foo as last-wins

let { values, positionals, elements } = parseArgs(config);
let getLastIndex = name => elements.findLastIndex(x => x.type === 'option' && x.optionName === name);
if (values.foo && values['no-foo']) {
values.foo = getLastIndex('foo') > getLastIndex('no-foo');
}

throw on duplicate flags

let { values, elements } = parseArgs(config);
let dupes = Object.keys(value)
.map(name => elements.filter(e => e.type === 'option' && e.optionName === name))
.filter(e => e.length > 1);
if (dupes.length > 0) {
throw new Error(`duplicate flag${dupes.length > 1 ? 's' : ''} ${dupes.join(', ')}`);
}

enforce that value-taking arguments are all specified using --foo=bar rather than --foo bar

let { values, elements } = parseArgs(config);
for (let element of elements) {
if (element.type === 'option' && !element.short && element.valueIndex != null && element.valueIndex > element.argIndex) {
throw new Error(`argument ${element.optionName} must specify its argument as \`--foo=bar\`, not \`--foo bar\``);
}
}

enforce that two flags are specified in a specific order

let { values, elements } = parseArgs(config);
let getFirstIndex = name => elements.findIndex(x => x.type === 'option' && x.optionName === name);
let getLastIndex = name => elements.findLastIndex(x => x.type === 'option' && x.optionName === name);
if (values.first && value.second && getFirstIndex('first') > getLastIndex('second')) {
throw new Error(`"first" must be specified before "second"`);
}

These are all ok, I guess?

I notice a lot of type === 'option' filtering, but maybe that's unavoidable. (Actually it's not necessary in these examples, because optionName is only present on type: 'option' elements, but it feels kind of gross not to have that guard.)

Also I'm a bit worried that people are going to conflate "index in this array" with "index in the original array", which works for positional --foo --bar but fails when there's short options or options with values. Again, maybe that's unavoidable, though.

shadowspawn commented 2 years ago

When aliases are defined, I think that the optionName should be the long option.

Yes. It is the thing to use for lookups into the supplied config and returned values.

Why make short be the character itself instead of a boolean?

I was thinking about supplying enough information to display an error message without looking up option in config. But a bit subtle to use, can look it up as you say, and I think a boolean might be clearer.

For distinguishing --foo=bar from --foo bar (assuming a type: 'string' option), it may make more sense to have a boolean along the lines of inlineValue, rather than having a valueIndex which is always either the current index or the next index.

I actually started with that but tried the index, and it wasn't compelling. I agree, inlineValue would be better.

From your example it's not clear if --foo bar (assuming a type: 'string' option) would end up as two elements, or just one where valueIndex is different from argIndex. I'd hope just a single element.

Yes, single.

The convention is to name the field which identifies the type of thing as type, not symbol.

I avoided "type" in first instance because we use that in input configuration. Would "elementType" be ok? (Per next suggestion.)

Maybe name the field elements or parseElements? "AST" is kind of technical term, not something most users would know.

Yes. (I was over-reaching with AST! 😊 )

bakkot commented 2 years ago

I was thinking about supplying enough information to display an error message without looking up option in config

Yeah, I don't think "you don't have to look at the config" should be a goal. It's necessarily on hand, so it's not going to be hard to get to.

I avoided "type" in first instance because we use that in input configuration. Would "elementType" be ok? (Per next suggestion.)

Ehhhhh that's gonna be painful to write out every time. kind is the name I generally use if type doesn't work. This annoys the type theorists, but that's ok.

shadowspawn commented 2 years ago

Updated prototype per suggestions:

% node index.js -ab --foo=bar -- --pos 
{
  values: [Object: null prototype] { a: true, b: true, foo: 'bar' },
  positionals: [ '--pos' ],
  originalArgs: [ '-ab', '--foo=bar', '--', '--pos' ],
  parseElements: [
    {
      kind: 'option',
      optionName: 'a',
      short: true,
      argIndex: 0,
      value: undefined,
      inlineValue: undefined
    },
    {
      kind: 'option',
      optionName: 'b',
      short: true,
      argIndex: 0,
      value: undefined,
      inlineValue: undefined
    },
    {
      kind: 'option',
      optionName: 'foo',
      short: false,
      argIndex: 1,
      value: 'bar',
      inlineValue: true
    },
    { kind: 'option-terminator', argIndex: 2 },
    { kind: 'positional', value: '--pos', argIndex: 3 }
  ]
}

bakkot commented 2 years ago

@shadowspawn Now that valueIndex has become inlineValue, I think it would make sense to replace argIndex with just index. Otherwise, that's looking pretty good, I think.

bakkot commented 2 years ago

I'm still a little worried about the potential for people assuming that elements and args are 1-to-1, which is not true in the above design. So here's a possible alternative which occurred to me, where we maintain a closer relationship between the two lists. It seems like it would be a bit harder to use than the design above, so this is mostly presented for the sake of exploring the design space.

Assuming config is

{
  'ex': { type: 'boolean', short: 'x' },
  'why': { type: 'boolean', short: 'x' },
  'see': { type: 'string', short: 'C' },
  'foo': { type: 'string' },
}

then running node index.js -xy -C4 - xC5 -xC 6 --ex --foo=bar --foo bar -- pos would give

elements: [
  { kind: 'short', options: ['ex', 'why'], value: undefined, inlineValue: undefined },
  { kind: 'short', options: ['see'], value: '4', inlineValue: true },
  // in short option groups, `value` if defined is for the last element of `options`:
  { kind: 'short', options: ['ex', 'see'], value: '5', inlineValue: true },
  { kind: 'short', options: ['ex', 'see'], value: '6', inlineValue: false },
  // the value for non-inline values is present both under the `value` key in the option's element and as its own element
  { kind: 'value', value: '6' },
  { kind: 'long', options: ['ex'], value: undefined, inlineValue: undefined },
  { kind: 'long', options: ['foo'], value: 'bar', inlineValue: true },
  { kind: 'long', options: ['foo'], value: 'bar', inlineValue: false },
  { kind: 'value', value: 'bar' },
  { kind: 'option-terminator' },
  { kind: 'positional', value: '--pos' },
]

Note that this preserves elements.length === args.length.

And just for example, here's what "resolve conflicts between --foo and --no-foo as last-wins" looks like with this design:

let { values, positionals, elements } = parseArgs(config);
let optElements = elements.flatMap(e => e.options ?? []);
if (values.foo && values['no-foo']) {
  values.foo = optElements.lastIndexOf('foo') > optElements.lastIndexOf('no-foo');
}

... ok actually that's not as bad as I thought it was going to be. flatMap is great but I guess this would be a lot trickier if you're not comfortable with flatMap.

shadowspawn commented 2 years ago

Now that valueIndex has become inlineValue, I think it would make sense to replace argIndex with just index.

I still prefer argIndex for a bit more context about what it is? I know you like short names, but none of your earlier examples actually needed it (after switch to inlineValue) so not heavily used.

I'm still a little worried about the potential for people assuming that elements and args are 1-to-1, which is not true in the above design.

Hmm, I'm not expecting this to be a problem. I'm hopeful either confusion won't occur, or it won't come up in use as working with just elements and not referencing back to args.

So here's a possible alternative which occurred to me, where we maintain a closer relationship between the two lists. It seems like it would be a bit harder to use than the design above, so this is mostly presented for the sake of exploring the design space.

I agree the 1:1 is a bit harder to use. (Thanks for working an example.)

Edit: albeit flatMap made it similarly concise for that example, but I have never used flatMap!

shadowspawn commented 2 years ago

A minor musing on name of returned originalArgs. I prefer to avoid overlap of property names on input configuration and output result, but args might be an exception since actually same thing. If you supply args with configuration, then you don't need the returned args, so less potential for name clash of variables?

const { values, positionals, elements } = parseArgs({ args, strict });
// or
const { values, positionals, elements, args } = parseArgs({ strict });
// in either case, args is what got parsed

bakkot commented 2 years ago

I still prefer argIndex for a bit more context about what it is?

I don't think the extra context is useful, personally. There's only one thing it could really be. And odds are you're going to find out about it by reading existing code or documentation in any case, which should make it very clear.

Still, I don't feel super strongly. It just feels redundant to me.

Hmm, I'm not expecting this to be a problem.

Mostly I mention it because I made exactly that mistake when writing up this snippet despite having already known they were different. It's just very easy to think "i want the index" and reach for .findIndex() or .indexOf() rather than .find().argIndex.

I prefer to avoid overlap of property names on input configuration and output result, but args might be an exception since actually same thing.

Yeah, if we're going to return args at all then it is fine to name it args. (Though I really really do not think is necessary - most functions do not return their input and there is nothing in particular about this function which should make us deviate from that. The only reason to do so is that the default value for args isn't readily available, but that should be fixed by adding process.mainArgs, not by returning the input.)

shadowspawn commented 2 years ago

(1:1) Mostly I mention it because I made exactly that mistake

Oh, your experience trumps my speculation! Something to be aware of with naming and documentation.

(args) Though I really really do not think is necessary ... but that should be fixed by adding process.mainArgs

All fair points. I am not aware of any activity on mainArgs. (There was a brief PR, albeit less sophisticated than our placeholder: https://github.com/nodejs/node/pull/29990/files)

bakkot commented 2 years ago

@shadowspawn Having thought about it some more, I think the usability benefits of having at most one option per element probably outweighs the benefits of having the indices line up, so probably we should stick with your original design. People will definitely make the same mistake I did, but I don't see a way to avoid that without taking too large of a hit to usability.

Do you want to make a PR? I can put one together at some point if not.

shadowspawn commented 2 years ago

I would like to make a PR. I can open a draft now if anyone would like code to run against for experimenting. Functional, but some distance from final form (Refactor, primordials, et al.)

pkgjs / parseargs

Reporting indexes #84