uutils / coreutils

Cross-platform Rust rewrite of the GNU coreutils
https://uutils.github.io/
MIT License
17.24k stars 1.24k forks source link

New argument parser #4254

Open tertsdiepraam opened 1 year ago

tertsdiepraam commented 1 year ago

I talked about this a bit on Discord, but I think it's time to actually start talking about a new library I've been working on. So here goes: introducing the very creatively named uutils-args, a derive-based argument parsing library.

Why a custom argument parser

For a while now, we've been using clap as our argument parser, but over time I've been become increasingly frustrated with it. Not because it's a bad library, but because it doesn't fit the very specific needs of the coreutils and we constantly have to work around clap with a lot of verbose code. I looked into basically every Rust argument parser out there and none really fit our needs. So about a month ago, I started working on a parser based on lexopt specifically for uutils.

The whole design is too complex to explain here, so I wrote some documents in the repo. It has three subpages:

The library also integrates very nicely with some of the plans I had earlier for the --help pages (https://github.com/uutils/coreutils/issues/3181, https://github.com/uutils/coreutils/issues/2816), because the help text is read as markdown from a file in this library.

Current Status

The library supports many utils already, but a few important features are still missing. Some low-hanging fruit:

And more complicated:

There are also rough edges that can be smoothed out like improving the error messages and the output of --help.

It is also very much a product of me working on it alone, which means that most of the naming conventions make sense to me but maybe not to others. I would very much welcome feedback on any part of the API.

If you want to see the library in action, I suggest to take a look at some of the coreutils for which I have created tests (more to come in the future):

A full list is here: https://github.com/tertsdiepraam/uutils-args/tree/main/tests/coreutils. I found that most are relatively easy to port.

Downsides

Porting to this new library will have a few downsides.

  1. New contributors won't be familiar with this library.
  2. We won't have argument completion (until we implement ourselves in this library)
  3. We'll have to update uudoc.
  4. We don't get all the features and years of iterative improvement that clap has.

Proposed roadmap

Of course, I'm aware that this is gonna be a huge effort to port and we should carefully how (and even if) we want to do that. For now, I've come up with the following steps:

How to help out

For now, I need lots of feedback on this library, so I can get it ready, feel free to open an issue with whatever feedback you have or comment below. You can also help by picking up any of the open issues.

I'd be happy to answer any questions about this library :)

cc @sylvestre, @rivy, @jfinkels

[^1]: I think this is possible by having clap and this new library coexist in the repo for a while, though it will break completion and the online docs.

sylvestre commented 1 year ago

Interesting :) but I am a bit worried that we end up reinventing the wheel. Did you try to:

tertsdiepraam commented 1 year ago

but I am a bit worried that we end up reinventing the wheel.

I agree that this is something to watch out for. In my opinion the problems with clap have been stacking up too much and I think I found a solution that will simplify the code by a lot in some utilities. I'm trying to avoid reinventing the wheel as much as possible by using lexopt to deal with all the parsing complexity. We therefore get a lot of correct behaviour for free.

If you want to see how bad clap can get for our use case, you only have to look at this function in ls, which is 500 lines of workarounds for clap and was really complicated to get right:

https://github.com/uutils/coreutils/blob/c2e4da97da6361b39099a16e04088a7d41ad18b4/src/uu/ls/src/ls.rs#L431-L939

My assertion is that we can get rid of almost this entire function with the new library. I'll try to port ls as a test soon, so I can prove that assertion :). There are more utils with similar behaviour that we are not handling correctly right now.

  • ask clap developer if they would be willing to take some patches?

I have asked about a few things in the past:

However, most of the things I tried to fix here are more fundamental. clap seems to assume that every option maps to a single value, which is a great simplification if you have the luxury of defining new API's, but it doesn't work well for coreutils. This is too much of a difference that they won't be able to bridge. clap is also (rightfully) concerned about backwards compatibility, which will limit some changes.

The reason that I'm so confident that this works well, is that the generated code is very similar to GNU. In GNU, they have an iterator over arguments and change the configuration based on the encountered values. clap first collects all arguments and provides an API for the final values. This is fundamentally why we need to inspect indices and do complicated tricks with some arguments to match the behaviour. This new library does basically the same as GNU just with a derive API and is therefore a great fit.

I've listed most of the problems I found here: https://github.com/tertsdiepraam/uutils-args/blob/main/design/problems_with_clap.md, which shows how complicated these are to change in clap.

  • if we could write a clap extension for this?

I wish we could, but I don't know what that would look like. For example, I wouldn't know how to do that ls function above as an extension.

rivy commented 1 year ago

@tertsdiepraam , I'd love to have improvements in command line arguments support.

Your contributions have always been great and I'm happy to support your further experiments.

I would suggest that, if you decide to embark on this project, it should be built and developed in another repo and as it's own crate. We can set it up as a repo under the uutils banner.

I can understand @sylvestre's concern about duplication of effort and reinvention, but I empathize strongly with your position. I've contributed to clap a few times, and it's become a bit of a beast. I quite understand the "replace it" feeling that you are experiencing. πŸ˜„

sylvestre commented 1 year ago

yeah, we should probably do it :) I just want to make sure that we have really discussed about it

jfinkels commented 1 year ago

This sounds like a reasonable change to me, I've run into quite a few issues with argument parsing. So if we can make it easier to deal with silliness like

chmod u=rrrr+-+-+-www f

or

df --o

without too much effort, that would be great.

tertsdiepraam commented 1 year ago

What's the silliness you're referring to? Is it about parsing the permissions and non-ambiguous prefixes of long arguments? I have the second one implemented, but the first one will probably not be part of this library. You can define a type that implements a parser that will work though.

jfinkels commented 1 year ago

Yes, I was referring to parsing the permissions and non-ambiguous prefixes of long arguments. They were just two things that came to mind when I tried to recall trouble I had had with arguments in the past.

tertsdiepraam commented 8 months ago

10 months after opening this issue, I've resumed my work on this :). I think the next step is determining when we deem it feature-complete and battle-tested enough to be included here.

Here's what I have right now:

What's missing is:

The problem is that there's only so much I can do in a separate repo. We have to start properly integrating this if we want to find the rest of the actual problems.

There's also the issue that while we are migrating, the docs and manpages will probably be broken anyway, because we will temporarily have both clap and this parser in the repo. Unless we only work on a separate branch, but that seems suboptimal too. Happy to hear any suggestions for how we should deal with this. My solution is that we generate the manpages and docs, store them and then remove the code that generates it. That will give people a snapshot that will be pretty accurate for a while.

The markdown format for help is also missing at this point. I cut that out to actually get to something done. We can add it back in later.

If you want to see the library in action, I recommend looking at the tests that implement several coreutils.

BenWiederhake commented 4 months ago

Looks like I accidentally stumbled into another clap-impossibility: unattached multi-valued short-args, which clap would prefer not to fix, and I honestly agree that it's a silly feature that should have never been invented in the first place.

However, with clap it is impossible to implement shuf -ez hello world correctly.

EDIT: Yes, it is possible, I was just a bit too daft.

tertsdiepraam commented 4 months ago

That might be where I definitively draw the line of compatibility πŸ˜„ Holy cow, that's cursed.

BenWiederhake commented 4 months ago

I think it's pretty doable with uutil-args. If I get it to work, would you be interested, or is that too cursed?

More cursed stuff … ```console $ echo foo > a; echo bar > b $ shuf a b shuf: extra operand 'b' Try 'shuf --help' for more information. [$? = 1] $ shuf a b -e # A *later* argv string changes the interpretation of an *earlier* argv! b a $ ```
tertsdiepraam commented 4 months ago

Actually, I think this is because --echo does not take any arguments. So, it should be possible in both clap and the new parser.

Sort of related, I think we should also put some guidance in CONTRIBUTING.md about filing issues on our behalf. You obviously mean well and it's appreciated, but I'd like to minimize the time other maintainers have to spend on our problems and explore solutions on our side first. Especially with clap, we've had people file issues a few times without communicating with the uutils maintainers and I don't want Ed Page to get annoyed with us [^1] πŸ˜…

[^1]: Of course, he's been awesome about this and always provided friendly and detailed responses.

BenWiederhake commented 4 months ago

According to the GNU documentation, --echo does take arguments, and shuf a b proves that, kinda. But yes, I would implement it as a boolean flag, and then later error out if more than one "file" was supplied. Sorry, I'm getting off-topic.

I tried to keep the clap issue reasonably general, and in retrospect I agree that perhaps it would have been better to coordinate with you. Thankfully, nothing bad happened this time. :)

tertsdiepraam commented 4 months ago

Where do you see that --echotakes arguments? In the manpage it says

treat each ARG as an input line

with ARG referring to this line:

shuf -e [OPTION]... [ARG]...

and in the docs it says

Treat each command-line operand as an input line.

So I think it's just a boolean.

But, depending on the flag, the maximum number of arguments changes, which is supported by the new parser!

BenWiederhake commented 3 months ago

Here's another behavior that is impossible to replicate using clap:

$ date -Ilolwut -R -R  # parses "lolwut" before either "-R", and raises that error
date: invalid argument 'lolwut' for '--iso-8601'
Valid arguments are:
  - 'hours'
  - 'minutes'
  - 'date'
  - 'seconds'
  - 'ns'
Try 'date --help' for more information.
[$? = 1]
$ date -R -R -Ilolwut  # parses the second "-R" before "-Ilolwut", and raises that error
date: multiple output formats specified
[$? = 1]
$ date -R -Ilolwut -R  # In fact, it first tries to parse the value of "-i" before noticing duplicity
date: invalid argument 'lolwut' for '--iso-8601'
Valid arguments are:
  - 'hours'
  - 'minutes'
  - 'date'
  - 'seconds'
  - 'ns'
Try 'date --help' for more information.
[$? = 1]
$ date -R -Idate -R  # … so if the value to "-I" is valid, it notices the duplicity only afterwards.
date: multiple output formats specified
[$? = 1]

Doing this in uutils-args is actually also difficult, because Settings::apply does not (yet?) permit raising an error. See https://github.com/uutils/uutils-args/issues/112 for my suggestion how to change that.

jfinkels commented 3 months ago

From your examples, it seems like the parsing is just handled sequentially in order, terminating at the first error. Is that right?

BenWiederhake commented 3 months ago

@jfinkels: Yes, hence my proposal to implement it like this: https://github.com/uutils/uutils-args/pull/113