pepkit / looper

A job submitter for Portable Encapsulated Projects
http://looper.databio.org
BSD 2-Clause "Simplified" License
20 stars 7 forks source link

Migrate `argparse` CLI definition to a `pydantic` basis for most important commands #438

Closed simeoncarstens closed 1 month ago

simeoncarstens commented 5 months ago

As described in #433, we would like to add a HTTP API for looper. The challenge here is to keep HTTP API and CLI in sync, and in a call, @nsheff, @zz1874 and I decided to address this by replacing the argparse-based CLI command / argument definition with a definition based on pydantic models: several libraries (for example, pydantic-argparse, tyro and clipstick) allow to build CLI parsers with type checking from pydantic.BaseModel definitions.

So we agreed that our first task towards a HTTP API for looper is to redefine import CLI commands (looper {run,runp,check}) via pydantic models. Building an HTTP API that consumes data compliant with these models is then straightforward.

For now, we assume we will be using pydantic-argparse, and, as agreed with @nsheff, will build pydantic models that are compatible with it. This issue and comments below track the progress of this task.

simeoncarstens commented 5 months ago

We (@zz1874, I) wrote a hacky demo script that uses a CLI based on pydantic-argparse to run the hello_looper example. You can find it on a branch in our fork of the repository. It serves as a proof-of-concept of how to define CLI arguments in a pydantic model.

simeoncarstens commented 5 months ago

Going forward, I (this has not been discussed with neither @nsheff nor @zz1874) propose to organize the pydantic-based CLI rewrite as follows:

At the core, we introduce a dataclass called Argument, which holds all kind of information we need to know about a CLI argument / flag. That includes for now:

We then create a data structure (a tuple, or an enum) that holds an Argument instance for each argument there is. We can then draw from this pool of argument definitions to dynamically create pydantic models for each command via pydantic.create_model(). This approach allows us to define arguments only once and reuse these definitions easily for different commands. I am working on a first implementation of this approach in the following branch: https://github.com/tweag/looper/tree/tweag/pydantic-command-models

nsheff commented 5 months ago

Given that we're running into issues with packages that don't support pydantic 2.0 (see here for example: https://github.com/pepkit/pipestat/issues/127), and that pydantic-argparse appears to be at a dead-end, unresponsive as to whether pydantic 2.0 will ever be supported (https://github.com/SupImDos/pydantic-argparse/issues/48), we might want to rethink using pydantic-argparse, and instead, it might make more sense to just roll our own argparser from the Argument objects, for example, with something like this:

ttps://stackoverflow.com/questions/72741663/argument-parser-from-a-pydantic-model

nsheff commented 5 months ago

See also: https://github.com/edornd/argdantic

donaldcampbelljr commented 4 months ago

Given that we're running into issues with packages that don't support pydantic 2.0 (see here for example: pepkit/pipestat#127), and that pydantic-argparse appears to be at a dead-end, unresponsive as to whether pydantic 2.0 will ever be supported (SupImDos/pydantic-argparse#48), we might want to rethink using pydantic-argparse, and instead, it might make more sense to just roll our own argparser from the Argument objects, for example, with something like this:

I just ran into this issue with pydantic > 2.0.0 support.

Recently, we forced requirement for pephubclient >=0.4.0 ( see https://github.com/pepkit/looper/issues/453, https://github.com/pepkit/looper/commit/1c8a8ad138b1d5096a40118e1a9bfdd6ed8ca203).

As I continue working on migrating the arguments to the pydantic models, I am now running into dependency issues because pephubclient requires pydantic greater than 2.5.0 but pydantic-argparse depends on pydantic<2.0.0:

ERROR: Cannot install looper and pydantic-argparse==0.8.0 because these package versions have conflicting dependencies.

The conflict is caused by:
    pephubclient 0.4.0 depends on pydantic>2.5.0
    pydantic-argparse 0.8.0 depends on pydantic<2.0.0 and >=1.9.0

Unfortunately, I've gotten quite far in finishing the migration before realizing this issue (I realized it when I merged some downstream changes upstream).

donaldcampbelljr commented 4 months ago

There is a fork of pydantic-argparse that uses pydantic2: https://github.com/anastasds/pydantic2-argparse

I was able to implement it here and resolve dependency issues: https://github.com/pepkit/looper/commit/066e661bf2b55ad0a7d120147e7f81c23b629913

Interestingly, the fork (and now my looper implementation) requires calling the still available pydantic.v1 api:

import pydantic.v1 as pydantic

donaldcampbelljr commented 4 months ago

Another issue as I rework the tests:

For arguments that can be given a list (e.g. selecting or excluding based on multiple flags), the argparser cannot seem to handle a list:

--looper-config .looper.yaml --exc-flag "failed" "running" run --dry-run

I've tried different permutations of the syntax as well as changing the Field type but am still striking out.

simeoncarstens commented 4 months ago

@donaldcampbelljr interesting issue! I investigated a bit and the problem seems to be that pydantic-argparse doesn't know how to distinguish between list elements (such as "failed" and "running") and the subcommand ("run"). In fact, a related issue can be demonstrated with just the usual argparse Python module:

from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument("--somelist", nargs="+")
parser.add_argument("-v", "-voo")
subparsers = parser.add_subparsers(title="subcommands")
cmd_parser = subparsers.add_parser("cmd")
cmd_parser.add_argument("--somestr", type=str)

args = parser.parse_args()

yields

$ python argtest.py --somelist a b cmd --somestr "Asdf"
usage: argtest.py [-h] [--somelist SOMELIST [SOMELIST ...]] [-v V] {cmd} ...
argtest.py: error: argument {cmd}: invalid choice: 'Asdf' (choose from 'cmd')

One way to "fix" ( :roll_eyes: ) this is to use a non-list-valued argument before the subcommand, for example:

$ python argtest.py --somelist a b -v bla cmd --somestr "Asdf"

which parses correctly. But unfortunately it seems like a bug fix in pydantic[2]-argparse would be required for even that workaround to work :slightly_frowning_face: I'd be happy to look into how to fix this!

donaldcampbelljr commented 4 months ago

@simeoncarstens Interesting! Thank you for taking a look. I was also digging into it this morning. I noticed that the compute argument (also a list) does appear to be working,but it is added to the Run parser and, thus, comes after the run command:

looper --looper-config .looper.yaml run --dry-run --compute PARTITION=standard cores='32' mem='32000' 

this also works (moving the --dry-run after the list):

looper --looper-config .looper.yaml run --compute PARTITION=standard cores='32' mem='32000' --dry-run

I was thinking that one workaround would be to add all the shared arguments to the individual commands so that the list arguments can be used after the command parser. Not ideal but it might be necessary for this setup.

donaldcampbelljr commented 4 months ago

So, I tried out my proposed suggestion here: https://github.com/pepkit/looper/commit/9fdf85a518db8c429ef3a88206a122ac9cbeac01

It does seem to work. However, the position of the arguments (do they come before or after the command?) is still a bit opaque, e.g.:

looper --looper-config .looper.yaml  --output-dir "your/output/dir" run --dry-run --compute  mem='32000' --exc-flag "failed" "running"

I think I will proceed to place all of the arguments such that they come after the command. This will make the syntax similar to the previous CLI such that it will begin with looper run.

Is there any reason we should not structure it in this way? We will will have copies of the shared arguments among the parsers but I believe that is the only downside.

simeoncarstens commented 4 months ago

Another option is to monkey-patch thepydantic2-argparse. For example, we could require all list arguments to be given as comma-separated strings, and then add the following in cli_pydantic.py right after the imports:

def parse_field(
    parser: argparse.ArgumentParser,
    field: pydantic.fields.ModelField,
) -> Optional[utils.pydantic.PydanticValidator]:
    """Adds container pydantic field to argument parser.

    Args:
        parser (argparse.ArgumentParser): Argument parser to add to.
        field (pydantic.fields.ModelField): Field to be added to parser.

    Returns:
        Optional[utils.pydantic.PydanticValidator]: Possible validator method.
    """

    import pydantic2_argparse.utils as utils

    class SplitArgs(argparse.Action):
        def __call__(self, parser, namespace, values, option_string=None):
            setattr(namespace, self.dest, values.split(','))

    # Add Container Field
    parser.add_argument(
        utils.arguments.name(field),
        action=SplitArgs,
        help=utils.arguments.description(field),
        dest=field.alias,
        metavar=field.alias.upper(),
        required=bool(field.required),
    )

    # Construct and Return Validator
    return utils.pydantic.as_validator(field, lambda v: v)

pydantic2_argparse.argparse.parser.parsers.container.parse_field = parse_field

This is ugly, but would have the advantage that the actual command / argument hierarchy is maintained. If we do this, then of course the mandatory comma separation (e.g. --exc-flag=running,failed) would need to be very aggressively documented in the help string.

simeoncarstens commented 4 months ago

Putting the majority of shared arguments after the subcommand would also be possible, of course. As you say, there's then a duplication of arguments across various subcommands. But then again, it might even be preferred if you had a good reason to do so for the existing CLI. Note that then some of our changes would need to be reversed. In the end, I think it's as you and the users prefer :slightly_smiling_face:

donaldcampbelljr commented 3 months ago

Removing likely-solved after today's discussion. Cause: lack of short form arguments.

This was brought up previously in a PR: https://github.com/pepkit/looper/pull/448#issuecomment-1921642806

But to reiterate here: pydantic-argparse/pydantic2-argparse does not support short-form arguments at this time and that is currently undesirable.

We will hold on releasing the next version of Looper (1.8.0) until we incorporate the short-form arguments.

donaldcampbelljr commented 2 months ago

It appears that clipstick does allow short arguments and uses Pydantic 2: https://sander76.github.io/clipstick/usage.html#keyword-arguments

I believe it was mentioned in a meeting that we decided not to use clipstick for some reason, but I cannot seem to track down my notes on why that was decided.

simeoncarstens commented 2 months ago

I'm sorry about the late realization that short arguments are not supported by pydantic-argparse! We probably should have pointed that out more explicitly, rather than only in a comment on PR :slightly_frowning_face: As for clipstick, I think what I didn't like about it was that (quote from the documentation)

Next to that I wanted to try and build my own parser instead of using Argparse because… why not.

That doesn't exactly inspire confidence with me, but that's just opinionated on my side.

nsheff commented 2 months ago

Yes, clipstick doesn't use argparse, which I considered a fatal downside, since we use argparse in all our other projects so we're familiar with it.

donaldcampbelljr commented 1 month ago

Ok, with the latest PR I've added the short arguments. Marking this as likely solved.