lebrice / SimpleParsing

Simple, Elegant, Typed Argument Parsing with argparse
MIT License

Allow parsing multiple nested sub commands #130

Closed janvainer closed 2 years ago

janvainer commented 2 years ago

What I am trying to do:

I am training an encoder-decoder neural network and I have multiple different encoders and decoders. Each encoder and decoder has a separate dataclass config.

I want to be able to specify from the command line which encoder to use by giving its config name. Once I specify that, I also want to be able to adjust the selected config's fields. So far I was able to accomplish this with subparsers:

I have something like

@dataclass
class Model:
    encoder: Union[RNNEncoder, ConvEncoder] = RNNEncoder() 
    decoder: RNNDecoder = RNNDecoder()

And I simply parse it just like in the subparsers example: I specify the encoder keyword, and it then lets me pass args specific to the selected architecture.

The problem is, if I want to do the same also for decoder, subparsers are not available anymore and I get

error: cannot have multiple subparser arguments
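This restriction comes from argparse itself, not from simple_parsing. A minimal stdlib-only reproduction (the second `add_subparsers()` call triggers `parser.error()`, which exits):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_subparsers(dest="encoder")

try:
    # argparse allows only one set of subparsers per parser; the second
    # call prints "cannot have multiple subparser arguments" and exits.
    parser.add_subparsers(dest="decoder")
except SystemExit:
    print("second add_subparsers() call was rejected")
```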

Ideal state

How can I solve my problem with simple_parsing? My ideal syntax would probably be to use the Union type in the dataclasses and call it like so:

python script.py --some_unrelated_args \
    encoder rnnencoder --args_related_to_rnnencoder \
    decoder convdecoder --args_related_to_convdecoder

Solutions & workarounds

There are quite a few suggestions in this SO thread. The issue can be solved with multiple subparsers plus some user-side argv grouping. The following code allows me to call my script like so:

python script.py \
    --glob1.xx 7 \
    --glob2.xx 5 \
    encoder convencoder \
      --y 2 \
    decoder convdecoder \
      --n 7

And get

Namespace(decoder=Decoder(decoder=ConvDecoder(n=7)), encoder=Encoder(encoder=ConvEncoder(y=2)), glob1=Global1(xx=7, yy='hello'), glob2=Global2(xx=5, yy='hello'))

The code:

import sys
import itertools
from functools import partial

from dataclasses import dataclass

from typing import Union
from simple_parsing import ArgumentParser
from argparse import Namespace

@dataclass
class RNNEncoder:
    x: int = 1

@dataclass
class ConvEncoder:
    y: int = 2

@dataclass
class RNNDecoder:
    m: int = 3

@dataclass
class ConvDecoder:
    n: int = 4

@dataclass
class Encoder:
    encoder: Union[RNNEncoder, ConvEncoder] = RNNEncoder()

@dataclass
class Decoder:
    decoder: Union[RNNDecoder, ConvDecoder] = RNNDecoder()

@dataclass
class Global1:
    xx: int = 5
    yy: str = "hello"

@dataclass
class Global2:
    xx: int = 5
    yy: str = "hello"

parser = ArgumentParser()

parser.add_arguments(Global1, dest="glob1")
parser.add_arguments(Global2, dest="glob2")

sub = parser.add_subparsers()
encoder = sub.add_parser("encoder")
encoder.add_arguments(Encoder, dest="encoder")
decoder = sub.add_parser("decoder")
decoder.add_arguments(Decoder, dest="decoder")

def groupargs(arg, commands, currentarg=[None]):
    # Mutable-default trick: remember the last subcommand name seen, so that
    # itertools.groupby buckets every token under the command it follows.
    if arg in commands:
        currentarg[0] = arg
    return currentarg[0]

rest = "tmp.py encoder convencoder --y 7 decoder convdecoder --n 6".split()  # or sys.argv

argv = rest  # sys.argv
commandlines = [
    (cmd, list(args))
    for cmd, args in itertools.groupby(argv, partial(groupargs, commands=sub.choices))
]
commandlines[0][1].pop(0)  # drop the program name from the "global" group

namespaces = dict()
for cmd, cmdline in commandlines:
    n, r = parser.parse_known_args(cmdline)
    assert len(r) == 0, f"Provided unknown args {r} for command {cmd}"
    if cmd is None:
        namespaces["global"] = n
    else:
        namespaces[cmd] = getattr(n, cmd)

args = Namespace(
    **vars(namespaces.pop("global")),
    **namespaces,
)
print(args)

The result looks correct and leverages simple_parsing's argument name resolution etc., so it's quite convenient. Some caveats and ugliness:

  1. Each Union-like option must be in a separate subparser and needs a separate dataclass
  2. Attributes from the Union-like classes are accessed like args.encoder.encoder, but it would be nicer to have args.encoder and get the final class straight away

What would be better:

  1. Allow multiple Union fields in simple-parsing and handle the subparsers internally, as described above
  2. Directly fill in the user-selected class on the parent class, so that calls like args.glob1.encoder work

The configs could look like this:

@dataclass
class Global:
    letter: Union[A, B] = A()
    number: Union[One, Two] = One()
    some_other_arg: int = 5

parser = ArgumentParser()
parser.add_arguments(Global, dest="glob")
args = parser.parse_args()

and we would get the same output as in the example above.

Other stuff I tried:

I also tried using an Enum to store the configs for the encoder types. It lets me select the correct config, but I cannot adjust the selected config's params. I also tried choice, but it did not accept a dataclass as an argument.

idoby commented 2 years ago

Can you show Encoder1, Encoder2 and Decoder1/2 in your example?

janvainer commented 2 years ago

@idoby thank you for the response! I adjusted the example to use Encoder and Decoder abstractions.

idoby commented 2 years ago

Looks like argparse doesn't support this easily, which kind of makes sense because this style is not very unix-like. A more standard command line would look like:

python script.py --some_unrelated_args \
    --encoder=rnnencoder --args_related_to_rnnencoder \
    --decoder=convdecoder --args_related_to_convdecoder

Would that be a good solution @janvainer ? It's not too hard to implement this with SimpleParsing but it is a little hacky. I suggest we add support for this kind of "polymorphic" command line in SP. What do you think @lebrice ?

I implemented something like this in my project in a few lines but it would be nice if SP had built-in support.

janvainer commented 2 years ago

Yeah, this would be great, especially if it's less hacky than what I suggested! :) As long as I can do

@dataclass
class Global:
    letter: Union[A, B] = A()
    number: Union[One, Two] = One()
    some_other_arg: int = 5

I am very happy :D

lebrice commented 2 years ago

Greetings @janvainer, thanks for posting!

This is the eternal thorn in my foot, and it's quite possible IMO that this issue of Argparse was one of the biggest motivations for other approaches like Hydra's.

I've worked on this kind of thing a LOT. There is no neat way of doing this in SP, and I don't quite see a clean way forward.

I implemented something like this in my project in a few lines but it would be nice if SP had built-in support.

Talk is cheap @idoby . Show me the code! :stuck_out_tongue:

As for your question @janvainer, the best solution I've found so far is to add every combination of nested subparsers. It sounds ugly, but it's actually pretty neat. The only thing you have to watch out for is the order in which the arguments are passed. It's also a bit buggy with lists and other field types that use nargs='*'.

I called this feature add_independent_subparsers. There's a proof-of-concept for it in the lebrice/multiple_subparsers branch, which I'll probably make a PR for soon.

Here is an example:

from typing import ClassVar, Tuple
from simple_parsing.helpers.independent_subparsers import add_independent_subparsers

from dataclasses import dataclass
from simple_parsing import ArgumentParser
from pathlib import Path
import os

@dataclass
class DatasetConfig:
    """ Configuration options for the dataset. """

    image_size: ClassVar[Tuple[int, int, int]]

    # Number of samples to keep. (-1 to keep all samples).
    n_samples: int = -1
    # Whether to shuffle the dataset.
    shuffle: bool = True

@dataclass
class MnistConfig(DatasetConfig):
    """ Configuration options for the MNIST dataset. """

    image_size: ClassVar[Tuple[int, int, int]] = (28, 28, 1)
    n_samples: int = 10_000  # some random number just for sake of illustration.
    shuffle: bool = True
    foo: str = "foo_hey"

@dataclass
class ImageNetConfig(DatasetConfig):
    """ Configuration options for the ImageNet dataset. """

    image_size: ClassVar[Tuple[int, int, int]] = (28, 28, 1)
    n_samples: int = 10_000_000  # some random number just for sake of illustration.
    shuffle: bool = False
    # Path to the imagenet directory.
    path: Path = Path(os.environ.get("IMAGENET_DIR", "data/imagenet"))

@dataclass
class ModelConfig:
    """ Configuration options for the Model. """

    # Learning rate.
    lr: float = 3e-4

@dataclass
class SimpleCNNConfig(ModelConfig):
    """ Configuration options for a simple CNN model. """

    lr: float = 1e-3

@dataclass
class ResNetConfig(ModelConfig):
    """ Configuration options for the ResNet model. """

    lr: float = 1e-6

def main():
    parser = ArgumentParser(description=__doc__)
    add_independent_subparsers(
        parser,
        dataset={"mnist": MnistConfig, "imagenet": ImageNetConfig},
        model={"simple_cnn": SimpleCNNConfig, "resnet": ResNetConfig},
    )

    args = parser.parse_args()
    dataset_config: DatasetConfig = args.dataset
    model_config: ModelConfig = args.model

    print(f"Args: {args}")
    print(f"Dataset config: {dataset_config}")
    print(f"Model config: {model_config}")

if __name__ == "__main__":
    main()

$ python examples/multiple_subparsers/multiple_subparsers_example.py --help
usage: multiple_subparsers_example.py [-h] <dataset>|<model> ...

optional arguments:
  -h, --help         show this help message and exit

dataset or model:
  <dataset>|<model>
    mnist            Configuration options for the MNIST dataset. (help)
    imagenet         Configuration options for the ImageNet dataset. (help)
    simple_cnn       Configuration options for a simple CNN model. (help)
    resnet           Configuration options for the ResNet model. (help)
$ python examples/multiple_subparsers/multiple_subparsers_example.py mnist --help
usage: multiple_subparsers_example.py mnist [-h] [--n_samples int] [--shuffle bool] [--foo str] <model> ...

 Configuration options for the MNIST dataset. (desc.)

optional arguments:
  -h, --help       show this help message and exit

model:
  <model>
    simple_cnn     Configuration options for a simple CNN model. (help)
    resnet         Configuration options for the ResNet model. (help)

MnistConfig ['command_0']:
   Configuration options for the MNIST dataset. 

  --n_samples int  some random number just for sake of illustration. (default: 10000)
  --shuffle bool   (default: True)
  --foo str        (default: foo_hey)
$ python examples/multiple_subparsers/multiple_subparsers_example.py mnist resnet
Args: Namespace(dataset=MnistConfig(n_samples=10000, shuffle=True, foo='foo_hey'), model=ResNetConfig(lr=1e-06))
Dataset config: MnistConfig(n_samples=10000, shuffle=True, foo='foo_hey')
Model config: ResNetConfig(lr=1e-06)

Things to watch out for:

It's not a perfect solution, and there are probably bugs lurking in there, but it should hopefully be useful to you or to others, even in its current form.

Hope this helps. If you want to use it, I suggest copying the contents of this function here, or waiting until I've merged this to master with a PR.

idoby commented 2 years ago

Well, it's a hack, but here goes.

I was testing out different optimizers and hyperparameters and didn't like that my experiments had this configuration code in them, so I made this (details omitted for clarity):

def optimizer(model_params, train_config: TrainConfig) -> torch.optim.Optimizer:
    if train_config.optimizer_type.lower() == 'adam':
        adam_config = poly_config(AdamConfig, 'adam_')
        return torch.optim.Adam(model_params, lr=train_config.lr, weight_decay=train_config.weight_decay,
                                betas=adam_config.betas)
    elif ...:

train_config is a dataclass that includes the optimizer_type field, which SP reads from the command line; poly_config then injects the appropriate class and re-parses the command line at runtime. There's nothing stopping me from making this more robust with automatic mappings etc. This allows me to have a hierarchy of *Config types, one per optimizer, where client code only ever sees the appropriate concrete type at runtime (e.g. AdamConfig, SGDConfig, etc.). The line where I use this class to create a torch Adam object is just to illustrate the point.
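Since poly_config's details are omitted above, here is a hypothetical reconstruction of the idea using only stdlib argparse. The AdamConfig name and the prefix convention come from the comment; the implementation (a throwaway parser built from the dataclass fields, re-parsing argv with parse_known_args) is an assumption, not the actual code:

```python
import argparse
from dataclasses import MISSING, dataclass, fields


@dataclass
class AdamConfig:
    betas: float = 0.9  # simplified to one float; the real config uses a pair


def poly_config(config_cls, prefix, argv):
    # Hypothetical sketch: build a throwaway parser whose options are the
    # dataclass fields with `prefix` prepended, then re-parse argv with
    # parse_known_args so that options belonging to other parts of the
    # command line are simply ignored.
    parser = argparse.ArgumentParser(add_help=False)
    for f in fields(config_cls):
        default = None if f.default is MISSING else f.default
        parser.add_argument(f"--{prefix}{f.name}", type=f.type, default=default)
    known, _unknown = parser.parse_known_args(argv)
    return config_cls(**{f.name: getattr(known, prefix + f.name) for f in fields(config_cls)})
```

For example, `poly_config(AdamConfig, "adam_", ["--lr=0.0001", "--optimizer_type=adam", "--adam_betas=0.95"])` would pick out only the `--adam_betas` option and build the config from it.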

The resulting command line is: python myscript.py --lr=0.0001 --optimizer_type=adam --adam_betas=0.9 0.9

If you don't supply adam_betas, you either get a default value or an error if there's no default. Once I automate this, I will also be able to throw an error if any arguments are left over when the whole parsing procedure is done (i.e. the user specified --adam_betas in an invalid context, such as not having selected Adam first).

While it is a hack, doing hacks on top of argparse seems to be SP's job. However, since SP subclasses AP and provides the same interface, I'm not sure multiple parsing passes can be implemented in SP without breaking AP's contract. I'll need to think about this some more.

On second thought, it seems better for SP to stick to simple parsing and let my library do all the hacks on top of that.

janvainer commented 2 years ago

@lebrice Thank you for the response! Much appreciated. Do you think there is a clean way to leverage the Union typing to have the multiple subcommand options as fields in a parent config?

I.e.

@dataclass
class Global:
    letter: Union[A, B] = A()
    number: Union[One, Two] = One()
    some_other_arg: int = 5

If not, I'll simply use what you're suggesting; it does what I want in the end and is simple :) Any ETA on merging this?

Edit: on second thought, it may be a bit confusing that the subparser options have no keyword - either script.py --dataset=mnist or script.py dataset mnist would be more informative than just script.py mnist. WDYT?

lebrice commented 2 years ago

Thanks for the response @idoby !

I've also used this kind of approach before (also for types of optimizers) and it worked well for a while, until I wasted a bunch of jobs in an HPO sweep because I misspelled an argument, and an error wasn't raised explicitly.

There are ways around this, however. For instance, in your setup, if you had the poly_config function update a global variable containing all the leftover arguments, removing some arguments on each pass, then at the point where all the args are supposed to have been parsed (e.g. once the models and optimizers have been created), you could raise an error if there are any unused arguments left over.
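That leftover-tracking idea can be sketched with a tiny stdlib-only helper (hypothetical names, not SP or idoby's actual API): a module-level pool of not-yet-consumed tokens that each parsing pass shrinks, plus a final check.

```python
import argparse
import sys

# Sketch of the suggestion above: keep the not-yet-consumed argv tokens in a
# module-level pool. Every parsing pass removes what it recognizes; once all
# objects are built, an explicit check raises if anything is left over.
_leftover = None


def consume(parser, argv=None):
    global _leftover
    if _leftover is None:
        _leftover = list(sys.argv[1:] if argv is None else argv)
    args, _leftover = parser.parse_known_args(_leftover)
    return args


def check_all_consumed():
    if _leftover:
        raise SystemExit(f"unrecognized arguments: {_leftover}")
```

A misspelled option then survives every pass and surfaces in `check_all_consumed()` instead of being silently dropped.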

Do you think there is a clean way to leverage the Union typing to have the multiple subcommand options as fields in a parent config?

Sure, it would be possible to do, but it wouldn't be that easy. We'd need to collect all the subparser fields on all dataclasses at a given level (i.e. all the dataclasses passed to parser.add_arguments for the same parser) and then add them after all the other non-subparser arguments. I'm also not sure if this would need to be recursive, i.e. collect all the subparser fields of all subparser fields, etc.

it may be a bit confusing that the subparser options have no keyword - either script.py --dataset=mnist or script.py dataset mnist would be more informative than just script.py mnist

Agreed 100%, but it isn't possible to pass a subparser name as an argument in Argparse, as far as I can tell. If someone figures out a way to do this with regular argparse that isn't too hacky, I'd be very interested.

One idea could be to replace --model=resnet in the argv that we receive with just resnet and then pass it down to argparse. That would also be super hacky, but it might work?
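That argv rewrite can be sketched in a few lines (a hypothetical helper, not part of SP): scan the tokens, and whenever a `--name=value` flag names a known subparser, emit the bare `value` positional that argparse expects.

```python
def promote_subparser_flags(argv, subparser_names):
    # Rewrite "--model=resnet" into the bare positional "resnet" that an
    # argparse subparser expects, leaving every other token untouched.
    out = []
    for token in argv:
        name, eq, value = token.partition("=")
        if eq and name.startswith("--") and name[2:] in subparser_names:
            out.append(value)
        else:
            out.append(token)
    return out
```

E.g. `promote_subparser_flags(["--lr=0.1", "--model=resnet"], {"model"})` gives `["--lr=0.1", "resnet"]`, which can then be fed to `parser.parse_args(...)` as usual. Order-sensitivity remains, since argparse still sees positionals.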

Let me know what you think! :)

idoby commented 2 years ago

I've also used this kind of approach before (also for types of optimizers) and it worked well for a while, until I wasted a bunch of jobs in an HPO sweep because I misspelled an argument, and an error wasn't raised explicitly.

Yeah, that's why I want to raise the error explicitly and this requires a process that can keep track of the state across multiple parse operations, which my library is doing anyway (and SP isn't and IMO shouldn't).

I'll get around to this at some point, maybe after NeurIPS...

lebrice commented 2 years ago

By the way, I'm not opposed to adding this kind of mechanism directly in Simple-Parsing @idoby !

For instance, one idea would be to not create argparse subparsers at all when the dataclass Union syntax is used (e.g. model: Union[ResNetConfig, VggConfig]). Instead, we could replace a model subparser with a --model argument, have parser.parse_args() call parser.parse_known_args() internally, and then apply the kind of trick you mention while parsing: resolve the type of each field, parse the remaining args, and so on, until all fields are resolved. Then, if there are still leftover arguments, we'd raise an error just like argparse would.

We'd just have to design it first, and make it transparent to the user as to what exactly is going on.
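The multi-pass resolution described above can be sketched with plain argparse (the ResNetConfig/VggConfig names come from the comment; the two-pass structure and the `parse` helper are assumptions about the design, not SP code):

```python
import argparse
from dataclasses import dataclass


@dataclass
class ResNetConfig:
    lr: float = 1e-6


@dataclass
class VggConfig:
    lr: float = 1e-3


MODEL_CHOICES = {"resnet": ResNetConfig, "vgg": VggConfig}


def parse(argv):
    # Pass 1: resolve which concrete class the Union field should use.
    first = argparse.ArgumentParser(add_help=False)
    first.add_argument("--model", choices=MODEL_CHOICES, required=True)
    known, rest = first.parse_known_args(argv)
    model_cls = MODEL_CHOICES[known.model]

    # Pass 2: parse that class's own fields from the remaining tokens.
    second = argparse.ArgumentParser(add_help=False)
    second.add_argument("--lr", type=float, default=model_cls.lr)
    known2, rest = second.parse_known_args(rest)

    # Anything still left over is a typo or an option from the wrong model.
    if rest:
        raise SystemExit(f"unrecognized arguments: {rest}")
    return model_cls(lr=known2.lr)
```

The key property is that errors are still raised eagerly at the end, exactly as argparse would, even though parsing happens in several `parse_known_args` passes.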

idoby commented 2 years ago

I think it would be nice, but I'm not sure it fits the argparse mental model. I do think subparsers were not meant to do what SP is doing with them, and this solution could be a better approach. Subparsers are meant for simple git-like commands; they are not a main feature of argparse.

lebrice commented 2 years ago

Made a discussion for this feature, with the pseudocode of how I plan to do this: https://github.com/lebrice/SimpleParsing/discussions/134