vedal opened this issue 1 year ago
Sorry for all the questions. Please feel free to ignore them.
No worries. Ask as much as you need. Anyway, the questions might be helpful for other people too, and might lead to improvements in the code or the documentation.
I wonder if it's possible to define, and then override/extend, yaml defaults inside other yamls.
Not possible in a single config file. There are several possibilities:
- Giving multiple config files from the command line, later ones overriding earlier ones: --config=defaults.yaml --config=config.yaml
- Giving a config file as the value for a group of options and overriding individual keys: --data=data_defaults.yaml --data.batch_size=4
- Giving a config file as the value for a group of options and then a second config for the same group: --data=data_defaults.yaml --data=data.yaml
- Adding defaults.yaml to default_config_files and then giving any override command line arguments, without needing to specify the default config (see the sketch below).
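For the last option, a rough sketch of how the parser could be set up (the file name and the batch_size option are only illustrative):

from jsonargparse import ArgumentParser, ActionConfigFile

# defaults.yaml, if it exists, is loaded first; anything given on the command line overrides it
parser = ArgumentParser(default_config_files=["defaults.yaml"])
parser.add_argument("--config", action=ActionConfigFile)  # optional extra config files
parser.add_argument("--data.batch_size", type=int, default=8)

cfg = parser.parse_args()  # e.g. cli.py --data.batch_size=4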
Some time ago I started a branch to improve the documentation regarding the overrides but did not finish it. What I had written is:
Override order
--------------
Final parsed values depend on different sources, namely: source code, command
line arguments, :ref:`configuration-files` and :ref:`environment-variables`.
Values are overridden based on the following precedence:
1. Defaults defined in the source code.
2. Existing default config files in the order defined in
``default_config_files``, e.g. ``~/.config/myapp.yaml``.
3. Full config environment variable, e.g. ``APP_CONFIG``.
4. Individual key environment variables, e.g. ``APP_OPT1``.
5. Command line arguments in order left to right (might include config files).
Depending on the parse method used (see :class:`.ArgumentParser`) and how the
parser was built, some of the options above might not apply. Parsing of
environment variables must be explicitly enabled, except if using
:py:meth:`.ArgumentParser.parse_env`. If the parser does not have an
:class:`.ActionConfigFile` argument, then there is no parsing of a full config
environment variable or a way to provide a config file from command line.
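As a rough sketch of how these sources can come together (the APP prefix and the opt1 option are illustrative, not from any particular project):

from jsonargparse import ArgumentParser, ActionConfigFile

# default_env=True enables parsing of individual environment variables such as APP_OPT1;
# the ActionConfigFile argument additionally enables the full config variable APP_CONFIG
parser = ArgumentParser(
    env_prefix="APP",
    default_env=True,
    default_config_files=["~/.config/myapp.yaml"],
)
parser.add_argument("--config", action=ActionConfigFile)
parser.add_argument("--opt1", default="from source code")

# precedence: source code < ~/.config/myapp.yaml < APP_CONFIG < APP_OPT1 < command line
cfg = parser.parse_args()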
Why do you want to have the input to the cli in a single config? What value does it bring?
In my view there isn't a difference with respect to specifying a couple of command line arguments. What I do find important is that once an experiment has been run, it is possible to know what was run, i.e. automatic logging of the config like in https://pytorch-lightning.readthedocs.io/en/latest/cli/lightning_cli_advanced.html#automatic-save-of-config. But this should not depend on having a single input config file.
Currently jsonargparse.CLI does not provide a way to implement an automatic save of the config. It can only be done by manually creating and running the parser.
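For reference, a rough sketch of what the manual approach could look like (the function and file names are made up for illustration):

from jsonargparse import ArgumentParser, ActionConfigFile

def train(batch_size: int = 8, learning_rate: float = 1e-3):
    ...  # hypothetical training entry point

parser = ArgumentParser()
parser.add_argument("--config", action=ActionConfigFile)
parser.add_function_arguments(train)

cfg = parser.parse_args()
parser.save(cfg, "run_config.yaml", overwrite=True)  # the "automatic" save, done manually
train(cfg.batch_size, cfg.learning_rate)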
@vedal did my comment answer your questions? Can this be closed now?
@mauvilsa yes, this definitely was a thorough answer to my question, and I appreciate it a lot.
Why do you want to have the input to the cli in a single config? What value does it bring?
I did not answer your question, however, as I needed to work a bit with the CLI to see how it would fit my needs.
In my current setup, I run exploratory experiments with different models and datasets, with configs divided per dataset (data_i.yaml) and per model (model_i.yaml).
I also like to use the default Lightning folder structure for storing checkpoints, logs and hparams. For this, I need to override the logger experiment name in a config file (let's call it "experiment1.yaml") in the following way:
trainer:
  logger:
    init_args:
      name: experiment_name
along with some small changes to data_i and model_i, which could also go in experiment1.yaml:
model:
  batch_size: 4
data:
  batch_size: 4
So, in the end, I'd end up using 4-5 configs for each experiment.
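Concretely, the invocation ends up looking something like this (file names purely illustrative):
cli.py --config trainer_defaults.yaml --config experiment1.yaml --data data_i.yaml --model model_i.yaml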
The alternative I imagined, where I also keep track of which experiments are related (they share the same experiment yaml), is the following:
trainer:
  logger:
    init_args:
      name: experiment_name
data: data_i.yaml
model: model_i.yaml
model:
  batch_size: 4
data:
  batch_size: 4
Phew, that was super-long and probably boring, I'm sorry about that. You're right that it can all be defined on the CLI (except probably the experiment name hack; it's a pain!). Your suggestion might well be the simplest way to solve this, without making the mess that hydra-configs can become (with overrides left and right).
I think what I struggle with is the separation of an experiment_name from the dataset/model choice that went into it.
Maybe an option for keeping track of which parameters went into each experiment, without checking each output yaml individually, is to log them to tensorboard as hparams... I haven't tried that yet.
One question however: do you usually always print_config before every experiment, as a kind of "--dry-run" to check that all hparams are ok?
Anyway, thank you again for making the best config system I've been able to find.
Same here, really looking forward to a way that could "inherit" from another config file and modify the arguments, or even compose multiple config files inside a single config file (like the example of Hydra provided in this issue).
If you are new to jsonargparse, it is best to familiarize yourself with its override order. Just because something can be done in a certain way in Hydra does not mean that the same should be implemented in jsonargparse. There is no point in adding yet another way to do something that already has alternatives. For a new feature to be added, a compelling motivation should be clear, and currently, in my view, there isn't one.
@function2-llx in your case, why must it be a single config that "inherits" other configs? The example in the description is the same as doing cli.py --data=data_defaults.yaml --data=data.yaml or the already mentioned alternatives. The point of a CLI is that you provide arguments to it. There is not much reason to limit yourself to a single config argument.
Also note that using a config for a group of settings inside another config is possible, e.g.
data: data_i.yaml
model: model_i.yaml
Though, without the need of that additional config, from command line it would be the same as
cli.py --data=data_i.yaml --model=model_i.yaml
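For context, a rough sketch of what such a cli.py could look like, assuming jsonargparse.CLI over a function with dataclass groups (all names are placeholders):

from dataclasses import dataclass
from jsonargparse import CLI

@dataclass
class DataConfig:
    batch_size: int = 8
    num_workers: int = 0

@dataclass
class ModelConfig:
    hidden_dim: int = 128

def main(data: DataConfig = DataConfig(), model: ModelConfig = ModelConfig()):
    ...  # hypothetical experiment entry point

if __name__ == "__main__":
    # accepts individual keys (--data.batch_size=4), a config per group
    # (--data=data_i.yaml) as discussed above, and a full --config file
    CLI(main)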
do you usually always print_config before every experiment, as a kind of "--dry-run" to check that all hparams are ok?
I do use --print_config extensively, though mostly for debugging, not every time before running a command. This is partly because lately I mostly enable other people to run experiments rather than running them myself. I do check what other people do, but for that the automatic save of the config that LightningCLI has is enough. Anyway, my impression is that other people do commonly use print_config.
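For reference, a typical use would be something like (config name as in the earlier example):
cli.py --config experiment1.yaml --print_config
which prints the fully parsed config and exits without running anything.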
@rusmux as explained here, multiple arguments are the alternative. From what I understand, the only motivation is that sometimes it is inconvenient to use multiple arguments. This is a valid motivation. Though note that this feature is considerably complex, which might make this motivation not enough.
Sorry for all the questions. Please feel free to ignore them.
I wonder if it's possible to define, and then override/extend, yaml defaults inside other yamls. This is supported in Hydra through default configs.
The reason I'd like to have this is to have an entire experiment defined inside a yaml, both the default and "override" values. An alternative would be specifying several --config arguments on the command line. Example: say I want to use most values from default_data_config.yaml but change batch_size.
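In Hydra, I'd write an experiment config roughly like this (a sketch from memory of Hydra's defaults list; the exact group layout is a guess):

# experiment config (Hydra-style sketch)
defaults:
  - data: default_data_config
  - _self_

data:
  batch_size: 4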
This way, data will be populated by the defaults and batch_size will be overridden. The "_self_" entry at the bottom of the defaults list means that the configs in the current file override the defaults. I noticed that referencing yamls inside other yamls is supported in jsonargparse, but I could not find anything about overrides.