omni-us / jsonargparse

Implement minimal boilerplate CLIs derived from type hints and parse from command line, config files and environment variables
https://jsonargparse.readthedocs.io
MIT License
320 stars 47 forks source link

Allow partial updates of nested dict values #236

Closed odedbd closed 1 year ago

odedbd commented 1 year ago

🚀 Feature request

Add support for partial update of nested dict values, instead of the current behavior which is to override Dict values completely.

Motivation

I'm using jsonargparse as part of LightningCLI. My code is for research model training, which means that it changes a lot. Before starting to use jsonargparse, I would have my DataModule and LightningModule accept several configuration parameters as dictionaries, with some nesting structure.

Switching to jsongargparse, I want to be able to keep this pattern and allow partial override of nested values by using multiple configuration files. I might have a main config with most of the parameters, and then add a debug config with just a few overrides of things that need to change for local debugging runs.

I managed to reproduce the behavior I wanted by switching all the dictionaries into dataclasses, but the end result was too cumbersome to advocate for my team to adopt. It's nice to be able to get command line help and default values configs, but it isn't necessary for everything we do, and the cost of setting up nested classes with the additional class_path/init_args that start to fill the config files is a bit too much for our needs (although I totally see the benefits for larger groups or mor production oriented groups).

I hope this might make sense, I will be happy to further discuss. I trully appreciate all the great work being put into jsonargparse and the integration with LightningCLI.

Pitch

I suggest allowing partial overrides, such that merging the namespaces of two configurations will not completely override the dictionary, instead only keys that were changed or added will be overridden.

I didn't prepare a full pull request, but I did a minimal hackish implementation of the intended behavior:

from typing import Union, Any, Optional, Dict

from jsonargparse import Namespace
from jsonargparse.namespace import dict_to_namespace

def update(self, value: Union['Namespace', Any], key: Optional[str] = None,
           only_unset: bool = False) -> 'Namespace':
    """Sets or replaces all items from the given nested namespace.

    Args:
        value: A namespace to update multiple values or other type to set in a single key.
        key: Branch key where to set the value. Required if value is not namespace.
        only_unset: Whether to only set the value if not set in namespace.
    """
    if not isinstance(value, Namespace):
        if not key:
            raise KeyError('Key is required if value not a Namespace.')
        if not only_unset or key not in self:
            if key not in self or value is not None:
                self[key] = value
            elif value is None:
                del self[key]
    else:
        prefix = key + '.' if key else ''
        for key, val in value.items():
            if not only_unset or prefix + key not in self:
                self.update(val, prefix + key)
    return self

Namespace.update = update

n1 = dict_to_namespace({'key1': 'val1', 'key2': {'key3': 'val3', 'key4': 'val4'}})
print(n1)
# Namespace(key1='val1', key2=Namespace(key3='val3', key4='val4'))

n2 = dict_to_namespace({'key1': None, 'key2': {'key4': 'val6', 'key5': 'val7'}})
print(n2)
# Namespace(key1=None, key2=Namespace(key4='val6', key5='val7'))

print(n1.update(n2))
# Namespace(key2=Namespace(key3='val3', key4='val6', key5='val7'))

n1 = dict_to_namespace({'key1': 'val1', 'key2': {'key3': 'val3', 'key4': 'val4'}})
print(n2.update(n1))
# Namespace(key1='val1', key2=Namespace(key3='val3', key4='val4', key5='val7'))

Note that to allow deleting a key via overriding I hijacked the None value such that an override of None are interpreted as deleting the keys, rather than setting the to None. this is an optional bevaior and can be removed. I preferred this since dict access using .get will result in None regardless, so there is a safe pattern for using this.

Alternatives

A partial alternative (no pun intended) is to allow using dataclasses as config values without specifying class_path/init_args as part of the config. If the parameter has a class type hint, assume that the properties nested under it are the init_args for that class. This could be a behavior controlled by a flag for those who wish to relax the checks such as myself. This will still require deifining new class for each nested dictionary within my params, but will not make the configuration files too complex with added boilerplate and nesting levels

mauvilsa commented 1 year ago

allow using dataclasses as config values without specifying class_path/init_args

dataclasses don't work as subclasses, so they don't require class_path/init_args. Has it been necessary for you in some case?

Add support for partial update of nested dict values, instead of the current behavior which is to override Dict values completely.

Update of dictionaries can be added, but not like you suggest. The behavior of completely replacing the value when the type is dict will not change because it would be a breaking change and it would become inconsistent with other types, e.g. list. What could be done is something similar to the list append. Would be something like:

>>> parser.add_argument('--dict', type=Dict[str, int])  
>>> parser.parse_args(['--dict={"a": 1, "b": 2}', '--dict+={"c": 3, "a": 4}'])
Namespace(dict={"a": 4, "b": 2, "c": 3})

To update from a config file, the key with + suffix would be used, analogous to how list append works now. Note that this would not work for nested dictionaries.

Also I think for consistency, setting a dict entry to None would not delete the entry. Anyway this does not change what you said since ".get will result in None regardless".

Some comments about your code snippet. Namespaces are not aware of parameter types, so the change would not be in the Namespace class but in the parsing code. You are free to do something like that in your own project, but likely it will have other unexpected consequences.

odedbd commented 1 year ago

Thanks for the detailed explanation. I really like the idea of using + as an append symbol for dictionaries. Why wouldn't it be possible to make it work with nested dictionaries though?

Regarding dataclasses, I think it might be the way that I am (ab)using them. Since I am adapting an existing codebase which heavily relies on the dictionary API, I created a base class, which subclasses the dict class and patched in support for the dict API which works with the fields of the dataclasses. I then use the @dataclass decorator on sublasses of that base class. I don't actually have to subclass dict for this functionality, but if I don't, and omegaconf is installed, I get an error using these, which subclassing dict resolved. I am guessing perhaps since I am subclassing a base class these are no longer pure dataclasses, so the code might be treating them differently.

Here is a shorthand of what I implemented:

class BaseParams(collections.abc.Mapping, dict):
    def __getitem__(self, item):
        return self.__dict__.__getitem__(item)

    def __iter__(self):
        return self.__dict__.__iter__()

    def __len__(self):
        return self.__dict__.__len__()

    def get(self, key, default=None):
        val = self[key]
        return val if val is not None else default

@dataclass
class MyParams(BaseParams):
  f1:str=''
  f2:Optional[int]=None

my_params1 = MyParams(f2=5,f1='as')
my_params2 = MyParams(f1='ds')

# because of the dict patching I can now do
f2_1=my_params1.get('f2')
f2_2=my_params2.get('f2', f2_1)
print(f2_2)
# 5
# to do the same without the .get convenience method I can do
f2_2 = my_params2.f2 if my_params2.f2 is not None else f2_1

I will try modifying one of my datalcasses to pure dataclasses and see if I don't need the class_path/init_args for that anymore.

I know this is pretty hacky, but I'm trying to adapt a non-trivial codebase and hoped for some way to not have to rewrite its every nook and cranny.

odedbd commented 1 year ago

UPDATE: I verified that using pure dataclasses indeed works fine without class_path/init_args. Sorry for my confusion there. I also realized I can modify my code to use dataclasses.asdict in my __init__ functions to convert the datadlasses to dictionaries, which will allow the downstream code to keep using the dict API. I will have to implement a simple utility get method to handle the None properties, but that's easy to do.

With this in mind, using (pure) dataclasses is a viable solution for my usecase. I would have preferred the + for dict solution; but if it cannot support nested dictionaries it will not be useable for us.

odedbd commented 1 year ago

Another UPDATE: Seems I was a little too quick to say it works as expected with pure dataclasses. I found one edge cases - when a dataclass parameter is nested inside another dataclass, and it is defined as an Optional field, it requires providing class_path/init_args.

A concrete example -

@dataclass
class Nested:
  f1: str = ''
  f2: int = 1

@dataclass
class Parent:
  fn: Optional[Nested] = None

In the above setup, for the config.yaml for Parent it would have to be

fn:
  class_path: Nested
  init_args:
    f2: 3

If I change Parent making Nested non-Optional, class_path/init_args are no longer needed. If I use the above configuration and do not add class_path/init_args, I get an error like this -

Does not validate against any of the Union subtypes
  Subtypes: (<class 'Nested'>, <class 'NoneType'>)
  Errors:
    - Not a valid Nested
    Subclass types expect one of:
    - a class path (str)
    - a dict with class_path entry
    - a dict without class_path but with init_args entry (class path given previously)
    - Expected a <class 'NoneType'>
  Given value type: <class 'dict'>
odedbd commented 1 year ago

@mauvilsa should I open a separate issue regarding the Optional nested dataclass behavior I described in my previous comment, or is this the intended behavior?

Also, should I open a new feature request for the + support for dictionaries? I'm still not sure what wouldn't work in the case of nested dictionaries. Is it only that the + sign can only be specified on the top level, and not for dictionaries nested within? or that a + on the top level would mean that first-level keys which aren't in the + config are not removed, but those keys which are will be completely replaced with the new values, rather than recursively updated?

thanks again for your help with this

mauvilsa commented 1 year ago

should I open a new feature request for the + support for dictionaries?

No. This issue already covers this.

Why wouldn't it be possible to make it work with nested dictionaries though?

Technically it would be possible, but it goes against the overall design that jsonargparse follows related to types. The idea is that types are interpreted just as what they mean in python, not be specific to jsonargparse. When a type dict is used, it means a dictionary that can have any key and value. In a command line argument --dict+ the dict part is the name of the parameter in a signature and it is not possible to have + in the name of a parameter. However, if you have --dict+={"key+": {"subkey": 1}} or --dict.key+={"subkey": 1}, is "key+" a key whose text is key+ or is it key with append. If it is always interpreted as append, then it becomes impossible to have keys that end with +.

To not conflict with what the type means, the behavior would be: for some parameter with name param, the argument --param+={"a": 1, "b": 2} would result in the code running param_dict.update({"a": 1, "b": 2}), where param_dict is the variable holding the previous value of param.

should I open a separate issue regarding the Optional nested dataclass

Yes, you can create a separate issue if you like. I did have this in a long list of TODOs that I have.

odedbd commented 1 year ago

Thanks for the explanation regarding th +dict functionality, makes a lot of sense. Now I that I understand why specifying + won't work for nested dictionaries, there is still one thing I would like to suggest.

For some param set with --param={"a": 1, "b": {"c": 2", "d": 3}} --param+={"b": 2} would, if I understand correctly, result in param_dict={"a":1, "b": 2}, which makes sense, the value of "b" is overwritten by the update and the nested dict is replaced by a numeric value.

If, instead, we have --param={"a": 1, "b": {"c": 2", "d": 3}} --param+={"b": {"e": 4, "d": 5}}, the result according to my understanding would be param_dict={"a":1, "b": {"e": 4, "d": 5}}.

What I would have found more useful in that case, would be for the result to be param_dict={"a":1, "b": {"c": 2", "d": 5, "e": 4}}, so b.c isn't modified, b.d is updated and b.e is appended.

It is also possible to assign a different symbol for this type of dict append, I would suggest bitwise or-like |= but the pipe might not play nice with command line pipes. Perhaps --param++= is a possibility? This would keep backward compatibility while allowing more flexibility in using "ad-hoc" arguments through dictionaries for an experimentation phase, before switching to dataclasses becomes cost effective for long-term support.

This suggestiong does not modify the definition of "a dictionary that can have any key and value", but does go beyond the native dict.update.

What do you think?

mauvilsa commented 1 year ago

What I would have found more useful in that case, would be for the result to be param_dict={"a":1, "b": {"c": 2", "d": 5, "e": 4}}, so b.c isn't modified, b.d is updated and b.e is appended.

This might be biased by your current need. Also saying in the documentation that += for dicts is the same as dict.update is easy explain and understand. A different behavior would more likely be unexpected to someone new using it if they hadn't read the documentation. Also I don't like that if there is a special behavior for nested dicts, would there be a special behavior for nested lists? And dicts nested in a nested list? etc.

It is also possible to assign a different symbol for this type of dict append, I would suggest bitwise or-like |= but the pipe might not play nice with command line pipes. Perhaps --param++= is a possibility? This would keep backward compatibility while allowing more flexibility in using "ad-hoc" arguments through dictionaries for an experimentation phase, before switching to dataclasses becomes cost effective for long-term support.

I will avoid rushing to add more syntax. Way before this, support for dataclasses nested in types will be implemented (e.g. Optional[Dataclass]) and could be argued why not use dataclasses instead. Also before adding more syntax the option to customize += could be made available in case some wants something different than dict.update.

Better support for dataclasses will be priority in any case. And I don't think the dict update should be considered as a feature just to use as an experimental phase before going for dataclasses.

odedbd commented 1 year ago

@mauvilsa thanks for your comments. I guess my preferenaces could be biased by my use case. The only reason I see why dicts could be different than lists is that a similar handling of nested lists is not possible without strict assumptions on the order of elements, since the identity of an element is only defined by its ordinal position in the list. For dicts the existence of a specific key is well defined regardless.

I do think support for nested dataclasses is a good solution, and I expect this is what we will be using. The only downside is that it does require more upfront investment in code for adding a new feature, compared to only adding a new key to the config file. In a strongly research oriented environment this may be unwanted overhead, but I guess for any other environment this is actually a plus, as the configuration is better defined.

Bottom line, I will be using dataclasses and see how well it works for our group. Thanks again for taking the time to address my comments!