hoedt / upsilonconf

A simple configuration library
MIT License
5 stars 1 forks source link

Support for lists in `to_dict` #7

Closed salotz closed 1 year ago

salotz commented 1 year ago

Currently from_dict works to recursively go down through lists to get sub FrozenConfiguration objects. By example:

>>> conf = FrozenConfiguration.from_dict({"things": [    {"a": 0, "b": 1},     {"a": 2, "b": 3}, ]})
FrozenConfiguration(things=(FrozenConfiguration(a=0, b=1), FrozenConfiguration(a=2, b=3)))

>>> conf.to_dict()
{'things': (FrozenConfiguration(a=0, b=1), FrozenConfiguration(a=2, b=3))}

And then I cannot serialize this to e.g. YAML simply.

hoedt commented 1 year ago

FrozenConfiguration.from_dict converts dicts to FrozenConfiguration objects because it is the easiest way to get immutable mappings. If you are just interested in conveniently writing these to YAML, you can just use upsilonconf.save_config with a filename that ends with ".yaml" or ".yml".

This being said, the main problem is that upsilonconf does not really support sequences of configurations at this point. The main reason is because I want to reserve them for representing configuration grids (e.g. for hyper-parameter search). However, if you have a specific need for lists of configs, this is something I am open for and can look into. If this is the case, I would appreciate it if you could start a new issue and explain your use case for sequences of sub-configs.

PS: To be precise, this is a particular side-effect of FrozenConfiguration.from_dict. If you would use PlainConfiguration.from_dict, you'd find that the dicts in your list are not recursively converted (because a list of dict is not considered part of the configuration currently).

salotz commented 1 year ago

I see, kind of. In my use case they would be YAML files written by hand.

The main reason is because I want to reserve them for representing configuration grids (e.g. for hyper-parameter search).

IMO that is a feature that should be built on a lower level general purpose config system (which I thought was the goal of upsilonconf and one reason I'm not using e.g. hydra on top of OmegaConf). And I can use another hyperparameter matrix tool for that like ml_collections. E.g. a special schema of a config file that that can specify matrices of parameters that can then be used to generate specific full configs.

salotz commented 1 year ago

So that said I'm not sure its another issue, except that it competes with the other goals you've mentioned here.

hoedt commented 1 year ago

my use case they would be YAML files written by hand.

Can you be a bit more specific about what you want to achieve by using lists? After all, even if the config is written by hand, it would be possible to just add a key for each list item (e.g. item0, item1, ...). Do you wish to represent configuration options (similar to what we need for hyper-parameter searches), is it something like *args vs **kwargs or is it something entirely different?

E.g. a special schema of a config file that can specify matrices of parameters

It seems that you are interested in using lists to represent different options for configurations. Is this correct?

that is a feature that should be built on a lower level general purpose config system

You are probably right. Even in the case of hyper-parameter search, the representation of configuration options should probably happen in this library.

I'm not sure it's another issue

You are right (again). I completely missed the title and thought this was about saving configuration objects (rather than full support for list values).


Given my current understanding of your issue, I think the following tweaks should solve the problem:

Is this something you could work with?

salotz commented 1 year ago

I should clarify as I realized I seemed to be contradictory in my comments. What I am doing is round-tripping the configuration. A user might write the below config but I want to output at runtime a copy of the final resolved configuration which may contain defaults, application version, etc. That is where I run into this issue.

After all, even if the config is written by hand, it would be possible to just add a key for each list item (e.g. item0, item1, ...).

This is what I am doing currently. Basically in the case where you wouldn't be able to give these sections keyed names, or they are inherently anonymous things. For an example you are parametrizing something that needs a list of things but doesn't need names for them:

rocks:
  - size: 10
    weight: 30
  - size: 3
    weight: 5

I don't care what the name of the rocks are but I do need to allow for the user to specify an arbitrary number of them.

I would expect behavior from to_dict to output the following (or another method -- e.g. destructure -- if you want to keep this one within a particular use case although I think it would be unnecessarily confusing):

{"rocks" : [{"size" : 10, "weight" : 30}, {"size" : 3, "weight" : 5}]}

So currently the workaround for my particular use case isn't so bad to name each "rock" but I think its something that should definitely be supported as it seems like an unnecessary constraint on the ergonomics of want to provide to my users.


Given my current understanding of your issue, I think the following tweaks should solve the problem:

Pardon me if I'm misinterpreting or misjudging the purpose of the library, but I think you are overfitting the concepts in the solution to that of parameter searches still. My list items are not "options" they are just lists of things. An "option" could be an interpretation, but ultimately higher level. I'm thinking of something much simpler that is just converting to and from raw "dicts and lists" to "Configuration" objects where the "Configuration" objects have support for merging (the access syntax is nice too I guess). This is all I really care about. I'm doing the YAML parsing/writing myself (for subtle issues like handling scientific notation properly) and I'll be handling the structuring of the values to real Python types (e.g. https://catt.rs/en/stable/index.html).

That said the hyper-parameter framework (like hydra) is something I am interested in. I've used hydra before and my only real complaint is the ANTLR dependency, as I often have to build everything from source and having a JDK in the mix makes that really annoying, especially for what should ultimately be a very simple pure-python library. I'd be happy to use something that replaces it built on upsilon, but not what I'm interested in here. I see the Upsilon/Omegaconf layer as a very nice unbloated unix-y tool that can be used in so many other places!

I also would be willing to contribute as time allows and if anything becomes a real blocker. Especially if the vision in the docs is maintained as I would probably end up writing my own version of it as there is currently nothing filling the niche.

Hopefully that isn't too much unsolicited advice

hoedt commented 1 year ago

Thank you for the additional information. That does make things a little clearer.

Just to make sure I understand you correctly, are the following statements correct?

Also, would you be able to answer the following questions:

This final question probably requires some background information. Configuration objects are designed to conveniently represent kwargs of python functions. After all, kwargs are typically used to "configure" functions in python. Given your example of the rocks config, I would assume that there is some function of the form def func(x, y, z, rocks) and this function can be called using func(x, y, z, rocks=config.rocks) or func(x, y, z, **config). Is the goal to call some sub_func with the different configurations in the list or is the body of func agnostic of the kind of elements that are passed through the rocks kwarg?

Sorry for the interrogative tone, but I just want to make sure that I understand you and your use-case correctly.

PS: I just realised that round-trips on mutable objects inside of an immutable configuration object would require to keep track of the original (mutable) types in the configuration. I am not sure if that kind of overhead is reasonable. What do you think?

salotz commented 1 year ago

Sorry for the interrogative tone, but I just want to make sure that I understand you and your use-case correctly.

No worries! Happy to answer any questions.

Just to make sure I understand you correctly, are the following statements correct? the items in the list don't represent configuration options, you don't need any special operations on these list values, you do need an immutable configuration object. you do need round-trip functionality for from_dict, to_dict.

Yes to all.

should the dicts inside the list be represented as configuration objects?

Yes

how do you plan to use these items in your code?

A simple but illustrative example:

rocks = RockArray([Rock(**rock) for rock in config.rocks])

In the large and for converting to structured objects I would use a system like cattrs (linked above), which kind of leads into the background info.

This final question probably requires some background information. Configuration objects are designed to conveniently represent kwargs of python functions. After all, kwargs are typically used to "configure" functions in python.

You also have the *args you can pass as well.

Given your example of the rocks config, I would assume that there is some function of the form def func(x, y, z, rocks) and this function can be called using func(x, y, z, rocks=config.rocks) or func(x, y, z, **config). Is the goal to call some sub_func with the different configurations in the list or is the body of func agnostic of the kind of elements that are passed through the rocks kwarg?

I wouldn't do this directly. There would be two possibilities:

  1. The one I showed above where you are doing some structuring of the config into classes (in my case attrs "dataclasses"). You can do this manually (as above) or use some kind of system that recursively does the type casting for you (cattrs in my case).
  2. Functions that just accept more dynamic "dicts and lists".

For instance:

rocks = config.to_dict()["rocks"]

# would look like this, not list of Configurations
# [{"size" : 10, "weight" : 30}, {"size" : 3, "weight" : 5}]

def dig_rock(size, weight):
    ...

def dig_rocks(rocks)

    for rock in rocks:
        dig_rock(**rock)

dig_rocks(rocks)

I think as your design goals states I also think its a bad idea to have the Configuration objects passed into anything other than application entrypoint code. Unless you are at a leaf of the configuration that has no sub-structure, in which case its fine to func(**config["rocks"][0]).

I think to summarize is that I think it should kind of be left up to the user of the library to figure out how to do the slicing/structuring of the data in the configuration object. I'm picking up that you were kind of intending to use dict and tuple unpacking as kind of an all-purpose structuring tool, however I don't think that works at scale since it doesn't work recursively itself. That is why I use a tool like cattrs which is specially built to handle this difficult issue.

For instance consider I want to change my weight attributes to unitted values that needs to be specially parsed:

rocks:
  - size: 10
    weight: 30 kg
  - size: 3
    weight: 5 kg

I would need to change to something like:

rocks = [{"size" : rock["size"], "weight" : parse_quantity(rock["weight"]) for rock in config.to_dict()["rocks"]

And this complexity just keeps on scaling linearly with every new config option you add.

My only point is that cleanly handling deeply nested serialized structures like configs is probably something beyond the scope of (in my mind) a simple config merging library. The only general thing you can do is to just recursively convert to lists and dicts and if the user's use case is suitable for tuple/dict unpacking that should flow naturally from that.


PS: I just realised that round-trips on mutable objects inside of an immutable configuration object would require to keep track of the original (mutable) types in the configuration. I am not sure if that kind of overhead is reasonable. What do you think?

Personally I would just rip out anything mutable at all from the library. Its an antipattern and as you are mentioning it makes useful functional things impossible. I plan on never using anything other than FrozenConfiguration.

So I would just consider input "lists" (or tuples) as snapshotted at the point of config creation and forget the original thing existed ever.

salotz commented 1 year ago

An additional example with using cattrs for structuring and my ideal setup:

from attrs import define

@define
class Rock:
    size: int
    weight: int

@define
class RockArray:
    rocks: list[Rock]

rocks = CONVERTER.structure(RockArray, config.to_dict())

Where CONVERTER is a potentially customized cattrs converter, which can handle things like parsing special fields like in the units example above.

And then for destructuring you can do to plain lists and dicts:

rock_d = CONVERTER.destructure(rocks)

Or to things like serialization formats where you may want to handle format specific things like binary encodings etc.:

rock_json = JSON_CONVERTER.destructure(rocks)

Not trying to sell you on it necessarily but I think it might be the missing piece in the conceptual puzzle of what the boundaries of the config library can and should be, as it relates to this discussion.

hoedt commented 1 year ago

I think I understand what you want to achieve, but I still don't quite understand how this is related to configuration or why you would need a configuration class.

The use case you illustrate seems to be more about how to convert user input to structured data (which seems to be a problem that is addressed by cattrs) than about the configuration of a program. I still fail to see why it could be a good idea to have a list of configurations as a configuration value unless they are just different sub-configuration options.

Given your example, I would expect something like

with open("config.json") as fp:
    rocks = CONVERTER.structure(RockArray, json.load(fp))

to do exactly what you want. I do not see the benefit of using an intermediary configuration class in this setting.

I would just rip out anything mutable at all from the library

Making collections in python immutable comes at a certain cost: all elements in the collection must be immutable as well. If you would create the frozen configuration from immutable objects, round-trips are no problem, but since you create the frozen configuration from mutable objects, the objects need to be converted and the round-trip becomes impossible (without significant overhead).

salotz commented 1 year ago

I think I understand what you want to achieve, but I still don't quite understand how this is related to configuration or why you would need a configuration class.

I see your point. The main thing I use the configuration class for is the merging operations.

So:

with open("config.json") as fp:
    config = FrozenConfiguration.from_dict(json.load(fp))

final_config = config | defaults

rocks = CONVERTER.structure(RockArray, final_config.to_dict())
salotz commented 1 year ago

Making collections in python immutable comes at a certain cost: all elements in the collection must be immutable as well. If you would create the frozen configuration from immutable objects, round-trips are no problem, but since you create the frozen configuration from mutable objects, the objects need to be converted and the round-trip becomes impossible (without significant overhead).

I'm fine with it being structurally the same. I.e. if you input a tuple from the config and output a list that is fine with me, but I can see why you wouldn't want to make that choice.

hoedt commented 1 year ago

Since Python 3.9, you can merge dictionaries in python natively (PEP-584). If that is not an option for you, I would advise you to use PlainConfiguration, because this will not interfere with the incoming types and therefore natively enables round-trips.

I hope this anticlimactic answer works for you.

PS: feel free to fork this project to build an input parser if that is what you need, but I'm afraid that this is not what I had in mind when building upsilonconf.

salotz commented 1 year ago

Seems obvious in hindsight, as I hadn't actually used this in practice as of yet...

I would like to have richer operations than just what is in the PEP though, which I was thinking of adding additional feature requests. For instance merge([config_a, config_b, config_c]) and subtree merge.

Fair enough though thanks for the conversation and sorry for taking so long to get there.