populse / capsul

Collaborative Analysis Platform : Simple, Unifying, Lean
Other
7 stars 14 forks source link

Rethink Capsul configuration #145

Closed sapetnioc closed 2 years ago

sapetnioc commented 5 years ago

Project configuration must take into account several things that make it necessary to completely rethink Capsul configuration. One important feature of the future new system is the ability to define requirements in processes (a requirement such as "I need SPM in version 12"). Another feature that we would like to be possible (at least in the future) would be to make it possible to configure several versions of a software (e.g. SPM 8 and 12) and to allow pipelines to combine processes using different versions. This will become possible in v3.0 because each process will run in a separate job (with potentially a different configuration). But the configuration system and the process dependencies system must be designed specifically to allow such features.

sapetnioc commented 2 years ago

I did not noticed that this issue was closed.

denisri commented 2 years ago

A few thoughts / proposals for the configuration in Capsul v2 / v3:

In a general way we need to handle:

In the current implementation (v2):

A dict / json representation would be more handy, but using "just" a dictionary would be "too free", not allowing to define "official" variables, with types and documentation, and forbid mistakes such as setting a wrong key in the dict. The Controller API seems quite appropriate for it: it allows to declare, type, and document variables, it can be converted from / to dicts and thus saved in json format, has a generic GUI which allows to edit all parameters (even if dedicated GUIs would sometimes look better for end users but this is not forbidden). The only problem is the levels where several configs can exist: here the keys are not known in advance and should be open. But we can use the OpenKeyController at this level. In addition we may use the notification mechanisms to react when a config parameter is set or changed.

Thus the manipulation API would look like this:

settings.global.spm.spm12_standalone.executable = '/usr/local/spm12-standalone'
print(settings.export_to_dict())
>>> {'global': {'capsul.engine.module.spm': {
             'spm12_standalone': {'config_id': 'spm12_standalone',
                                  'config_environment': 'global',
                                  'directory': '/usr/local/spm12-standalone'}}}
del settings.spm.spm12_standalone
print(settings.export_to_dict())
>>> {"global': {'capsul.engine.module.spm': {}}}

This looks more similar to what did exist in the former StudyConfig, but with the missing levels (environment, several configurations...). In the example above, both global (environment) and spm12_standalone (config ID in the SPM module) are free variable names.

We may make this API work both in capsul v2 and v3.

sapetnioc commented 2 years ago

I totally agree that controllers are very good candidates for internal implementation. But we must define the whole configuration ecosystem, not only the configuration itself. For instance decide how and where modules are written, how to list available modules, how to combine several configurations, how to verify the validity of a configuration, etc. To list all the topic we must address, we must keep in mind the different potential user profiles of the configuration system. I am thinking of 4 profiles :

sapetnioc commented 2 years ago

Using the YAQL selection language can work with controllers transformed in dict. In the following code, I copied the data and query from the first example found in YAQL doc.

This may be an option to express a kind of selection of configuration. But this is not yet clear to me how to do it.

# -*- coding: utf-8 -*-

import yaql

from soma.controller import Controller
from soma.undefined import undefined

# Without dict iheritance, YAQL raises an error
class DictController(Controller, dict):
    def get(self, value, default=None):
        return getattr(self, value, default)

    def __getitem__(self, name):
        value = getattr(self, name, undefined)
        if value is undefined:
            raise KeyError(name)
        return value

    def items(self):
        for field in self.fields():
            yield (field.name, getattr(self, field.name))

    def values(self):
        for field in self.fields():
            yield getattr(self, field.name)

    def keys(self):
        for field in self.fields():
            yield field.name

    def __contains__(self, key):
        return self.field(key) is not None

    def __len__(self):
        return len(self.fields())

    def __iter__(self):
        for field in self.fields():
            yield field.name

class CustomerCity(DictController):
    city: str
    customer_id: int

class Order(DictController):
    order_id: int
    item: str
    quantity: int

class Customer(DictController):
    customer_id: int
    name: str
    orders: list[Order]

class DB(DictController):
    customers_city: list[CustomerCity]
    customers: list[Customer]

json_db = {'customers_city': [{'city': 'New York', 'customer_id': 1},
  {'city': 'Saint Louis', 'customer_id': 2},
  {'city': 'Mountain View', 'customer_id': 3}],
 'customers': [{'customer_id': 1,
   'name': 'John',
   'orders': [{'order_id': 1, 'item': 'Guitar', 'quantity': 1}]},
  {'customer_id': 2,
   'name': 'Paul',
   'orders': [{'order_id': 2, 'item': 'Banjo', 'quantity': 2},
    {'order_id': 3, 'item': 'Piano', 'quantity': 1}]},
  {'customer_id': 3,
   'name': 'Diana',
   'orders': [{'order_id': 4, 'item': 'Drums', 'quantity': 1}]}]}

db = DB()
db.import_json(json_db)

engine = yaql.factory.YaqlFactory().create()

expression = engine(
    '$.customers.orders.selectMany($.where($.order_id = 4))')

order = expression.evaluate(data=db)
print(order)
denisri commented 2 years ago

In the model we discussed offline, we would base the configuration on Contoller, and dict[str, Controller]. However we face a validation problem: pydantic typechecking is too weak, and basically inoperant for dicts (or lists):

from soma.controller import Controller
c = Controller()
c.add_field('toto', dict[str, Controller], default_factory=dict)
c.toto[12] = 'babar'

does not complain in any manner, whereas neither the dict key nor the value have the required types. Pydantric only validates direct affectations, when the full dictionary is set:

c.toto = {12: 'babar'}

actually raises a ValidationError. But in our multiple level configuration, we never assign the full dict at once. Thus we have no validation at all, and we can set any garbage in the config.

denisri commented 2 years ago

Otherwise it's possible to use OpenKeyController instead of dicts, which offers a stronger validation.

sapetnioc commented 2 years ago

I have a question about configuration API. First, let's assess I understand the current API. Configuration is a hierarchy:

Now I need to store configuration for datasets. This should be a kind of dict[str, DatasetConfig] where DatasetConfig would contain, at least, a directory name and a organization schema name (i.e. 'bids-1.1' or 'brainvisa-5.0'). A dataset configuration is tied to a computing resource (validity of the path) but cannot have several version such as modules. Is EngineConfiguration the right place to put this configutation ? If yes, could it be implemented by simply adding a field dataset of type OpenKeyController[DatasetConfig] ?

denisri commented 2 years ago

Yes, that's it. I this context, wouldn't it be just a matter of implementing DatasetConfig as a config module (ModuleConfiguration subclass) ? I think this would just match the requirements you mention: the EngineConfiguration, tied to a computing resource, would have a field dataset, which would automatically be exactly an OpenKeyController[DatasetConfig].

You can now look "visually" at the config structure using this code (in ipython --gui=qt):

from capsul.qt_gui.widgets import config_gui
from soma.qt_gui.controller import ControllerWidget
from capsul.config import ApplicationConfiguration

c = ApplicationConfiguration('blop')
cw = ControllerWidget(c)
cw.show()
sapetnioc commented 2 years ago

Yes, you are right. I missed this possibility because it put in the same bag what seemed to me to be different concepts:

But this is only a matter of internal documentation, user documentation and GUI. What is important is to fix the configuration access API and using a dataset configuration module is the easiest way to provide the right API.

denisri commented 2 years ago

Yes sure, but we have not considered separating different aspects of configuration, and I'm not sure it is worth the complication: even for users, they would need to look at several places to enter their configuration, if we do, thus it is simpler as it is done now. Moreover there are likely more than 2 "concepts": software, datasets, but also possibly computing resource configuration, completion, etc. Right now I don't feel a strong need to separate them, but if we do we will have to add a config level, or categories.