Rethink Capsul configuration

sapetnioc commented 5 years ago

Project configuration has to account for several things that make it necessary to completely rethink Capsul configuration. One important feature of the future system is the ability to define requirements in processes (for example "I need SPM in version 12"). Another feature we would like to support, at least in the future, is configuring several versions of a software (e.g. SPM 8 and 12) and allowing pipelines to combine processes that use different versions. This will become possible in v3.0 because each process will run in a separate job (with a potentially different configuration). But the configuration system and the process dependency system must be designed specifically to allow such features.

sapetnioc commented 2 years ago

I did not notice that this issue was closed.

denisri commented 2 years ago

A few thoughts / proposals for the configuration in Capsul v2 / v3:

In a general way we need to handle:

"environments", in other words computing resources (local or remote). We can consider using a default environment, named "global", when not specified, and have shortcuts / dedicated GUI for it.
extensible modules, for different software such as SPM, FSL etc, or different aspects (completion, databasing...)
several configurations for a given module may coexist: for instance several versions of a software (SPM12/SPM8) may be configured
a requirements / selection mechanism that selects the right set of configs to run a given process
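
The requirements / selection idea in the last two points could be sketched as follows. This is only an illustration in plain Python: the data layout (`{module: {config_id: {field: value}}}`), the function `select_configs` and the `version` field are assumptions made for the example, not part of Capsul.

```python
# Hypothetical layout: {module: {config_id: {field: value}}}.
# None of these names come from Capsul; they only illustrate the idea.
configs = {
    "spm": {
        "spm8": {"version": "8", "directory": "/opt/spm8"},
        "spm12_standalone": {"version": "12", "directory": "/opt/spm12"},
    },
    "fsl": {
        "fsl6": {"version": "6.0", "directory": "/opt/fsl"},
    },
}


def select_configs(configs, requirements):
    """Return {module: (config_id, config)} satisfying every requirement.

    `requirements` maps a module name to the field values a process needs,
    e.g. {"spm": {"version": "12"}}. An empty dict accepts any config.
    """
    selected = {}
    for module, wanted in requirements.items():
        candidates = [
            (config_id, config)
            for config_id, config in configs.get(module, {}).items()
            if all(config.get(key) == value for key, value in wanted.items())
        ]
        if not candidates:
            raise LookupError(f"no {module} configuration matches {wanted}")
        # Several matches are possible; this sketch simply keeps the first.
        selected[module] = candidates[0]
    return selected


# A process that declares "I need SPM in version 12".
chosen = select_configs(configs, {"spm": {"version": "12"}})
print(chosen["spm"][0])
```

Because selection happens per process, two processes of the same pipeline can end up with different SPM configurations, which is the multi-version scenario described in the first comment.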

In the current implementation (v2):

in Capsul v2, CapsulEngine / Settings rely on a database (populse_db) but apart from storage considerations, this is an "implementation detail" and the database API is not really exposed to the user / programmer.
in v2, the settings API is a bit too complex: 1. we need to use with statements every time we read or set config options, which takes several lines of code for each use. 2. a settings config may or may not already exist; this has to be checked on each access, and the config is accessed differently in each case:
```
engine = capsul_engine()
with engine.settings() as session:
  config = session.config('spm', 'global', 'spm12-standalone')
  if config is None:
      session.new_config('spm', 'global', {'config_id': 'spm12-standalone',
           'directory': something})
  else:
      config.directory = something
```
6 lines of code just to set one parameter.

A dict / json representation would be handier, but "just" a dictionary would be "too free": it would not allow defining "official" variables with types and documentation, nor prevent mistakes such as setting a wrong key in the dict. The Controller API seems quite appropriate for this: it allows declaring, typing, and documenting variables, it can be converted from / to dicts and thus saved in json format, and it has a generic GUI which allows editing all parameters (dedicated GUIs would sometimes look better for end users, but this does not forbid them). The only problem is at the levels where several configs can exist: there the keys are not known in advance and must stay open. But we can use the OpenKeyController at this level. In addition we may use the notification mechanisms to react when a config parameter is set or changed.
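
To make the contrast with a free dictionary concrete, here is a minimal sketch using only the standard library. `SPMConfig` and `OpenKeys` are hypothetical stand-ins for a Controller subclass and for OpenKeyController; they do not reproduce the soma API, only the behaviour described above (declared and typed fields, free keys at one level, conversion to a dict).

```python
from dataclasses import asdict, dataclass, field, fields


@dataclass
class SPMConfig:
    """Declared, typed, documented fields for one SPM configuration."""
    directory: str = field(default="", metadata={"doc": "SPM install directory"})
    version: str = field(default="", metadata={"doc": "SPM version, e.g. '12'"})
    standalone: bool = field(default=False, metadata={"doc": "compiled SPM"})

    def __setattr__(self, name, value):
        # Reject keys that were never declared, and values of the wrong type.
        declared = {f.name: f.type for f in fields(self)}
        if name not in declared:
            raise AttributeError(f"unknown config field: {name}")
        if not isinstance(value, declared[name]):
            raise TypeError(f"{name} must be {declared[name].__name__}")
        super().__setattr__(name, value)


class OpenKeys:
    """Level whose keys are free but whose values share one declared type."""

    def __init__(self, value_type):
        object.__setattr__(self, "_value_type", value_type)
        object.__setattr__(self, "_items", {})

    def __setattr__(self, name, value):
        if not isinstance(value, self._value_type):
            raise TypeError(f"{name} must be a {self._value_type.__name__}")
        self._items[name] = value

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        try:
            return self._items[name]
        except KeyError:
            raise AttributeError(name) from None

    def export_to_dict(self):
        return {name: asdict(value) for name, value in self._items.items()}


spm = OpenKeys(SPMConfig)
spm.spm12_standalone = SPMConfig(directory="/usr/local/spm12-standalone",
                                 version="12", standalone=True)
spm.spm12_standalone.directory = "/opt/spm12"
print(spm.export_to_dict())
```

In this sketch a misspelled field such as `spm.spm12_standalone.directry = "..."` raises an `AttributeError`, whereas the same typo in a plain dict would be stored silently.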

Thus the manipulation API would look like this:

```
settings.global.spm.spm12_standalone.directory = '/usr/local/spm12-standalone'
print(settings.export_to_dict())
# {'global': {'capsul.engine.module.spm': {
#              'spm12_standalone': {'config_id': 'spm12_standalone',
#                                   'config_environment': 'global',
#                                   'directory': '/usr/local/spm12-standalone'}}}}
del settings.global.spm.spm12_standalone
print(settings.export_to_dict())
# {'global': {'capsul.engine.module.spm': {}}}
```

This is closer to what existed in the former StudyConfig, with the levels that were missing there added (environment, several configurations...). In the example above, both global (environment) and spm12_standalone (config ID in the SPM module) are free variable names.
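
A rough sketch of how such attribute-style access could behave, assuming that intermediate levels are created on first access (which removes the "does the config exist yet?" test of the v2 API). The `Node` class is hypothetical and much simpler than a Controller. One detail worth noting: `global` is a reserved word in Python, so an expression like `settings.global` does not parse; the sketch reaches that level through `getattr()`.

```python
class Node:
    """Settings level whose children are created on first attribute access."""

    def __getattr__(self, name):
        # Only called when `name` is not set yet: create the level on demand.
        if name.startswith("_"):
            raise AttributeError(name)
        child = Node()
        object.__setattr__(self, name, child)
        return child

    def export_to_dict(self):
        exported = {}
        for name, value in vars(self).items():
            if isinstance(value, Node):
                exported[name] = value.export_to_dict()
            else:
                exported[name] = value
        return exported


settings = Node()
# `global` is a Python keyword, hence getattr() instead of settings.global.
env = getattr(settings, "global")
env.spm.spm12_standalone.directory = "/usr/local/spm12-standalone"
print(settings.export_to_dict())

# Removing a configuration is a plain attribute deletion.
del env.spm.spm12_standalone
print(settings.export_to_dict())
```

A side effect of on-demand creation is that a misspelled name creates an empty level instead of failing, which is one more argument for declared fields at the levels where the keys are known.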

We may make this API work both in capsul v2 and v3.

sapetnioc commented 2 years ago

I totally agree that controllers are very good candidates for the internal implementation. But we must define the whole configuration ecosystem, not only the configuration itself. For instance we must decide how and where modules are written, how to list available modules, how to combine several configurations, how to verify the validity of a configuration, etc. To list all the topics we must address, we must keep in mind the different potential user profiles of the configuration system. I am thinking of 4 profiles:

The end user, who will mainly (only?) interact with the configuration through a GUI
The configuration GUI developer, who must have access to global information (e.g. the list of available engines or modules)
The application developer, who will need a bit of customization of the configuration. For instance mia and catidb could use the same system. They may share some configurations (SPM for instance) but will also need a specific part.
The software administrator, who installs and maintains applications for users and needs to be able to configure the software for many users.

sapetnioc commented 2 years ago

The YAQL selection language can work with controllers converted to dicts. In the following code, I copied the data and query from the first example found in the YAQL doc.

This may be an option to express a kind of configuration selection. But it is not yet clear to me how to do it.

```
# -*- coding: utf-8 -*-

import yaql

from soma.controller import Controller
from soma.undefined import undefined

# Without dict inheritance, YAQL raises an error
class DictController(Controller, dict):
    def get(self, value, default=None):
        return getattr(self, value, default)

    def __getitem__(self, name):
        value = getattr(self, name, undefined)
        if value is undefined:
            raise KeyError(name)
        return value

    def items(self):
        for field in self.fields():
            yield (field.name, getattr(self, field.name))

    def values(self):
        for field in self.fields():
            yield getattr(self, field.name)

    def keys(self):
        for field in self.fields():
            yield field.name

    def __contains__(self, key):
        return self.field(key) is not None

    def __len__(self):
        return len(self.fields())

    def __iter__(self):
        for field in self.fields():
            yield field.name

class CustomerCity(DictController):
    city: str
    customer_id: int

class Order(DictController):
    order_id: int
    item: str
    quantity: int

class Customer(DictController):
    customer_id: int
    name: str
    orders: list[Order]

class DB(DictController):
    customers_city: list[CustomerCity]
    customers: list[Customer]

json_db = {'customers_city': [{'city': 'New York', 'customer_id': 1},
  {'city': 'Saint Louis', 'customer_id': 2},
  {'city': 'Mountain View', 'customer_id': 3}],
 'customers': [{'customer_id': 1,
   'name': 'John',
   'orders': [{'order_id': 1, 'item': 'Guitar', 'quantity': 1}]},
  {'customer_id': 2,
   'name': 'Paul',
   'orders': [{'order_id': 2, 'item': 'Banjo', 'quantity': 2},
    {'order_id': 3, 'item': 'Piano', 'quantity': 1}]},
  {'customer_id': 3,
   'name': 'Diana',
   'orders': [{'order_id': 4, 'item': 'Drums', 'quantity': 1}]}]}

db = DB()
db.import_json(json_db)

engine = yaql.factory.YaqlFactory().create()

expression = engine(
    '$.customers.orders.selectMany($.where($.order_id = 4))')

order = expression.evaluate(data=db)
print(order)
```
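
For readers unfamiliar with YAQL, the query above appears to walk every customer's orders, keep the orders whose order_id is 4, and flatten the result into a single list. A plain-Python comparison on the raw JSON data (without controllers and without YAQL, with a shortened copy of the data) would be:

```python
# Shortened copy of the JSON data used above, kept as plain dicts and lists.
json_db = {
    "customers": [
        {"customer_id": 1, "name": "John",
         "orders": [{"order_id": 1, "item": "Guitar", "quantity": 1}]},
        {"customer_id": 2, "name": "Paul",
         "orders": [{"order_id": 2, "item": "Banjo", "quantity": 2},
                    {"order_id": 3, "item": "Piano", "quantity": 1}]},
        {"customer_id": 3, "name": "Diana",
         "orders": [{"order_id": 4, "item": "Drums", "quantity": 1}]},
    ]
}

# Flatten all orders of all customers, keeping only order 4.
orders = [
    order
    for customer in json_db["customers"]
    for order in customer["orders"]
    if order["order_id"] == 4
]
print(orders)
```

The comprehension is fixed in code, whereas a YAQL expression is a string that could itself be stored in a configuration or a process requirement, which seems to be the attraction here.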

denisri commented 2 years ago

In the model we discussed offline, we would base the configuration on Controller, and dict[str, Controller]. However we face a validation problem: pydantic type checking is too weak, and basically ineffective for dicts (or lists):

```
from soma.controller import Controller
c = Controller()
c.add_field('toto', dict[str, Controller], default_factory=dict)
c.toto[12] = 'babar'
```

does not complain in any manner, whereas neither the dict key nor the value has the required type. Pydantic only validates direct assignments, when the full dictionary is set:

```
c.toto = {12: 'babar'}
```

actually raises a ValidationError. But in our multiple level configuration, we never assign the full dict at once. Thus we have no validation at all, and we can set any garbage in the config.

denisri commented 2 years ago

Otherwise it's possible to use OpenKeyController instead of dicts, which offers a stronger validation.
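
To illustrate what validation at item-assignment time means, here is a small standard-library sketch. `CheckedDict` and `ModuleConfig` are hypothetical names; this is not how OpenKeyController is implemented, only an example of a container that checks each key and value as it is set rather than when the whole dict is assigned.

```python
from collections import UserDict


class CheckedDict(UserDict):
    """Mapping that validates key and value types on every item assignment."""

    def __init__(self, key_type, value_type):
        self.key_type = key_type
        self.value_type = value_type
        super().__init__()

    def __setitem__(self, key, value):
        if not isinstance(key, self.key_type):
            raise TypeError(f"key {key!r} is not a {self.key_type.__name__}")
        if not isinstance(value, self.value_type):
            raise TypeError(f"value {value!r} is not a {self.value_type.__name__}")
        super().__setitem__(key, value)


class ModuleConfig:
    """Stand-in for a Controller subclass."""


toto = CheckedDict(str, ModuleConfig)
toto["spm12"] = ModuleConfig()  # accepted: str key, ModuleConfig value

try:
    toto[12] = "babar"  # wrong key type and wrong value type
except TypeError as error:
    print("rejected:", error)
```

`UserDict` is used instead of subclassing `dict` directly because its `update()` goes through `__setitem__`, so bulk updates are checked too; a direct `dict` subclass would let `update()` bypass the check.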

sapetnioc commented 2 years ago

I have a question about the configuration API. First, let me check that I understand the current API. Configuration is a hierarchy:

Level zero, represented by ConfigurationLayer class, corresponds to the storage location of the configuration (user, application and site). These configurations are merged before use, therefore this level may be ignored by user.
First level (EngineConfiguration class) corresponds to a specific computing resource. To date, all fields of an EngineConfiguration must correspond to a module name (e.g. spm, fsl, etc.)
Second level (EngineConfiguration attributes) can contain several configurations for a module (via OpenKeyController class). Each configuration must be given a name to access it internally.
Third level is an instance of a ModuleConfiguration derived class specific to the corresponding module. This third level makes it possible to configure several versions of the same module (e.g. spm 8 and spm 12)
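
The merge of level-zero layers could look like the sketch below, written on plain dicts. The precedence order (site, then application, then user, the last one winning) is an assumption made for the example; the thread does not state it, and `merge_layers` is a hypothetical helper.

```python
import copy


def merge_layers(*layers):
    """Deep-merge nested dicts; values from later layers override earlier ones."""
    merged = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                # Both sides are nested levels: merge them recursively.
                merged[key] = merge_layers(merged[key], value)
            else:
                # Copy so that the result never shares state with its inputs.
                merged[key] = copy.deepcopy(value)
    return merged


# Layout: {engine: {module: {config_name: {field: value}}}}
site = {"global": {"spm": {"spm12": {"directory": "/opt/spm12", "version": "12"}}}}
application = {"global": {"fsl": {"fsl6": {"directory": "/opt/fsl"}}}}
user = {"global": {"spm": {"spm12": {"directory": "/home/me/spm12"}}}}

# Assumed precedence: site < application < user.
merged = merge_layers(site, application, user)
print(merged["global"]["spm"]["spm12"])
```

With this order a user can override a single field (here the SPM directory) while still inheriting the other fields and modules defined at site or application level.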

Now I need to store configuration for datasets. This should be a kind of dict[str, DatasetConfig] where DatasetConfig would contain, at least, a directory name and an organization schema name (e.g. 'bids-1.1' or 'brainvisa-5.0'). A dataset configuration is tied to a computing resource (validity of the path) but, unlike modules, cannot have several versions. Is EngineConfiguration the right place to put this configuration? If yes, could it be implemented by simply adding a field dataset of type OpenKeyController[DatasetConfig]?

denisri commented 2 years ago

Yes, that's it. In this context, wouldn't it be just a matter of implementing DatasetConfig as a config module (ModuleConfiguration subclass)? I think this would match the requirements you mention: the EngineConfiguration, tied to a computing resource, would have a field dataset, which would automatically be exactly an OpenKeyController[DatasetConfig].
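
As a rough picture of that layout, with standard-library stand-ins only: the classes below reuse the names from the discussion but are not the Capsul classes, and the field names `path` and `metadata_schema` are hypothetical.

```python
from dataclasses import asdict, dataclass


@dataclass
class ModuleConfiguration:
    """Stand-in for the base class of all module configurations."""


@dataclass
class SPMConfig(ModuleConfiguration):
    directory: str = ""
    version: str = ""


@dataclass
class DatasetConfig(ModuleConfiguration):
    path: str = ""             # hypothetical field: dataset directory
    metadata_schema: str = ""  # hypothetical field: e.g. 'bids-1.1'


# Stand-in for one EngineConfiguration: {module: {config_name: config}}.
# The "dataset" module sits next to "spm" and is accessed the same way.
engine = {
    "spm": {
        "spm12": SPMConfig(directory="/opt/spm12", version="12"),
    },
    "dataset": {
        "input": DatasetConfig(path="/data/in", metadata_schema="bids-1.1"),
        "output": DatasetConfig(path="/data/out", metadata_schema="brainvisa-5.0"),
    },
}


def export_to_dict(engine):
    """Convert every named configuration of every module to a plain dict."""
    return {
        module: {name: asdict(config) for name, config in named.items()}
        for module, named in engine.items()
    }


print(export_to_dict(engine)["dataset"]["input"])
```

The open keys of the dataset level ("input", "output") then play the role of dataset names rather than software versions, which is the conceptual difference raised in the next comment.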

You can now look "visually" at the config structure using this code (in ipython --gui=qt):

```
from capsul.qt_gui.widgets import config_gui
from soma.qt_gui.controller import ControllerWidget
from capsul.config import ApplicationConfiguration

c = ApplicationConfiguration('blop')
cw = ControllerWidget(c)
cw.show()
```

sapetnioc commented 2 years ago

Yes, you are right. I missed this possibility because it puts in the same bag what seemed to me to be different concepts:

The configuration of various releases of a software
The configuration of various directories

But this is only a matter of internal documentation, user documentation and GUI. What is important is to fix the configuration access API and using a dataset configuration module is the easiest way to provide the right API.

denisri commented 2 years ago

Yes sure, but we have not considered separating different aspects of configuration, and I'm not sure it is worth the complication: if we did, users would have to look in several places to enter their configuration, so it is simpler the way it is done now. Moreover there are likely more than 2 "concepts": software, datasets, but also possibly computing resource configuration, completion, etc. Right now I don't feel a strong need to separate them, but if we do, we will have to add a config level, or categories.

populse / capsul

Rethink Capsul configuration #145