populse / capsul

Collaborative Analysis Platform : Simple, Unifying, Lean
Other
7 stars 14 forks source link

Process "definition" meaning and doc #212

Closed denisri closed 2 years ago

denisri commented 2 years ago

After several discussions I'm sorry, it's still not completely clear for me what the Process definition init parameter is, should be, and why it is mandatory. I think at least this parameter should be documented in a clear way. After recent changes it's not possible any longer to instantiate a pipeline using its class:

pipeline = capsul.executable(MyPipeline)

as it is used in many test cases, which was convenient in tests, in some local uses of pipelines, or to get the definition by the pipeline module/class automatically. The code explicitly throws an exception:

        if 'definition' not in kwargs:
            raise TypeError('No definition string given to Pipeline constructor')

I wonder why the pipeline module/class is not enough to get its "definition" automatically in most of the cases. Indeed in Process we don't have this limitation, as there is a default definition:

        if definition is None:
            defn = []
            if self.__class__.__module__ != '__main__':
                defn.append(self.__class__.__module__)
            defn.append(self.__class__.__name__)
            definition = '.'.join(defn)

This is not consistent: should we remove the default definition in Process, or allow it in Pipeline ? I vaguely understand that this definition is used for serialization in client/server execution contexts. But in local execution we may not need it, and for "regular" pipelines defined in available modules, and serialization could/should be separated from instantiation ? Given than in most of the cases the definition (tell me if I'm wrong) is just giving the module/class, should it really be always explicit ?

To understand, when should this "definition" be different from the module/class ? When a custom pipeline is serialized for remote execution, on a server which does not have its module/class definition ? But then it's the client/server communication which should serialize the exisiting pipeline instance as a json dict (which can be done automatically from the pipeline instance), but why should we pass this definition when instantiating the pipeline on client side first ?

To summarize: do we have to change all tests and uses of pipelines, or should we allow instantiating a pipeline by its class ?

sapetnioc commented 2 years ago

A Process cannot always guess what its definition is, this is why I made it mandatory in constructor. For instance, a process class defined in /tmp/my_process.py has the file name as definition, not the module and class name. To my opinion, definition should be mandatory in all constructors but could be made optional in executable(). If only a class is given, it would build definition based on class module and name before instantiation. But it should also check that module.class correspond to the given class and raise an error if not. Otherwise, the process could never be executed (for instance a Process class defined in a function). This is because there is always a process serialization (using definition and parameters serialization) in new API, even in local implementation since execution is done in a different Python than the client one.

denisri commented 2 years ago

A Process cannot always guess what its definition is, this is why I made it mandatory in constructor. For instance, a process class defined in /tmp/my_process.py has the file name as definition, not the module and class name

Yes, but this situation is the "less frequent" one, and I think we should make things as simple as possible for users/programmers, thus maybe making the definition optional for "regular" situations, and maybe doing something different for locally defined proceses. On the other hand, a Pipeline instance can always describe itself: either it is defined in a regular module/class, or it can "save" itself as a json dict/file, thus I guess the definition is never required for a pipeline. But I agree that we should mark pipelines that are "custom" and which should serialize as json rather than module/class.

To my opinion, definition should be mandatory in all constructors but could be made optional in executable()

OK then but so we have to modify executalbe() because right now it doesn't allow it.

But it should also check that module.class correspond to the given class and raise an error if not.

I agree with that. We could even perform this check in Process/Pipeline constructors, thus drop the mandatory state of the definition in constructors, but I'm OK if at least executable() works without it.

even in local implementation since execution is done in a different Python than the client one.

Yes but in this local situation, even a /tmp/my_process.py file would be enough, only a process built internally would not. But I'm OK not to allow this specific case, no problem.

Who's doing what, next ?

sapetnioc commented 2 years ago

I had in mind that instantiation of processes and pipelines would never been done by use (only via a method such as executable or custom_pipeline). Therefore I did not see a problem in making definition mandatory for constructors. But I may be wrong.

There is already a CustomPipeline class that is used for pipelines that are not a class in a Python code (dynamically created pipelines and pipelines read from JSON). There is a special custom_pipeline definition value that makes the serialization of the pipeline save its whole structure in JSON.

sapetnioc commented 2 years ago

I forgot to answer the last question. These are not big modification. I am working on iterations but I can stop to make it if necessary.

denisri commented 2 years ago

OK so let's make it optional in executable(). I can do that.

denisri commented 2 years ago

No don't stop, it's OK I'm on it.