pydoit / doit

CLI task management & automation tool
http://pydoit.org
MIT License

clean not able to access doit data? #288

Closed moltob closed 5 years ago

moltob commented 5 years ago

Actions are able to access doit's database, e.g. the return values of other tasks via getargs. Since such configuration affects what the action does and produces, I frequently need to provide the same data to clean as well. But it does not seem to be possible. Is it?

Thx!


schettino72 commented 5 years ago

clean won't be able to access getargs because clean does not follow the task's dependencies the way normal execution does... but I understand your problem. You just need to save the data in the task that will use it.

Can you provide a simple example? I guess this would be a nice addition to the (planned) HOW-TO section in the docs.

moltob commented 5 years ago
import doit.tools

DOIT_CONFIG = dict(
    verbosity=2,
    default_tasks=['create_object'],
)

def task_determine_identifier():
    """Determine an identifier to be used by other tasks."""

    def do_complex_stuff():
        # Do some very complex or lengthy operation, like talking to an external service.
        ...
        return dict(identifier='0815')

    return dict(
        actions=[do_complex_stuff],
    )

def task_create_object():
    """Create object with ID."""

    # We don't know the identifier here (during task creation).

    def action(ident):
        print('Created object with ID', ident)

    def clean():
        # How to get identifier here?
        print('Deleted object with ID ???')

    return dict(
        actions=[action],
        clean=[clean],
        getargs=dict(ident=('determine_identifier', 'identifier')),
    )

The second task needs values from the first task to define details of the action and also details of the corresponding clean.

schettino72 commented 5 years ago

Your example is pretty much what I expected. :smile:

Running your example, I found a bug in result_dep (used implicitly by getargs); fixed in https://github.com/pydoit/doit/commit/3873759424a5a9559d7b6c8660bd72e0091bb0fc

I thought it was possible, but it is currently not possible to support your use-case without changes to doit.

The problem is that dep_manager needs to be somehow exposed to the task's action execution. We need to define an API... uptodate functions can receive a magic values parameter into which doit injects the saved values. We could use the same solution here, or maybe something more generic where the whole dep_manager is exposed. What do you think? (Not sure you are familiar enough with doit internals.)
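
For reference, the magic values parameter mentioned above already exists for uptodate callables; this is essentially how doit.tools.run_once works. A minimal sketch of that mechanism (function and task names here are only illustrative):

def run_once_like(task, values):
    """uptodate callable: doit injects this task's previously saved values."""
    def save_executed():
        return {'was_executed': True}
    # value_savers are called after a successful run; the returned dicts are
    # merged into the values stored for this task in the doit database.
    task.value_savers.append(save_executed)
    return values.get('was_executed', False)

def task_example():
    return {
        'actions': ['echo hello'],
        'uptodate': [run_once_like],
    }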

I got it working with a quick and dirty patch...

diff --git a/doit/cmd_clean.py b/doit/cmd_clean.py
index 00d8d93..82845eb 100644
--- a/doit/cmd_clean.py
+++ b/doit/cmd_clean.py
@@ -57,6 +57,7 @@ class Clean(DoitCmdBase):
         for task in tasks:
             if task.name not in cleaned:
                 cleaned.add(task.name)
+                task.dep_manager = self.dep_manager
                 task.clean(self.outstream, dryrun)
                 if forget_tasks:
                     self.dep_manager.remove(task.name)

DOIT_CONFIG = dict(
    verbosity=2,
    default_tasks=['create_object'],
)

def task_determine_identifier():
    """Determine an identifier to be used by other tasks."""

    def do_complex_stuff():
        # Do some very complex or lengthy operation, like talking to an external service.
        ...
        return dict(identifier='0815')

    return dict(
        actions=[do_complex_stuff],
    )

def task_create_object():
    """Create object with ID."""

    # We don't know the identifier here (during task creation).

    def action(ident):
        print('Created object with ID', ident)
        # save value here
        return {'ident': ident}

    def clean(task):
        # How to get identifier here?

        # FIXME: this line requires a patch on `clean` command
        #        to expose dep_manager
        values = task.dep_manager.get_values(task.name)

        # Solution: get value saved
        print('Deleted object with ID {}'.format(values.get('ident')))

        # For getargs, you would not need to save the value again
        dep_result = values.get('_result:determine_identifier', {})
        print('Deleted object with ID {}'.format(dep_result.get('identifier')))

    return dict(
        actions=[action],
        getargs=dict(ident=('determine_identifier', 'identifier')),
        clean=[clean],
    )
moltob commented 5 years ago

I do like the approach where the dep_manager is available to a task. Do you think it would make sense to do this not only for clean but also for actions, just to be a bit less clean-specific? Every now and then I run into situations where I'd like to get some data from the doit database in actions as well, and having access to it would be great.

What is kind of dangerous is to expose a mutable version of it, as users might start messing around and update doit state data on their own. There are various ways to prevent that (pass a copy, pass a frozen object, ...).

If you don't like adding the dep_manager to the task, an alternative API could be to pass it into the action/clean call as an optional argument, just like task or dependencies.
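
For context, doit already injects a few magic parameters into python-actions by matching parameter names (task, dependencies, targets, changed); the alternative above would extend that mechanism. A minimal sketch of the existing behavior (task and file names are only illustrative):

def task_show_context():
    def action(task, dependencies, targets):
        # doit fills these in by parameter name; a 'values' or 'dep_manager'
        # parameter would be a new addition to this mechanism.
        print(task.name, dependencies, targets)
    return {
        'actions': [action],
        'file_dep': ['input.txt'],
        'targets': ['output.txt'],
    }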

schettino72 commented 5 years ago

Thinking over the subject... I agree that making dep_manager generally available would be a good idea. Like it or not, the truth is that it is a global, so let's treat it like that. I would suggest adding doit.get_dep_manager(). That would be available everywhere... even when loading tasks, and would make it possible to save data that is not specific to tasks...

I don't care about it being dangerous; following Python philosophy, I hope you don't shoot yourself in the foot.

But while giving the user the power to do advanced stuff anywhere, I would like to keep doit's (mostly) no-API design. Giving clean a magic values parameter, the same as is available on uptodate(), would solve your original use-case and would be simpler for the user (no need to learn the DepManager API).

Actually, I guess having a magic values parameter for all actions would solve another problem: a task can use computed values from another task, but not values saved by another action of its own task.

So in conclusion I propose 2 changes:

1) adding a global get_dep_manager()
2) supporting a magic values parameter to actions (including clean)

Would be great if you could work on a patch for any of them. Both changes should be straightforward to implement.
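
For illustration, a minimal sketch of how the first proposed change might be used, e.g. inside a task creator (hypothetical at this point in the thread; get_dep_manager() is assumed to simply return the Dependency instance registered by the running command):

import doit

def task_create_object():
    # Read a value computed by another task while this task is being created.
    dep_manager = doit.get_dep_manager()
    saved = dep_manager.get_values('determine_identifier') if dep_manager else {}
    ident = saved.get('identifier', 'unknown')

    return dict(
        actions=['echo Created object with ID {}'.format(ident)],
    )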

moltob commented 5 years ago

In fact, your first suggestion is even more attractive, as I also have a use case where I already need to know some previously computed values during task generation.

I'll try to suggest a PR for that over the next couple of days.

moltob commented 5 years ago

I looked at the source and providing a global doit.get_dep_manager seems to require some refactoring to prevent code duplication, so I want to run this by @schettino72 to get some guidance first.

The idea would be to lazily call code like this from the new free function:

https://github.com/pydoit/doit/blob/0f7c219845f6b1ee36393d24ea613754dca4d1a5/doit/cmd_base.py#L470-L471

But here I need access to various elements to be able to call the class:

The methods above then require more configuration parameters and infrastructure elements like the plugin dictionary (does it?).

It looks like much of the construction logic required to instantiate the Dependency class is currently woven into the command classes. So I am wondering about your preferred way of reusing that logic. I see some options:

The cleanest option (to me) seems to be the second one, but it also has the biggest impact and the highest risk of things going wrong with this change. The first approach looks OK and prevents code duplication as well.

Please let me know how you would like it done.

schettino72 commented 5 years ago

Thanks. Always good to discuss design first...

When I mention the dependency manager being a global, I mean that it would not be instantiated anywhere other than on Command execution. So the implementation is just like a registry...

Quick hack:

--- a/doit/__init__.py
+++ b/doit/__init__.py
@@ -40,4 +41,8 @@ def get_initial_workdir():
     """working-directory from where the doit command was invoked on shell"""
     return loader.initial_workdir

+
+GLOBALS = {}
+def get_dep_manager():
+    return GLOBALS.get('dep_manager')
+
--- a/doit/cmd_base.py
+++ b/doit/cmd_base.py
@@ -469,6 +469,8 @@ class DoitCmdBase(Command):
             # dep_manager might have been already set (used on unit-test)
             self.dep_manager = Dependency(
                 db_class, params['dep_file'], checker_cls)
+        import doit
+        doit.GLOBALS['dep_manager'] = self.dep_manager

This would be enough for your original use-case. But the DepManager would not be available when executing task creators.
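
With that hack applied, a clean function could already reach the manager through the module-level registry instead of a task attribute (a sketch reusing GLOBALS and get_dep_manager() from the patch above):

import doit

def clean_object(task):
    dep_manager = doit.get_dep_manager()
    if dep_manager is not None:
        # same lookup as in the earlier patched example
        values = dep_manager.get_values(task.name)
        print('Deleted object with ID', values.get('ident'))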

As we discussed, it would be nice if this were also available when loading tasks. This will require changes in TaskLoader, as it currently has a single function load_tasks() that returns both config values and task instances. We need to modify it so that it first gets only the config values. Then in DoitCmdBase we would create an instance of DepManager, and finally actually load the tasks...

So my suggestion of what exactly needs to be done is:

Hope it is clear.

moltob commented 5 years ago

Sorry for not responding for a while, I was offline due to German vacation time. 😃

I think I understood your suggestion. Essentially you want to split these lines:

https://github.com/pydoit/doit/blob/0f7c219845f6b1ee36393d24ea613754dca4d1a5/doit/cmd_base.py#L449-L450

  1. First get the dodo_config.
  2. Update params dict from it.
  3. Instantiate dep_manager.
  4. Store it in a new (correct?) global registry dictionary. This is a new doit concept.
  5. Load the real tasks, now that dep_manager is available in the registry.
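
A minimal sketch of that sequence as a standalone helper (names anticipate the split discussed in this thread; this is not actual doit code):

import doit
from doit.dependency import Dependency

def execute_with_split_loading(loader, params, pos_args, cmd, db_class, checker_cls):
    """Illustrative helper mirroring the five steps above."""
    config = loader.load_doit_config(params)        # 1. get the dodo config first
    params.update(config)                           # 2. update the params dict from it (simplified)
    dep_manager = Dependency(                       # 3. instantiate dep_manager
        db_class, params['dep_file'], checker_cls)
    doit.GLOBALS['dep_manager'] = dep_manager       # 4. store it in the new global registry
    return loader.load_tasks(cmd, params, pos_args) # 5. load the real tasks, registry now populated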

Please let me know in particular about adding the registry concept to doit. I'll do it in a clean way without circular dependencies, but I just want to make sure you agree with this new way of storing data for user tasks.

moltob commented 5 years ago

Looking into the split, I have yet another question. TaskLoader.load_tasks() might get some cleanup on the way.

https://github.com/pydoit/doit/blob/0f7c219845f6b1ee36393d24ea613754dca4d1a5/doit/cmd_base.py#L301-L310

Since the TaskLoader._load_from() method loads tasks and configuration, the resulting call sequence has slightly confusing method names: SomeConcreteTaskLoader.load_tasks() -> TaskLoader._load_from() -> loader.load_tasks(), loader.load_doit_config().

TaskLoader._load_from() must do a switch on the type of namespace, even though that namespace is passed in by the derived classes, which already know precisely what it is. IMHO the code is not taking full advantage of the class hierarchy.

I suggest an initial refactoring like this:

This will pave the way for the updated call sequence in the command base class. If you agree, I'd do the refactoring and test adaptation first, in a separate PR.

Please let me know.

moltob commented 5 years ago

Looking further at the docs and sample code, it seems that you might want to keep the original signature:

class SomeLoader(TaskLoader):
    def load_tasks(...) -> Tuple[List[Task], Dict[str, Any]]:
        ...

If that is the case, the change is much smaller. I'll stand by for your comments now.

moltob commented 5 years ago

Final comment: I think your suggestion to split config and task loading is not quite consistent with keeping the TaskLoader.load_tasks() API in place with the aforementioned return values. I believe we can't have both...

schettino72 commented 5 years ago

Open-source is volunteer work, no worries about time.

registry

Store it in a new (correct?) global registry dictionary. This is a new doit concept.

Just to be clear: I mentioned "registry" as a concept... There is no need to create a class for that; a plain dict can do the job, as in my quick patch.

Actually, doit_cmd.get_var() has a similar implementation. Note the special care for Windows multi-processing.

TaskLoader API

old

new API


To keep backward compatibility:

if hasattr(self.loader, 'load_doit_config'):
   # new api
   members = self.loader.get_members()
   config = self.loader.load_doit_config()
   ... # other stuff
   dep_manager = Dependency()
   tasks = self.loader.load_tasks()
else:
   ... # like current code
   # of course in this backward-compatible path dep_manager won't be available
   # while loading tasks. I just don't want to break code when someone upgrades doit

Maybe we could be more explicit about how to trigger the old/new API, especially because load_tasks is present in both but with different signatures... Or maybe use a new name instead of load_tasks, e.g. load_tasks_only()? You can decide what you think works better.

moltob commented 5 years ago

Ok, thanks for the feedback. Since you are open to smaller API changes, how do you like the following design?

class TaskLoader:
    """Interface of task loaders."""

    cmd_options = ()

    def __init__(self):
        # list of command names, used to detect clash of task names and commands
        self.cmd_names = []
        self.config = None  # reference to config object taken from Command

    def load_doit_config(self, opt_values):
        """Load doit configuration."""
        raise NotImplementedError()

    def load_tasks(self, cmd, opt_values, pos_args):
        """Load tasks.

        :return: (List[Task])
        :param cmd: (doit.cmd_base.Command) current command being executed
        :param opt_values: (dict) with values for cmd_options
        :param pos_args: (list str) positional arguments from command line
        """
        raise NotImplementedError()

class NamespaceLoader(TaskLoader):
    """Load tasks from namespace, mapping identifiers to items.

    ``namespace`` must be given at creation time or set before ``load_tasks`` or
    ``load_doit_config`` are called.
    """

    def __init__(self, namespace=None):
        super().__init__()
        self.namespace = namespace

    def load_doit_config(self, opt_values):
        return loader.load_doit_config(self.namespace)

    def load_tasks(self, cmd, opt_values, pos_args):
        return loader.load_tasks(self.namespace, self.cmd_names, cmd.execute_tasks)

class ModuleLoader(NamespaceLoader):
    """Load tasks from a Python module."""

    def __init__(self, module):
        super().__init__(dict(inspect.getmembers(module)))

class DodoFileLoader(NamespaceLoader):
    """Delayed loading of tasks from a dodo file."""

    def _ensure_namespace(self, opt_values):
        """Lazily load namespace from dodo file per config parameters."""

        if self.namespace is None:
            module = loader.get_module(
                opt_values['dodoFile'],
                opt_values['cwdPath'],
                opt_values['seek_file'],
            )
            self.namespace = dict(inspect.getmembers(module))

    def load_doit_config(self, opt_values):
        self._ensure_namespace(opt_values)
        return super().load_doit_config(opt_values)

    def load_tasks(self, cmd, opt_values, pos_args):
        self._ensure_namespace(opt_values)
        return super().load_tasks(cmd, opt_values, pos_args)

The fact that opt_values is an argument to the API functions is not super-nice, as the dodo-file-specific implementation leaks into the abstraction here. However, I am not sure it is worth spending the extra effort on that right now.

Haven't run this yet, so consider it pseudo-code, but would you agree to this design?

schettino72 commented 5 years ago

The fact that opt_values is an argument to the API functions is not super-nice, as here the dodo-file specific implementation is leaking into the abstraction.

opt_values is not specific to dodo-file. Any TaskLoader (implemented as a plugin) can take extra parameters from command-line arguments...

By the way, pos_args is there because you could potentially implement a more efficient loading mechanism, but I never got to implement that.

Looking back, maybe I would pass opt_values in __init__ or some other initialization method. But I guess it is not really worth changing now.


I guess NamespaceTaskLoader is not necessary, as its methods are one-liners. It just adds one more layer for no benefit.

Can you rename the new TaskLoader implementation to BaseTaskLoader and keep the old one for backward compatibility?

schettino72 commented 5 years ago

I guess I was not clear enough. If opt_values could be passed to some kind of initialization method, then NamespaceTaskLoader would make sense, but right now its use is too limited. If you want to add a third method for initialization, I would be OK with that.

Changing __init__ would be harder, as it is called by the plugin system...

moltob commented 5 years ago

Regarding the necessity of NamespaceTaskLoader: where do you suggest integrating its code? Folding it into TaskLoader makes assumptions about TaskLoaders that, for instance, the ones used in the tests do not fulfill, since they are not at all namespace-based. If I were to implement a task loader backed by a database, I might not want the namespace code and members cluttering my class.

But of course, it's possible and no big flaw. Let me know if you want to do it that way.

schettino72 commented 5 years ago

I guess this is an implementation detail. You can make the call...

moltob commented 5 years ago

My point was that the reason to distinguish between TaskLoader and NamespaceLoader already shows up in your own test code: not all task loaders are namespace-based.

But this is probably not worth struggling over. I'll collapse it and get rid of NamespaceTaskLoader.

Thanks for the comment regarding plugins: this means any base class meant to be meaningfully reused by user code must not add additional arguments to __init__. I was slightly misled by your existing specializations, which do have additional arguments like mod_dict. But of course those are rather internal implementations, which are not meant to be used as base classes for further plugin-based loaders. In Java-speak they would be final...

moltob commented 5 years ago

How about this:

TaskLoader not shown, it's your original class.

class TaskLoader2(TaskLoader):
    """Interface of task loaders with new-style API.

    The default implementation assumes tasks are loaded from a namespace, mapping identifiers to
    elements like functions (like task generators) or constants (like configuration values).

    This API update separates the loading of the configuration and the loading of the actual tasks,
    which enables additional elements to be available during task creation.
    """

    def __init__(self):
        super().__init__()

        # namespace to be searched for loadable objects, must be set by derived classes before the
        # load methods are called:
        self.namespace = None

    def load_doit_config(self, opt_values):
        """Load doit configuration."""
        return loader.load_doit_config(self.namespace)

    def load_tasks(self, cmd, opt_values, pos_args):
        """Load tasks.

        :return: (List[Task])
        :param cmd: (doit.cmd_base.Command) current command being executed
        :param opt_values: (dict) with values for cmd_options
        :param pos_args: (list str) positional arguments from command line
        """
        return loader.load_tasks(self.namespace, self.cmd_names, cmd.execute_tasks)

class ModuleTaskLoader(TaskLoader2):
    """load tasks from a module/dictionary containing task generators
    Usage: `ModuleTaskLoader(my_module)` or `ModuleTaskLoader(globals())`
    """
    def __init__(self, mod_dict):
        super(ModuleTaskLoader, self).__init__()
        if inspect.ismodule(mod_dict):
            self.namespace = dict(inspect.getmembers(mod_dict))
        else:
            self.namespace = mod_dict

class DodoTaskLoader(TaskLoader2):
    """default task-loader create tasks from a dodo.py file"""
    cmd_options = (opt_dodo, opt_cwd, opt_seek_file)

    def _ensure_namespace(self, opt_values):
        """Lazily load namespace from dodo file per config parameters."""

        if self.namespace is None:
            module = loader.get_module(
                opt_values['dodoFile'],
                opt_values['cwdPath'],
                opt_values['seek_file'],
            )
            self.namespace = dict(inspect.getmembers(module))

    def load_doit_config(self, opt_values):
        self._ensure_namespace(opt_values)
        return super().load_doit_config(opt_values)

    def load_tasks(self, cmd, opt_values, pos_args):
        self._ensure_namespace(opt_values)
        return super().load_tasks(cmd, opt_values, pos_args)

Sorry for querying your feedback so frequently, but since this is not a straightforward bugfix, I don't want to create surprises by adding personally preferred patterns that might not fit well. Once we agree on this, the implementation is almost trivial (to write and to review).

schettino72 commented 5 years ago

Adding an extra setup() method. What do you think?


class TaskLoader2():
    """Interface of task loaders with new-style API.

    This API update separates the loading of the configuration and the loading of the actual tasks,
    which enables additional elements to be available during task creation.
    """

    API = 2

    def setup(self, opt_values):
        pass

    def load_doit_config(self):
        """Load doit configuration."""

    def load_tasks(self, cmd, pos_args):
        """Load tasks.

        :return: (List[Task])
        :param cmd: (doit.cmd_base.Command) current command being executed
        :param pos_args: (list str) positional arguments from command line
        """

class NamespaceTaskLoader(TaskLoader2):
    """
    The default implementation assumes tasks are loaded from a namespace, mapping identifiers to
    elements like functions (like task generators) or constants (like configuration values).
    """

    def setup(self, opt_values):
        self.namespace = None
        raise NotImplementedError()

    def load_doit_config(self):
        """Load doit configuration."""
        return loader.load_doit_config(self.namespace)

    def load_tasks(self, cmd, pos_args):
        """Load tasks.

        :return: (List[Task])
        :param cmd: (doit.cmd_base.Command) current command being executed
        :param pos_args: (list str) positional arguments from command line
        """
        return loader.load_tasks(self.namespace, self.cmd_names, cmd.execute_tasks)

class ModuleTaskLoader(NamespaceTaskLoader):
    """load tasks from a module/dictionary containing task generators
    Usage: `ModuleTaskLoader(my_module)` or `ModuleTaskLoader(globals())`
    """
    def __init__(self, mod_dict):
        super(ModuleTaskLoader, self).__init__()
        self.mod_dict = mod_dict

    def setup(self, opt_values):
        if inspect.ismodule(self.mod_dict):
            self.namespace = dict(inspect.getmembers(self.mod_dict))
        else:
            self.namespace = self.mod_dict

class DodoTaskLoader(NamespaceTaskLoader):
    """default task-loader create tasks from a dodo.py file"""
    cmd_options = (opt_dodo, opt_cwd, opt_seek_file)

    def setup(self, opt_values):
        """Load namespace from dodo file per config parameters."""
        module = loader.get_module(
            opt_values['dodoFile'],
            opt_values['cwdPath'],
            opt_values['seek_file'],
        )
        self.namespace = dict(inspect.getmembers(module))
moltob commented 5 years ago

Works for me. I would not add the API attribute though. I can simply test for isinstance(loader, TaskLoader2). This way I don't need to add API = 1 to the original class or check hasattr first.
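
A sketch of the dispatch moltob describes, combining the pieces discussed above (illustrative only; assumes the TaskLoader2 class from the previous comment is importable):

def load_with_either_api(loader, params, pos_args, cmd):
    """Illustrative helper: dispatch between the old and new loader APIs."""
    if isinstance(loader, TaskLoader2):
        # new API: config first, so dep_manager can exist before tasks are created
        loader.setup(params)
        config = loader.load_doit_config()
        tasks = loader.load_tasks(cmd, pos_args)
    else:
        # old API: a single call returns tasks and config together
        tasks, config = loader.load_tasks(cmd, params, pos_args)
    return tasks, config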

moltob commented 5 years ago

I still have more suggested changes to your last version, but I think we're OK now with the general design, so I'll turn to code now. :-)