spotify / luigi

Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
Apache License 2.0

task complete with no inputs #11

Closed jcrobak closed 11 years ago

jcrobak commented 11 years ago

The definition for Task.complete looks like:

    def complete(self):
        """
            If the task has any outputs, return true if all outputs exists.
            Otherwise, return whether or not the task has run or not
        """
        outputs = flatten(self.output())
        if len(outputs) == 0:
            # TODO: unclear if tasks without outputs should always run or never run
            warnings.warn("Task %r without outputs has no custom complete() method" % self)
            return False

        for output in outputs:
            if not output.exists():
                return False
        else:
            return True

The docstring doesn't quite match the implementation: for a task with no outputs it promises to report whether the task has run, but the code always returns False. We have several tasks that would be useful to run once per day, only if they have not yet run that day (e.g. hadoop fsck, cleanup jobs to gc old files, etc.). It might be harder to track, but what do you think about adding that to luigi, or should I save state in HDFS or something like that?

erikbern commented 11 years ago

Not sure if we have a lot of jobs with no input and output, but we definitely have a lot of jobs with no real output, just as a way of invoking other jobs. We normally write a checkpoint in HDFS or on the local file system to mark it as done.

You need some way to save the state so I think that's the easiest, but of course you could also figure out something more elaborate like using a database to track it. As you can see in the code above, the default implementation is to check all outputs, but you could also override complete and just issue a query against a db.

Not sure what's best. In principle I guess you could rely on the scheduler to remember which tasks have been executed, but right now by design any worker can override it by just scheduling again. I think this is probably a reasonable thing to do since it makes testing easier (like if you want to re-run something).
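To make the database option concrete, here is a minimal sketch of a task whose `complete()` issues a query instead of checking outputs. It deliberately does not import Luigi: `FsckTask` is only a stand-in for a `luigi.Task` subclass, and `DailyTaskLog`, the `task_runs` table, and the SQLite backend are all hypothetical names and choices made for illustration, not anything Luigi provides.

```python
import datetime
import sqlite3


class DailyTaskLog:
    """Hypothetical helper that records which (task, date) pairs have run."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS task_runs ("
            "task_name TEXT, run_date TEXT, "
            "PRIMARY KEY (task_name, run_date))"
        )

    def mark_done(self, task_name, run_date):
        # The connection context manager commits on success.
        with self.conn:
            self.conn.execute(
                "INSERT OR IGNORE INTO task_runs VALUES (?, ?)",
                (task_name, run_date.isoformat()),
            )

    def is_done(self, task_name, run_date):
        row = self.conn.execute(
            "SELECT 1 FROM task_runs WHERE task_name = ? AND run_date = ?",
            (task_name, run_date.isoformat()),
        ).fetchone()
        return row is not None


class FsckTask:
    """Stand-in for a task with no outputs (not a real luigi.Task)."""

    def __init__(self, date, log):
        self.date = date
        self.log = log

    def complete(self):
        # Overrides the default "check all outputs" behaviour with a query.
        return self.log.is_done("FsckTask", self.date)

    def run(self):
        # The real work (e.g. hadoop fsck) would go here.
        self.log.mark_done("FsckTask", self.date)
```

Because the date is part of the key, the same task counts as incomplete again the next day, which gives the "once per day, only if not yet run" behaviour without any output files.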

erikbern commented 11 years ago

But yeah the docstring is definitely broken, we can fix it. Thanks!

jcrobak commented 11 years ago

Sorry for being inexact -- you understood my question correctly, though. I was really asking about no output jobs :)

I can go down the checkpoint-in-HDFS route, unless there's an easy way to mark something as done in the server. I was actually thinking it might be useful to capture the output of the run and put it into the file that is generated, for history's sake.
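A rough sketch of that idea, assuming a local directory in place of HDFS and a plain class in place of a real `luigi.Task`: the marker file doubles as a log of the command's output. `DailyCommandTask`, `marker_dir`, and the `done-<date>.log` naming are hypothetical. In Luigi itself the marker would more naturally be returned from `output()` as a target (e.g. `luigi.LocalTarget`), so that the default `complete()` picks it up.

```python
import datetime
import os
import subprocess
import sys
import tempfile


class DailyCommandTask:
    """Stand-in for an output-less task (not a real luigi.Task)."""

    def __init__(self, date, command, marker_dir):
        self.date = date
        self.command = command        # argument list, e.g. ["hadoop", "fsck", "/"]
        self.marker_dir = marker_dir  # assumption: local directory, not HDFS

    def marker_path(self):
        name = "done-%s.log" % self.date.isoformat()
        return os.path.join(self.marker_dir, name)

    def complete(self):
        # The task is done for this date once its marker file exists.
        return os.path.exists(self.marker_path())

    def run(self):
        # Capture stdout and stderr together; check=True raises on failure,
        # so a failed command never produces a marker.
        result = subprocess.run(
            self.command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            check=True,
        )
        # Write to a temporary file and rename it into place, so a
        # half-written file is never mistaken for a finished run.
        fd, tmp_path = tempfile.mkstemp(dir=self.marker_dir)
        with os.fdopen(fd, "wb") as handle:
            handle.write(result.stdout)
        os.replace(tmp_path, self.marker_path())
```

Deleting the marker file for a given date is then enough to force a re-run, which keeps the "any worker can just schedule it again" property for testing.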

erikbern commented 11 years ago

Sounds good. I think this is one of the Luigi "patterns" that we could point out in the introduction.