Closed: jcrobak closed this issue 11 years ago
Not sure if we have a lot of jobs with no input and output, but we definitely have a lot of jobs with no real output, just as a way of invoking other jobs. We normally write a checkpoint in HDFS or on the local file system to mark it as done.
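A minimal sketch of the checkpoint idea, written in plain Python rather than against Luigi's own classes so that it runs on its own. The class `MarkerTask`, the helper `run_if_needed`, and the marker file naming are all hypothetical; the only thing taken from the thread is the idea that a task counts as done once its marker exists. Keying the marker on a date is one assumed way to get "once per day" behaviour.

```python
import datetime
import os
import tempfile


class MarkerTask:
    """Hypothetical stand-in for a task whose only output is a marker file."""

    def __init__(self, name, date, marker_dir):
        self.name = name
        self.date = date
        self.marker_dir = marker_dir

    def output(self):
        # One marker per task name and day, so the task is due again tomorrow.
        filename = f"{self.name}-{self.date.isoformat()}.done"
        return os.path.join(self.marker_dir, filename)

    def complete(self):
        # Done when the marker file exists.
        return os.path.exists(self.output())

    def run(self):
        # The real work (e.g. invoking other jobs) would go here.
        with open(self.output(), "w") as f:
            f.write("done\n")


def run_if_needed(task):
    """Run the task unless its marker already exists; return True if it ran."""
    if task.complete():
        return False
    task.run()
    return True


marker_dir = tempfile.mkdtemp()
today = datetime.date(2013, 1, 15)  # arbitrary example date
task = MarkerTask("fsck", today, marker_dir)
first = run_if_needed(task)    # runs and writes the marker
second = run_if_needed(task)   # skipped, marker already present
tomorrow = MarkerTask("fsck", today + datetime.timedelta(days=1), marker_dir)
```

In a real Luigi task the marker would be returned from `output()` as a target (local or HDFS), and the scheduler would do the job of `run_if_needed`.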
You need some way to save the state so I think that's the easiest, but of course you could also figure out something more elaborate like using a database to track it. As you can see in the code above, the default implementation is to check all outputs, but you could also override complete and just issue a query against a db.
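The database variant could look roughly like the sketch below. It uses an in-memory SQLite database purely for illustration; the class `DbTrackedTask`, the table `task_runs`, and its columns are made-up names, and this is plain Python rather than a Luigi subclass. The point is only that `complete` asks the database instead of checking output files.

```python
import sqlite3


class DbTrackedTask:
    """Hypothetical task whose completion is recorded in a database table."""

    def __init__(self, conn, name, day):
        self.conn = conn
        self.name = name
        self.day = day

    def complete(self):
        # Instead of checking output files, query the tracking table.
        row = self.conn.execute(
            "SELECT 1 FROM task_runs WHERE name = ? AND day = ?",
            (self.name, self.day),
        ).fetchone()
        return row is not None

    def run(self):
        # Real work goes here; afterwards record that it finished.
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO task_runs (name, day) VALUES (?, ?)",
                (self.name, self.day),
            )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_runs (name TEXT, day TEXT, PRIMARY KEY (name, day))"
)
task = DbTrackedTask(conn, "cleanup", "2013-01-15")  # arbitrary example values
before = task.complete()
task.run()
after = task.complete()
```

The trade-off compared with a marker file is an extra dependency on the database being reachable whenever completeness is checked.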
Not sure what's best. In principle I guess you could rely on the scheduler to remember which tasks have been executed, but right now by design any worker can override it by just scheduling again. I think this is probably a reasonable thing to do since it makes testing easier (like if you want to re-run something).
But yeah the docstring is definitely broken, we can fix it. Thanks!
Sorry for being inexact -- you understood my question correctly, though. I was really asking about no output jobs :)
I can go down the checkpoint-in-HDFS route, unless there's an easy way to mark something as done in the server. I was actually thinking it might be useful to capture the output of the run and put it into the file that is generated, for history's sake.
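One way the "capture the output into the checkpoint file" idea might be sketched, using only the standard library and a local file in place of HDFS. The helper `run_and_checkpoint` is hypothetical, and the command here is a stand-in that just prints a line; a real job would call something like the fsck command instead.

```python
import os
import subprocess
import sys
import tempfile


def run_and_checkpoint(cmd, marker_path):
    """Run cmd unless marker_path exists; on success store its stdout there."""
    if os.path.exists(marker_path):
        return False  # already done, nothing to run
    # check=True raises if the command fails, so no marker is written then.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    # Write under a temporary name first, then rename, so a crash
    # cannot leave a half-written marker that looks like success.
    tmp_path = marker_path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(result.stdout)
    os.replace(tmp_path, marker_path)
    return True


marker = os.path.join(tempfile.mkdtemp(), "fsck.done")
# Stand-in command: a child Python process that prints one line.
cmd = [sys.executable, "-c", "print('filesystem healthy')"]
ran_first = run_and_checkpoint(cmd, marker)
ran_second = run_and_checkpoint(cmd, marker)
with open(marker) as f:
    captured = f.read()
```

The marker then doubles as a small history record: its existence says the job ran, and its contents say what the job reported.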
Sounds good. I think this is one of the Luigi "patterns" that we could point out in the introduction.
The definition for Task.complete looks like:
The docstring doesn't quite match the implementation. We have several tasks that would be useful to run once per day, only if they have not yet run that day (e.g. hadoop fsck, cleanup jobs to gc old files, etc). It might be harder to track, but what do you think about adding that to luigi? Or should I save state in HDFS or something like that?