Closed: @derekchiang closed this issue 8 years ago
@jdoliner 1.0 worthy?
I think so, this actually breaks a number of very common jobs... for example it breaks the fruit stand. So would be good to get this in.
@jdoliner the only solution I came up with was to have `job-shim` scan the contents of `/pfs`, and if there are no files (in the case of a reduce job) or all files are empty (in the case of a map job), then instead of running the user-provided `cmd`, we simply report that the job has succeeded and exit. But that doesn't seem particularly elegant. Thoughts?
Yup, here's how I'd do it. First we should extend `InspectCommit` to take a `FilterShard` parameter, which it'll pass to each of the sub-calls to `inspectFile` that it does. Then in `pps.APIServer` we can do the following:
```go
Shard:
	for shards := request.Shards; shards > 0; shards-- {
		for i := 0; i < shards; i++ {
			if InspectCommit(commit, FilterShard{Number: i, Modulus: shards}).Size == 0 {
				continue Shard
			}
		}
	}
```
It just occurred to me that there is a class of pipelines that might be ok with empty inputs. For instance, in my 50GB wordcount use case, I have a pipeline whose input repo simply serves as a "trigger". What I do is that I make an empty commit to the input repo, which will kickstart the pipeline, which then starts downloading the 50GB of input data from the internet and putting it into pfs.
So in general, there is a class of pipelines that are really "workers" and don't necessarily operate on any inputs. If we were to get rid of jobs with empty inputs, this class of pipelines wouldn't work.
Maybe there could be a flag in the pipeline spec that specifies whether the pipeline is ok with empty input? If the flag is set, then we always respect the `shards` flag, even if some shards might be started without any input data.
@jdoliner thoughts?
I don't think we want to be encouraging empty commits or "trigger" pipelines as a hacky way to get around users needing to manually start pipelines or have them triggered by cron.
@JoeyZwicker I thought the only way to start a pipeline was to add a commit to the input repo. What did you mean by "manually starting pipelines"? How would you trigger a pipeline by cron?
Currently this is true, but #235 exists because I don't think we want this to be the case for users asap
There are a few open discussions/issues around adding a `run-pipeline` command ("running a pipeline manually") and creating a pipeline with no input repos that instead gets run based on some timing parameter, something like cron.
@derekchiang while this is not the current behaviour, one use for an empty commit is overall provenance tracking.
Let's take a system that has some number of source repos and then fans out through the processing DAG. Now say I want to gather together all the results of a given commit to the source repos. How would that be done if the result of that commit only made it partially down the DAG? I would have to either: a) wait until a commit comes along which propagates to the end of the DAG, b) poll pachyderm periodically for the state and extract the data, or c) inject code into each pipeline step to report results.
If, however, all pipeline steps executed, with some containing no new data (or zero data), then it is easy. I simply set up a stage at the end of the pipeline which then scans over it. In this case it will trigger every time there is a new input commit.
This also starts to collapse things and make the commits more one to one.
I suspect we're going to see a lot of interesting uses of empty commits. Over time we'll migrate the common use cases into full-fledged features. It seems like we already have several of them, in summary:
As we come up with more, people should feel free to turn them into full-fledged issues. Until we feel confident that empty commits truly aren't useful anymore, we'll allow them. But long term it's probably going to make sense to push to disallow them, since it'll make issues like this a little more obvious.
For the time being I think we should offer the following semantics:
- If a `Job` has no actual data in any of its inputs, we run it with the number of shards requested.
- If a `Job` does have data, we guarantee that each container will see data from at least 1 of its inputs.

Do those semantics seem reasonable to people?
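Those semantics could be sketched in Go roughly as follows. `effectiveShards` and `sizeForShard` are names I made up; `sizeForShard` is a toy stand-in for calling `InspectCommit` with a `FilterShard{Number: i, Modulus: shards}`, assuming files are assigned to shards round-robin by index:

```go
package main

import "fmt"

// effectiveShards picks a shard count under the proposed semantics:
// with no input data at all, honor the requested count; otherwise,
// shrink the count until every shard would see some data.
func effectiveShards(fileSizes []int64, requested int) int {
	var total int64
	for _, s := range fileSizes {
		total += s
	}
	if total == 0 {
		return requested // empty job: run with the requested shards anyway
	}
Shards:
	for shards := requested; shards > 1; shards-- {
		for i := 0; i < shards; i++ {
			if sizeForShard(fileSizes, i, shards) == 0 {
				continue Shards // some shard would be empty; try fewer shards
			}
		}
		return shards
	}
	return 1
}

// sizeForShard sums the sizes of the files assigned to one shard,
// distributing files round-robin by index.
func sizeForShard(fileSizes []int64, number, modulus int) int64 {
	var size int64
	for f, s := range fileSizes {
		if f%modulus == number {
			size += s
		}
	}
	return size
}

func main() {
	fmt.Println(effectiveShards([]int64{}, 4))           // no data: 4
	fmt.Println(effectiveShards([]int64{10, 20}, 4))     // only 2 files: 2
	fmt.Println(effectiveShards([]int64{1, 2, 3, 4}, 4)) // 4
}
```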
I think this mostly makes sense, but I feel like it should be more explicit. Why don't we just have a flag in the pipeline spec that specifies whether the pipeline is ok with empty inputs?
I'm also in favor of having a `run-pipeline` command. Having a "trigger" repo does not seem particularly elegant.
Basically, I see `run-pipeline` being used for pipelines that have external effects (e.g. starting a DDoS attack) or are non-deterministic (e.g. a web scraper).
Letting users be explicit about what they want is something I'm generally in favor of, because it prevents them from inadvertently doing something they don't want to do. However, there is a downside: by making people be explicit you make it a little bit harder for them to do what they want to do. Specifically, with a flag for empty inputs I think we'd have 2 impacts:
Of those 2, the latter worries me a lot more, because in reality we're not protecting people from anything that bad. If they have a pipeline that can't handle empty commits and they make an empty commit, then it'll error, which will hopefully make it clear to the user what was going on. But the latter case seems like it could really leave someone confused and frustrated for a while.
An alternative idea:
It seems like the real worry with the first case is that commits might unexpectedly be empty. Imagine a service that's being repaired one day and didn't emit any logs. If we're worried about that case, maybe we should make users be explicit when they commit that they know the commit is empty. I.e., for an empty commit you'd get the following user experience:
```
$ pachctl finish-commit foo bar
Error: commit foo/bar is empty (add --allow-empty to commit)
$ pachctl finish-commit foo bar --allow-empty
```
Discussed offline with @jdoliner and @JoeyZwicker. Here is what we are planning to do:
- Add a `run-pipeline` command that can be used to manually trigger a pipeline. This will be necessary for running pipelines with no input repos, since they can't be triggered by commits.
When `max(blocks_per_file) < shards` in the case of a map job, or `num_files < shards` in the case of a reduce job, some shards are started without any input data at all. This results in resources being wasted, and some jobs might even fail if they do not shut down gracefully with empty data.
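Those two conditions can be illustrated with a toy predicate. `hasEmptyShards` is a hypothetical helper, assuming the blocks of each file are distributed round-robin across shards in a map job, and whole files across shards in a reduce job:

```go
package main

import "fmt"

// hasEmptyShards reports whether some of the requested shards would
// receive no input. For a reduce job, whole files are distributed
// across shards, so shards go empty when there are fewer files than
// shards. For a map job, the blocks of each file are distributed
// across shards, so shards go empty when even the largest file has
// fewer blocks than shards.
func hasEmptyShards(blocksPerFile []int, shards int, reduce bool) bool {
	if reduce {
		return len(blocksPerFile) < shards
	}
	maxBlocks := 0
	for _, b := range blocksPerFile {
		if b > maxBlocks {
			maxBlocks = b
		}
	}
	return maxBlocks < shards
}

func main() {
	// A map job over files with 3, 1, and 2 blocks:
	fmt.Println(hasEmptyShards([]int{3, 1, 2}, 4, false)) // true
	fmt.Println(hasEmptyShards([]int{3, 1, 2}, 3, false)) // false
	// A reduce job over the same 3 files:
	fmt.Println(hasEmptyShards([]int{3, 1, 2}, 4, true)) // true
	fmt.Println(hasEmptyShards([]int{3, 1, 2}, 3, true)) // false
}
```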