pachyderm / pachyderm

Data-Centric Pipelines and Data Versioning
https://www.pachyderm.com/
Apache License 2.0

Make sure that every shard starts with some input data #260

Closed derekchiang closed 8 years ago

derekchiang commented 8 years ago

When max(blocks_per_file) < shards, or num_files < shards in the case of a reduce job, some shards are started without any input data at all. This results in resources being wasted, and some jobs might even fail if they do not shut down gracefully with empty data.
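To see why shards end up empty, here is a small sketch of modulo-based shard assignment (hypothetical illustration, not Pachyderm's actual sharding code): with fewer input files than shards, some shards necessarily receive nothing.

```go
package main

import "fmt"

// countEmptyShards simulates modulo-based shard assignment: file i goes to
// shard i % shards. With fewer files than shards, some shards get no data.
// (Hypothetical sketch; not Pachyderm's actual sharding code.)
func countEmptyShards(numFiles, shards int) int {
	assigned := make([]int, shards)
	for i := 0; i < numFiles; i++ {
		assigned[i%shards]++
	}
	empty := 0
	for _, n := range assigned {
		if n == 0 {
			empty++
		}
	}
	return empty
}

func main() {
	// 3 input files spread over 8 shards: 5 shards start with no data.
	fmt.Println(countEmptyShards(3, 8))
}
```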

derekchiang commented 8 years ago

@jdoliner 1.0 worthy?

jdoliner commented 8 years ago

I think so, this actually breaks a number of very common jobs... for example it breaks the fruit stand. So would be good to get this in.

derekchiang commented 8 years ago

@jdoliner the only solution I came up with was to have job-shim scan the content of /pfs and, if there are no files (in the case of a reduce job) or if all files are empty (in the case of a map job), then instead of running the user-provided cmd, we simply report that the job has succeeded and exit. But that doesn't seem particularly elegant. Thoughts?

jdoliner commented 8 years ago

Yup, here's how I'd do it. First we should extend InspectCommit to take a FilterShard parameter, which it'll pass to each of its sub-calls to inspectFile.

Then in pps.APIServer we can do the following

    // Find the largest shard count for which every shard has input data.
    Shard:
    for shards := request.Shards; shards > 0; shards-- {
        for i := 0; i < shards; i++ {
            if InspectCommit(commit, FilterShard{Number: i, Modulus: shards}).Size == 0 {
                continue Shard // at least one empty shard; try fewer shards
            }
        }
        break // every shard has data; use this shard count
    }
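That loop as a self-contained, runnable sketch (shardSize here stubs out the InspectCommit/FilterShard call with 3 fake 100-byte files, and returning the chosen count is an assumed detail):

```go
package main

import "fmt"

// shardSize stands in for InspectCommit(commit, FilterShard{...}).Size:
// it reports how many bytes of the commit fall in shard i of modulus m.
// Here we fake it by assigning 3 files of 100 bytes each by file % m.
func shardSize(i, m int) int {
	size := 0
	for f := 0; f < 3; f++ {
		if f%m == i {
			size += 100
		}
	}
	return size
}

// chooseShards returns the largest shard count <= requested for which
// every shard sees some input data.
func chooseShards(requested int) int {
Shard:
	for shards := requested; shards > 0; shards-- {
		for i := 0; i < shards; i++ {
			if shardSize(i, shards) == 0 {
				continue Shard // an empty shard; try a smaller count
			}
		}
		return shards // every shard has data
	}
	return 1 // no input data at all; fall back to a single shard
}

func main() {
	// With only 3 files, a request for 8 shards falls back to 3.
	fmt.Println(chooseShards(8))
}
```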
derekchiang commented 8 years ago

It just occurred to me that there is a class of pipelines that might be ok with empty inputs. For instance, in my 50GB wordcount use case, I have a pipeline whose input repo simply serves as a "trigger". What I do is make an empty commit to the input repo, which kickstarts the pipeline, which then downloads the 50GB of input data from the internet and puts it into pfs.

So in general, there is a class of pipelines that are really "workers" and don't necessarily operate on any inputs. If we were to get rid of jobs with empty inputs, this class of pipelines wouldn't work.

Maybe there can be a flag in the pipeline spec that specifies whether the pipeline is ok with empty input? If the flag is set, then we always respect the shards flag even if some shards might be started without any input data.
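Such a flag might look something like this in the pipeline spec (the field name "allow_empty_input" is purely hypothetical, shown only to illustrate the idea):

```json
{
  "pipeline": {"name": "trigger-driven-downloader"},
  "transform": {"cmd": ["sh", "-c", "download-dataset /pfs/out"]},
  "shards": 4,
  "allow_empty_input": true
}
```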

@jdoliner thoughts?

JoeyZwicker commented 8 years ago

I don't think we want to be encouraging empty commits or "trigger" pipelines as a hacky way to get around users needing to manually start pipelines or have them triggered by cron.

derekchiang commented 8 years ago

@JoeyZwicker I thought the only way to start a pipeline was to add a commit to the input repo. What did you mean by "manually starting pipelines"? How would you trigger a pipeline by cron?

JoeyZwicker commented 8 years ago

Currently this is true, but #235 exists because I don't think we want this to remain the case for users; we'd like to fix that asap.

JoeyZwicker commented 8 years ago

There are a few open discussions/issues around adding a run pipeline command ("running a pipeline manually") and creating a pipeline with no input repos that instead gets run based on some timing parameter, something like cron.

JonathanFraser commented 8 years ago

@derekchiang while this is not the current behaviour, one use for an empty commit is overall provenance tracking.

Let's take a system that has some number of source repos and then fans out through a processing DAG. Now say I want to gather together all the results of a given commit to the source repos. How would that be done if the result of that commit only made it partially down the DAG? I would either have to: a) wait until a commit comes along that propagates to the end of the DAG, b) poll pachyderm periodically for the state and extract the data, or c) inject code into each pipeline step to report results.

If, however, all pipeline steps executed, even if some contained no new data (or zero data), then it is easy: I simply set up a stage at the end of the pipeline that scans over the results. It will trigger every time there is a new input commit.

This also starts to collapse things and make the commits more one to one.

jdoliner commented 8 years ago

I suspect we're going to see a lot of interesting uses of empty commits. Over time we'll migrate the common use cases into full-fledged features. It seems like we already have several of them; in summary:

jdoliner commented 8 years ago

As we come up with more, people should feel free to turn them into full-fledged issues. Until we feel confident that empty commits truly aren't useful anymore, we'll allow them. But long term it probably makes sense to push toward disallowing them, since that would make issues like this one a little more obvious.


For the time being I think we should offer the following semantics:

Do those semantics seem reasonable to people?

derekchiang commented 8 years ago

I think this mostly makes sense, but I feel like it should be more explicit. Why don't we just have a flag in the pipeline spec that specifies whether the pipeline is ok with empty inputs?

derekchiang commented 8 years ago

I'm also in favor of having a run-pipeline command. Having a "trigger" repo does not seem particularly elegant.

Basically, I see run-pipeline being used for pipelines that have external effects (e.g. launching a DDoS attack) or are non-deterministic (e.g. a web scraper).

jdoliner commented 8 years ago

Letting users be explicit about what they want is something I'm generally in favor of, because it prevents them from inadvertently doing something they don't want to do. However, there is a downside: by making people be explicit, you make it a little bit harder for them to do what they want to do. Specifically, with a flag for empty inputs, I think we'd see 2 impacts:

Of those 2, the latter worries me a lot more, because in reality we're not protecting people from anything that bad. If they have a pipeline that can't handle empty commits and they make an empty commit, it'll error, which will hopefully make it clear to the user what was going on. But the latter case seems like it could really leave someone confused and frustrated for a while.

An alternative idea:

It seems like the real worry with the first case is that commits might unexpectedly be empty. Imagine a service that's being repaired one day and didn't emit any logs. If we're worried about that case, maybe we should make users be explicit when they commit that they know the commit is empty. I.e., for an empty commit you'd get the following user experience:

$ pachctl finish-commit foo bar
Error: commit foo/bar is empty (add --allow-empty to commit)
$ pachctl finish-commit foo bar --allow-empty
derekchiang commented 8 years ago

Discussed offline with @jdoliner and @JoeyZwicker. Here is what we are planning to do:

  1. Allow pipelines to have no input repos.
  2. When a pipeline has input repos, we make sure that every parallel job sees some input data. In other words, we dynamically adjust the degree of parallelism based on how much input data there is.
  3. When a pipeline has no input repos, we always respect the degree of parallelism specified in the pipeline spec (the option is currently called "shards" but we are renaming it to "parallelism" for clarity).
  4. We add a run-pipeline command that can be used to manually trigger a pipeline. This will be necessary for running pipelines with no input repos, since they can't be triggered by commits.
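Rules 2 and 3 above can be sketched as a small function (an illustration of the planned behavior, not Pachyderm's implementation; the function and parameter names are assumptions):

```go
package main

import "fmt"

// effectiveParallelism sketches the plan above: with no input repos,
// honor the requested "parallelism" as-is (rule 3); with inputs, cap it
// by the number of input files so every parallel job sees data (rule 2).
func effectiveParallelism(requested, numInputFiles int, hasInputs bool) int {
	if !hasInputs {
		return requested
	}
	if numInputFiles < requested {
		return numInputFiles
	}
	return requested
}

func main() {
	fmt.Println(effectiveParallelism(8, 3, true))  // capped to 3
	fmt.Println(effectiveParallelism(8, 0, false)) // no inputs: keep 8
}
```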