tap-git / tap

Apache License 2.0

Preposterous startup time #2

Open psutter opened 12 years ago

psutter commented 12 years ago

When running a job with 60,000 input files stored in S3, it can take two hours to check for file existence in generatePlan before the job even begins.

Solutions:

  1. Run the existence checks on multiple threads, perhaps max(50, number of files / 30)
  2. Add a flag to bypass this check entirely
  3. Automatically skip this check if there are >30 files, since a list that large was almost certainly generated elsewhere.

private PipePlan generatePlan(List<PhaseError> errors) {
    ...
    while (!toGenerate.isEmpty()) {
        Pipe file = toGenerate.iterator().next();
        toGenerate.remove(file);
        // One round-trip to S3 per file -- this is the bottleneck.
        boolean exists = file.exists(baseConf);
        if (exists && !file.isObsolete(baseConf)
                && (!forceRebuild || file.getProducer() == null)) {
            if (!generated.contains(file)) {
                System.out.println("File: " + file.getPath()
                        + " exists and is up to date.");
                // already up to date
                generated.add(file);
                plan.fileCreateWith(file, null);
            }
        }
        ...
    }
}

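A minimal sketch of proposal 1, fanning the existence checks out over a thread pool. The names here are illustrative: `existing` is a hypothetical helper, the thread count is arbitrary, and the local-filesystem `Files.exists` stands in for tap's `file.exists(baseConf)` call against S3, which is what would actually be submitted to the pool.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;
import java.util.concurrent.*;

public class ParallelExistenceCheck {
    // Check many paths concurrently instead of one at a time.
    // Each check is independent, so total latency drops from
    // (N * round-trip) to roughly (N / threads * round-trip).
    static Set<String> existing(List<String> paths, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, Future<Boolean>> futures = new LinkedHashMap<>();
            for (String p : paths) {
                // In tap this callable would be pipe.exists(baseConf).
                futures.put(p, pool.submit(() -> Files.exists(Paths.get(p))));
            }
            Set<String> found = new LinkedHashSet<>();
            for (Map.Entry<String, Future<Boolean>> e : futures.entrySet()) {
                if (e.getValue().get()) {
                    found.add(e.getKey());
                }
            }
            return found;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> paths = Arrays.asList("/tmp", "/no/such/path");
        System.out.println(existing(paths, 8));
    }
}
```

For S3 specifically, a single `LIST` request can also report many keys at once, which may beat per-key existence checks regardless of thread count.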
dmoore247 commented 12 years ago

Isn't one of the benefits of this framework to eliminate redundant work? The use case is that the 60,000 files have previously been processed, two new files are added, and another has been replaced. Running the job over all 60,000 files would be less desirable than the time required to check the dependencies, right? Of course, there are times when you know you want to replace or force the reprocessing of all files, so the proposed feature would be a '-skip' flag for dependency checking or a '-force' flag for reprocessing. I don't like the idea of spawning multiple threads to check HDFS files because of the complexity it would add to the Hadoop client code.

psutter commented 12 years ago

I think we can reduce the metadata IO 100-to-1 using the manifest design, allowing us to keep the dependency checking without a performance penalty.
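The thread doesn't spell out the manifest design, but the general idea can be sketched as: persist one small file mapping path to modification time, then a single read of that manifest replaces per-file existence checks, and only entries that differ need re-checking. Everything below (the `ManifestDiff` class, the `diff` helper, using a timestamp as the change marker) is an assumed illustration, not tap's actual implementation.

```java
import java.util.*;

public class ManifestDiff {
    // Compare a previously saved manifest (path -> mtime) against the
    // current listing, classifying each path instead of probing S3 per file.
    static Map<String, List<String>> diff(Map<String, Long> manifest,
                                          Map<String, Long> current) {
        List<String> added = new ArrayList<>();
        List<String> changed = new ArrayList<>();
        List<String> removed = new ArrayList<>();
        for (Map.Entry<String, Long> e : current.entrySet()) {
            Long prev = manifest.get(e.getKey());
            if (prev == null) {
                added.add(e.getKey());          // new since last run
            } else if (!prev.equals(e.getValue())) {
                changed.add(e.getKey());        // replaced since last run
            }
        }
        for (String p : manifest.keySet()) {
            if (!current.containsKey(p)) {
                removed.add(p);                 // deleted since last run
            }
        }
        Map<String, List<String>> out = new LinkedHashMap<>();
        out.put("added", added);
        out.put("changed", changed);
        out.put("removed", removed);
        return out;
    }

    public static void main(String[] args) {
        Map<String, Long> manifest = new LinkedHashMap<>();
        manifest.put("part-0", 1L);
        manifest.put("part-1", 2L);
        Map<String, Long> current = new LinkedHashMap<>();
        current.put("part-0", 1L);   // unchanged
        current.put("part-1", 3L);   // replaced
        current.put("part-2", 1L);   // new
        System.out.println(diff(manifest, current));
        // prints {added=[part-2], changed=[part-1], removed=[]}
    }
}
```

With 60,000 files this turns 60,000 round-trips into one manifest read plus one directory listing, which matches the 100-to-1 order of reduction claimed above.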
