psutter opened this issue 12 years ago
Isn't one of the benefits of this framework to eliminate redundant work? The use case is that 60,000 files have previously been processed, and then two new files are added and another is replaced. Re-running the job over all 60,000 files would be less desirable than the time required to check the dependencies, right? Of course, there are times when you know you want to replace or force the re-processing of all files, so the proposed feature would be a '-skip' flag for dependency checking or a '-force' flag for reprocessing. I don't like the idea of spawning multiple threads to check HDFS files due to the complexity it would add to the Hadoop client code.
I think we can reduce the metadata IO 100-to-1 using the manifest design, allowing us to keep dependency checking without a performance penalty.
On Fri, Jan 6, 2012 at 10:55 AM, Douglas Moore wrote the comment quoted above.
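A minimal sketch of the manifest idea: keep one manifest file mapping each input path to its last-modified time, and diff it against the current listing, so plan generation does a single metadata read instead of one existence check per file. All names here (ManifestSketch, planChanged) are illustrative assumptions, not tap's actual API.

```java
import java.util.*;

// Hypothetical sketch, not tap's real classes: a manifest records
// path -> last-modified time from the previous run; diffing it against
// the current listing yields only the new or replaced files to process.
public class ManifestSketch {
    // previous run's manifest: path -> modification time (epoch millis)
    static Map<String, Long> previous = new HashMap<>();

    static List<String> planChanged(Map<String, Long> current) {
        List<String> toProcess = new ArrayList<>();
        for (Map.Entry<String, Long> e : current.entrySet()) {
            Long prev = previous.get(e.getKey());
            // new file (no manifest entry) or replaced file (newer mtime)
            if (prev == null || prev < e.getValue()) {
                toProcess.add(e.getKey());
            }
        }
        Collections.sort(toProcess);
        return toProcess;
    }

    public static void main(String[] args) {
        previous.put("a.txt", 100L);
        previous.put("b.txt", 100L);
        Map<String, Long> current = new HashMap<>();
        current.put("a.txt", 100L); // unchanged -> skipped
        current.put("b.txt", 200L); // replaced  -> reprocessed
        current.put("c.txt", 150L); // new       -> processed
        System.out.println(planChanged(current)); // prints [b.txt, c.txt]
    }
}
```

With 60,000 unchanged files and 3 changes, this plans only the 3, and the per-file existence checks collapse into one manifest read plus one directory listing.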
When running a job with 60,000 input files stored in S3, it can take two hours to check for file existence in generatePlan before the job even begins.
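For scale, the two-hour figure is consistent with purely sequential per-file checks: at an assumed ~120 ms per S3 existence check (a hypothetical round-trip latency, not a measured one), 60,000 checks take two hours.

```java
// Back-of-envelope estimate; the 120 ms per-check latency is an assumption.
public class ExistenceCheckEstimate {
    public static void main(String[] args) {
        int files = 60_000;
        double perCheckMs = 120.0; // assumed sequential S3 round trip
        double totalHours = files * perCheckMs / 1000.0 / 3600.0;
        System.out.println(totalHours + " hours"); // prints 2.0 hours
    }
}
```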
Solutions: