iconara / rubydoop

Write Hadoop jobs in JRuby

Don't submit jobs in parallel #41

Closed stenlarsson closed 5 years ago

stenlarsson commented 8 years ago

Submitting jobs in parallel is not thread safe. There is a "unique number generator" that is used when downloading files to the file cache. Unfortunately it only yields unique numbers within a single job, since each job has its own LocalDistributedCacheManager.

http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-mapreduce-client-common/2.2.0/org/apache/hadoop/mapred/LocalDistributedCacheManager.java#95
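
For illustration, a minimal Ruby sketch of the collision, assuming the generator is a counter seeded with the current time in milliseconds, as in the linked Hadoop 2.2.0 source (all names here are made up):

# Each job creates its own generator, seeded with the wall clock
# (the Ruby equivalent of System.currentTimeMillis()).
seed = (Time.now.to_f * 1000).to_i

job_a_counter = seed  # job A's "unique number generator"
job_b_counter = seed  # job B's, created in the same millisecond

# Both jobs draw their first "unique" number...
a_first = (job_a_counter += 1)
b_first = (job_b_counter += 1)

# ...and get the same value, so they race for the same cache path.
puts a_first == b_first  # => true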

With this change all jobs are submitted sequentially. In a local environment I think this means the jobs no longer run in parallel, but in a distributed environment they still should.
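
As a rough sketch of the intended behaviour, assuming the change boils down to submitting every job from a single thread (the run_all helper is illustrative, not the actual patch):

# Hypothetical helper: submit one job at a time from a single thread,
# then wait for all of them. Submission is serialized, so the cache
# setup race above can't happen; on a cluster the submitted jobs can
# still execute concurrently.
def run_all(jobs)
  jobs.each(&:submit)
  jobs.each { |job| job.wait_for_completion(true) }
end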

I was unfortunately not able to run the integration tests, so I don't know if this really works.

grddev commented 8 years ago

In theory, this fix doesn't fully address the problem: in a setup like the one below, jobs b and d could still be launched at the same time even with this change. It seems the only safe solution would be to submit the jobs while holding some form of lock.

parallel do
  sequence { a = job { ... }; b = job { ... } }
  sequence { c = job { ... }; d = job { ... } }
end
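
A minimal sketch of the lock approach (SUBMIT_LOCK and run_job are hypothetical names, not Rubydoop internals): every submission takes one process-wide lock, wherever it happens in the parallel/sequence tree, so b and d can never be submitted concurrently, while waiting for completion remains parallel.

require 'thread'

SUBMIT_LOCK = Mutex.new  # one lock shared by all submitting threads

def run_job(job)
  # Serialize only the submission step, where the local distributed
  # cache is set up; release the lock before blocking on completion
  # so other jobs can still run at the same time.
  SUBMIT_LOCK.synchronize { job.submit }
  job.wait_for_completion(true)
end

Each sequence would then call run_job for its jobs from its own thread, exactly as in the example above, without risking concurrent submissions.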
stenlarsson commented 8 years ago

You're right, it doesn't work if you have sequences inside a parallel block. Your suggestion with locks sounds better.

stenlarsson commented 5 years ago

Closing since I no longer use Rubydoop.