iconara / rubydoop

Write Hadoop jobs in JRuby

Don't submit jobs in parallel #41

Closed stenlarsson closed 5 years ago

stenlarsson commented 8 years ago

Submitting jobs in parallel is not thread safe. There is a "unique number generator" that is used when downloading files to the file cache. Unfortunately it only yields unique numbers within a single job, since each job has its own LocalDistributedCacheManager.

http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-mapreduce-client-common/2.2.0/org/apache/hadoop/mapred/LocalDistributedCacheManager.java#95
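
For illustration, a minimal Ruby sketch of the collision, assuming the generator is a counter seeded with the current time in milliseconds, as in the linked Hadoop 2.2.0 source (all names here are made up):

# Each job creates its own generator, seeded with the wall clock
# (the Ruby equivalent of System.currentTimeMillis()).
seed = (Time.now.to_f * 1000).to_i

job_a_counter = seed  # job A's "unique number generator"
job_b_counter = seed  # job B's, created in the same millisecond

# Both jobs draw their first "unique" number...
a_first = (job_a_counter += 1)
b_first = (job_b_counter += 1)

# ...and get the same value, so they race for the same cache path.
puts a_first == b_first  # => true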

With this change all jobs are submitted sequentially. In a local environment I think this means the jobs no longer run in parallel, but in a distributed environment they still should.
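
As a rough sketch of the intended behaviour, assuming the change boils down to submitting every job from a single thread (the run_all helper is illustrative, not the actual patch):

# Hypothetical helper: submit one job at a time from a single thread,
# then wait for all of them. Submission is serialized, so the cache
# setup race above can't happen; on a cluster the submitted jobs can
# still execute concurrently.
def run_all(jobs)
  jobs.each(&:submit)
  jobs.each { |job| job.wait_for_completion(true) }
end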

I was unfortunately not able to run the integration tests, so I don't know if this really works.

grddev commented 8 years ago

In theory, this fix doesn't fully address the problem: in a setup like the one below, jobs b and d could still be launched at the same time even with this change. It seems the only safe solution would be to submit the jobs while holding some form of lock.

parallel do
  sequence { a = job { ... }; b = job { ... } }
  sequence { c = job { ... }; d = job { ... } }
end
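
A minimal sketch of the lock approach (SUBMIT_LOCK and run_job are hypothetical names, not Rubydoop internals): every submission takes one process-wide lock, wherever it happens in the parallel/sequence tree, so b and d can never be submitted concurrently, while waiting for completion remains parallel.

require 'thread'

SUBMIT_LOCK = Mutex.new  # one lock shared by all submitting threads

def run_job(job)
  # Serialize only the submission step, where the local distributed
  # cache is set up; release the lock before blocking on completion
  # so other jobs can still run at the same time.
  SUBMIT_LOCK.synchronize { job.submit }
  job.wait_for_completion(true)
end

Each sequence would then call run_job for its jobs from its own thread, exactly as in the example above, without risking concurrent submissions.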
stenlarsson commented 8 years ago

You're right, it doesn't work if you have sequences inside a parallel block. Your suggestion with locks sounds better.

stenlarsson commented 5 years ago

Closing since I no longer use Rubydoop.