Closed hansmire closed 10 years ago
Thanks, this looks like a good change. I'm not sure what to do about the Hadoop version issues, but for now I'm going to stick with 1.0.3 (perhaps I'll bump to 1.2.1 soon, though).
I had to revert this merge. I tested it on one of my workflows and it was many, many times slower. It combined all the maps into a single, huge map, losing the advantage of parallelism. I need to read up more on CombineFileSplit, I guess.
You should be able to modify this setting to reduce the size of the FileSplit.
"mapred.max.split.size"
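To make the effect of that cap concrete, here is a minimal sketch in plain Java. It is a toy model, not Hadoop's implementation: the class `SplitSizeSketch` and its `pack` method are hypothetical names, and the real CombineFileInputFormat also groups blocks by node and rack locality, which this ignores. It only shows why an unset cap can collapse everything into one split while a finite cap restores multiple map tasks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitSizeSketch {
    // Greedily packs file sizes (in bytes) into splits of at most maxSplitSize bytes.
    // maxSplitSize <= 0 models "no cap": every file lands in a single split.
    static List<List<Long>> pack(long[] fileSizes, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : fileSizes) {
            boolean overCap = maxSplitSize > 0 && currentSize + size > maxSplitSize;
            if (overCap && !current.isEmpty()) {
                // Close the current split and start a new one.
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long[] files = new long[1000];   // 1000 small files of 4 MB each (made-up numbers)
        Arrays.fill(files, 4 * mb);
        System.out.println("no cap:     " + pack(files, 0).size() + " split(s)");
        System.out.println("128 MB cap: " + pack(files, 128 * mb).size() + " split(s)");
    }
}
```

In a real job the cap would presumably be set on the job configuration before the input format computes its splits, for example with `conf.setLong("mapred.max.split.size", 128L * 1024 * 1024)`; the 128 MB value is only an illustration, not a recommendation from this thread.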
ok, as I said, I need to familiarize myself more with it before I merge it back in.
On Wed, Apr 2, 2014 at 10:29 PM, Max Hansmire notifications@github.com wrote:
You should be able to modify this setting to reduce the size of the FileSplit.
"mapred.max.split.size"
Would you mind resubmitting as a new PR based off the current develop branch? I bumped the Hadoop version to 1.2.1.
Change the SequenceFilePailInputFormat to use the CombineFileInputFormat. This should reduce the number of input splits for Pail sources. In my tests, several thousand splits were reduced to one.
There is an issue with this change. It will not work with Hadoop 2.0.5-alpha, which is the version of Hadoop that I have deployed. The reason is that the implementation of CombineFileInputFormat in that version does not call listStatus(JobConf conf) from the mapred package to get the list of files. Instead it calls listStatus(JobContext conf) from the mapreduce package, so an override of the mapred variant is never invoked.
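The dispatch problem described above can be sketched with stand-in classes. Nothing here is Hadoop code: `JobConf`, `JobContext`, `BaseFormat` and `PailFormat` are simplified, hypothetical stand-ins that only mirror the two overloads named in the description, and the "old" and "new" split methods model the behaviour reported for the two Hadoop versions rather than their actual source.

```java
public class ListStatusDispatchSketch {
    // Stand-ins for org.apache.hadoop.mapred.JobConf and
    // org.apache.hadoop.mapreduce.JobContext (assumed unrelated types here).
    static class JobConf {}
    static class JobContext {}

    // Stand-in for a base input format that exposes both listing hooks.
    static class BaseFormat {
        String listStatus(JobConf conf) { return "default listing (mapred)"; }
        String listStatus(JobContext job) { return "default listing (mapreduce)"; }

        // Models a version whose split computation calls the mapred overload.
        String getSplitsOldStyle() { return listStatus(new JobConf()); }

        // Models the behaviour reported for 2.0.5-alpha: the mapreduce overload is called.
        String getSplitsNewStyle() { return listStatus(new JobContext()); }
    }

    // Stand-in for a Pail input format that customizes only the mapred overload.
    static class PailFormat extends BaseFormat {
        @Override
        String listStatus(JobConf conf) { return "pail listing"; }
    }

    public static void main(String[] args) {
        BaseFormat format = new PailFormat();
        System.out.println(format.getSplitsOldStyle());
        System.out.println(format.getSplitsNewStyle());
    }
}
```

Because Java picks the overload from the argument's static type, the override is reached only on the old-style path; on the new-style path the custom file listing is silently skipped.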
Two possible fixes.