nathanmarz / dfs-datastores

Dead-simple vertical partitioning, compression, appends, and consolidation of data on a distributed filesystem.
BSD 3-Clause "New" or "Revised" License

revert to hadoop 1.0.3 #45

Closed hansmire closed 10 years ago

hansmire commented 10 years ago

Change the SequenceFilePailInputFormat to use the CombineFileInputFormat. This should reduce the number of input splits for Pail sources. In my tests, several thousand splits were reduced to one.

There is an issue with this change: it will not work with Hadoop 2.0.5-alpha, which is the version of Hadoop that I have deployed. The reason is that the implementation of CombineFileInputFormat in that version does not call listStatus(JobConf conf) from the mapred package to get the list of files. Instead it calls listStatus(JobContext conf) from the mapreduce package.

Two possible fixes.

sorenmacbeth commented 10 years ago

Thanks, this looks like a good change. I'm not sure what to do about the hadoop version issues, but for now I'm going to stick with 1.0.3 (perhaps I'll bump to 1.2.1 soon though).

sorenmacbeth commented 10 years ago

I had to revert this merge. I tested it on one of my workflows and it was many, many times slower: it combined all the maps into a single, huge map, losing the advantage of parallelism. I need to read up more on CombineFileSplit, I guess.

hansmire commented 10 years ago

You should be able to modify this setting to reduce the size of the FileSplit.

"mapred.max.split.size"

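To illustrate why capping the split size should restore parallelism, here is a minimal sketch of the packing idea in plain Java. It is a simplified model only, not Hadoop's actual CombineFileInputFormat logic (which also groups blocks by node and rack). The class name `SplitPackingSketch`, the method `packSplits`, and the file sizes are hypothetical and chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified model of combining many small files into input splits. */
final class SplitPackingSketch {

    /**
     * Greedily packs files, in order, into combined splits.
     * A maxSplitSize of 0 or less models "no limit", so every file
     * lands in one split. Returns the total byte size of each split.
     */
    static List<Long> packSplits(long[] fileSizes, long maxSplitSize) {
        List<Long> splits = new ArrayList<>();
        long current = 0;
        boolean open = false;
        for (long size : fileSizes) {
            // Close the current split if adding this file would exceed the cap.
            if (open && maxSplitSize > 0 && current + size > maxSplitSize) {
                splits.add(current);
                current = 0;
                open = false;
            }
            current += size;
            open = true;
        }
        if (open) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long[] files = new long[4000];   // assumed: 4000 small files of 1 MB each
        Arrays.fill(files, mb);

        // No cap: all files collapse into a single split (a single map task).
        System.out.println(packSplits(files, 0).size());
        // 128 MB cap: the same input is spread over many splits.
        System.out.println(packSplits(files, 128 * mb).size());
    }
}
```

With these assumed inputs the uncapped run yields 1 split and the 128 MB cap yields 32, which mirrors the trade-off discussed above: fewer splits than one per file, but still enough map tasks to run in parallel.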
sorenmacbeth commented 10 years ago

ok, as I said, I need to familiarize myself more with it before I merge it back in.

On Wed, Apr 2, 2014 at 10:29 PM, Max Hansmire notifications@github.com wrote:

You should be able to modify this setting to reduce the size of the FileSplit.

"mapred.max.split.size"


sorenmacbeth commented 10 years ago

would you mind resubmitting as a new PR based off the current develop branch? I bumped the hadoop version to 1.2.1