kite-sdk / kite

Kite SDK
http://kitesdk.org/docs/current/
Apache License 2.0
394 stars 263 forks source link

KITE-1025 Use CombineFileInputFormat #390

Closed gabrielreid closed 9 years ago

gabrielreid commented 9 years ago

Wrap Avro and Parquet inputs with CombineFileInputFormat so that multiple small files don't result in multiple input splits and MapReduce tasks.

rdblue commented 9 years ago

@gabrielreid, thanks for working on this. It will be great to have combine support.

Can you write a little on how it is intended to work? I noticed that this removes the check in Crunch. What changes for regular MR and for Crunch?

gabrielreid commented 9 years ago

Sure, sorry for not adding some additional explanation previously.

The way that Crunch works (in recent versions) is that it combines multiple InputSplits into one (via CrunchCombineFileInputFormat) if both of the following are true:

Before this commit, Kite correctly set DISABLE_COMBINE_FILE to false for Parquet and Avro files, but due to the underlying input format not being a subclass of FileInputFormat anymore, Crunch didn't wrap it in CrunchCombineFileInputFormat, so no combined splits were created.

This commit introduces similar CombineFileInputFormat logic into Kite, as letting Crunch handle the combining is no longer feasible.

CombineFileInputFormat is used for Avro and Parquet files. This involves two things:

The JSON and CSV formats aren't changed, because it appears that their record readers are instantiated with a path to read from, which wouldn't directly map to the way CombineFileInputFormat works (as it relies on the initialize() method in a RecordReader to set up the path to read from). With some more work I think it would be feasible to add combine support for JSON and CSV however.

The underlying Combine functionality is automatically used in both Crunch and normal MR jobs.

Note that I've just updated this commit, as I discovered that there was an issue with it when multiple sources were being read in a single Crunch task. CrunchRecordReader has some logic that specifically checks if an incoming InputSplit is a CombineFileSplit. Seeing as Kite is now generating CombineFileSplits, Crunch was incorrectly interpreting these input splits as something created by Crunch. I've worked around this by adding AbstractKiteCombineFileInputFormat and KiteCombineFileSplit, which only exist to make Crunch treat them as "normal" input splits. It's not nice to add all that stuff in there, but I don't see any other way to get around it -- probably wouldn't be bad to update Crunch to do the same thing (i.e. have its own version of CombineFileSplit), and then this workaround could be removed from Kite.

gabrielreid commented 9 years ago

Weird, the Travis build is failing now with a timeout on org.kitesdk.data.crunch.TestCrunchDatasetsHive. Building everything under kite-data with the cdh4 profile seems to work fine for me -- is that a test that is a little unreliable?

rdblue commented 9 years ago

Yeah, we've seen flakiness with Travis on that test before. I restarted it and that usually fixes the problem.

mkwhitacre commented 9 years ago

Is there something outstanding on this issue or can it possibly be merged into master? It would definitely help with some situations we are seeing with hourly partitions and improve how fast we can roll them up into daily partitions.

rdblue commented 9 years ago

@mkwhitacre, I just need to review this and have been busy lately. I'll hopefully get to reviews this week.

mkwhitacre commented 9 years ago

Cool, was looking at improving the performance of some stuff and remembered @gabrielreid did this PR.

mkwhitacre commented 9 years ago

+1 to this change. Going to try building with these changes to verify it improve performance due to lots of mappers.

mkwhitacre commented 9 years ago

Saw a noticeable diff when running with this code. Specifically in our dev environment where we have lots of small files because of hourly extracts and low traffic volume of inflowing data (100k vs 1k of mappers).

gabrielreid commented 9 years ago

Thanks for taking it for a spin @mkwhitacre. That's pretty much in line with what I've seen running it as well.

rdblue commented 9 years ago

Merged as https://github.com/kite-sdk/kite/commit/e92bf712. Thanks @gabrielreid, this is a really valuable new feature! And thanks @mkwhitacre for reviewing and testing it out!