thinkaurelius / faunus

Graph Analytics Engine
http://faunus.thinkaurelius.com
Apache License 2.0

Faunus 0.10 is not compatible with Cloudera CDH4 #100

Open kottmann opened 11 years ago

kottmann commented 11 years ago

We have a Cloudera CDH4 cluster and ran into a compatibility issue with Faunus: CDH4 is based on Hadoop 2.x instead of Hadoop 1.x.

The Mapper.Context constructor signature changed and causes a NoSuchMethodError when called from Faunus. Here is the stack trace:

```
java.lang.NoSuchMethodError: org.apache.hadoop.mapreduce.Mapper$Context.<init>(Lorg/apache/hadoop/mapreduce/Mapper;Lorg/apache/hadoop/conf/Configuration;Lorg/apache/hadoop/mapreduce/TaskAttemptID;Lorg/apache/hadoop/mapreduce/RecordReader;Lorg/apache/hadoop/mapreduce/RecordWriter;Lorg/apache/hadoop/mapreduce/OutputCommitter;Lorg/apache/hadoop/mapreduce/StatusReporter;Lorg/apache/hadoop/mapreduce/InputSplit;)V
    at com.thinkaurelius.faunus.mapreduce.MemoryMapper$MemoryMapContext.<init>(MemoryMapper.java:32)
    at com.thinkaurelius.faunus.mapreduce.MapSequence$Map.setup(MapSequence.java:30)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:138)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:645)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:263)
```

We are getting this error even though we are running MRv1. The error above is the first one we got; there might be more compatibility issues.

The issue was first reported in the aureliusgraphs google group: https://groups.google.com/forum/#!topic/aureliusgraphs/B3gvUWOQ2cA

kottmann commented 11 years ago

Subclassing the Mapper.Context class does not seem to work for both versions, since the constructor is different in each version, but the subclass has to call the super constructor. It's probably possible to solve this somehow via reflection, but that would not be an elegant solution.
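
To make the constraint concrete, here is a minimal sketch (not the actual Faunus MemoryMapContext) of such a subclass. It compiles against Hadoop 1.x, where the qualified super call binds to the constructor shown in the stack trace above; on a Hadoop 2.x classpath that constructor no longer exists, so loading the class fails with NoSuchMethodError:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.TaskAttemptID;

// Hadoop 1.x only: Mapper.Context is a non-static inner class there, so a
// subclass has to invoke the inherited constructor via a qualified super
// call on an enclosing Mapper instance.
public class MemoryContextSketch extends Mapper<LongWritable, Text, LongWritable, Text>.Context {

    public MemoryContextSketch(Mapper<LongWritable, Text, LongWritable, Text> mapper,
                               Configuration conf, TaskAttemptID taskId)
            throws IOException, InterruptedException {
        // Binds at compile time to Mapper$Context.<init>(Configuration,
        // TaskAttemptID, RecordReader, RecordWriter, OutputCommitter,
        // StatusReporter, InputSplit) -- the signature that changed in 2.x.
        mapper.super(conf, taskId, null, null, null, null, null);
    }
}
```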

The MapSequence seems to run multiple Mappers during a single Mapper invocation from the framework. It looks like the purpose of this is to run multiple transformations on the input data as a pipeline without running multiple MapReduce jobs.

Would it be possible to perform these transformations without implementing the Mapper interface, thereby eliminating the need to subclass Mapper.Context?

okram commented 11 years ago

The purpose of the MapSequence is to do in-memory mapping when you have a chain of mappers in a row. I would have to think about how to remove the need to subclass Mapper.Context. If you have an idea and can provide a pull request, that would be most appreciated.
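
To illustrate the idea (this is just a toy, not the Faunus implementation): chaining mappers in memory amounts to function composition over key-value pairs, with each stage emitting directly into the next:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiConsumer;

// Toy sketch of in-memory mapper chaining: each stage maps one key-value
// pair to zero or more pairs, which are fed straight into the next stage
// instead of going through a separate MapReduce job.
public class MapChainSketch {

    interface Stage<K, V> {
        void map(K key, V value, BiConsumer<K, V> emit);
    }

    static <K, V> void runChain(List<Stage<K, V>> stages, int i,
                                K key, V value, BiConsumer<K, V> sink) {
        if (i == stages.size()) {
            sink.accept(key, value); // end of the chain: final output
        } else {
            stages.get(i).map(key, value, (k, v) -> runChain(stages, i + 1, k, v, sink));
        }
    }

    public static void main(String[] args) {
        List<Stage<String, Integer>> stages = Arrays.asList(
                (k, v, emit) -> emit.accept(k, v + 1),                // stage 1
                (k, v, emit) -> emit.accept(k.toUpperCase(), v * 2)); // stage 2
        runChain(stages, 0, "a", 1, (k, v) -> System.out.println(k + "=" + v)); // prints A=4
    }
}
```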

kottmann commented 11 years ago

Ok, so we can have something like this: Mapper1 | Mapper2 | Mapper3 | Reducer.

What do you think about ChainMapper to set up the Mappers? As far as I can see it is available in both versions.

ChainMapper JavaDoc: http://hadoop.apache.org/docs/r1.1.1/api/org/apache/hadoop/mapred/lib/ChainMapper.html
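
For illustration, here is roughly what that wiring looks like with the old org.apache.hadoop.mapred API. The identity stages below are placeholders for real transformations, and input/output formats and paths are omitted; this is a sketch, not Faunus code:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Sketch of a Mapper1 | Mapper2 | Mapper3 | Reducer pipeline using the old
// mapred API: all three map stages run inside a single mapper task.
public class ChainSketch {

    // Stands in for a real transformation stage.
    public static class IdentityStage extends MapReduceBase
            implements Mapper<LongWritable, Text, LongWritable, Text> {
        public void map(LongWritable key, Text value,
                        OutputCollector<LongWritable, Text> out, Reporter reporter)
                throws IOException {
            out.collect(key, value); // a real stage would transform the pair here
        }
    }

    public static JobConf buildJob() {
        JobConf job = new JobConf(ChainSketch.class);
        job.setJobName("chained-transformations");
        ChainMapper.addMapper(job, IdentityStage.class, LongWritable.class, Text.class,
                LongWritable.class, Text.class, true, new JobConf(false));
        ChainMapper.addMapper(job, IdentityStage.class, LongWritable.class, Text.class,
                LongWritable.class, Text.class, true, new JobConf(false));
        ChainMapper.addMapper(job, IdentityStage.class, LongWritable.class, Text.class,
                LongWritable.class, Text.class, true, new JobConf(false));
        ChainReducer.setReducer(job, IdentityReducer.class, LongWritable.class, Text.class,
                LongWritable.class, Text.class, true, new JobConf(false));
        return job;
    }
}
```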

okram commented 11 years ago

ChainMapper doesn't work for the mapreduce library -- only mapred. Hence the reason I created MapSequence :(.

karkumar commented 11 years ago

Hi all, I managed to get Faunus working with CDH 4.2. There were roughly two sets of changes I had to make:

  1. I had to update the pom.xml to build against the Cloudera jars. I suspect there is significant flexibility here: as long as you are building against either the Cloudera jars or the corresponding Apache release, everything should work. You can find a list of the Apache releases that are packaged in CDH4.2 here:

http://www.cloudera.com/content/cloudera-content/cloudera-docs/PkgVer/3.25.2013/CDH-Version-and-Packaging-Information/cdhvd_topic_3_1.html?scroll=topic_3_1

  2. I slightly re-implemented com.thinkaurelius.faunus.mapreduce.MemoryMapper. First, a little background: somewhere between Hadoop 1.1.2 and Hadoop 2.0, Mapper.Context changed from extending the class org.apache.hadoop.mapreduce.MapContext to implementing an interface of the same name. At the same time, the code that resided in org.apache.hadoop.mapreduce.MapContext (in Hadoop 1.1) was moved to org.apache.hadoop.mapreduce.task.MapContextImpl (in 2.x).

The simplest way to get around these changes is to reimplement MemoryMapper.MemoryMapContext to encapsulate org.apache.hadoop.mapreduce.task.MapContextImpl and simply forward all method calls to the member variable.
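
For reference, a minimal sketch of that approach against the Hadoop 2.x API, using the stock WrappedMapper/MapContextImpl pattern (illustrative; not necessarily the exact code in the fork):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.MapContext;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.StatusReporter;
import org.apache.hadoop.mapreduce.TaskAttemptID;
import org.apache.hadoop.mapreduce.lib.map.WrappedMapper;
import org.apache.hadoop.mapreduce.task.MapContextImpl;

// Hadoop 2.x sketch: rather than subclassing Mapper.Context (whose
// constructor changed), build a MapContextImpl and let WrappedMapper adapt
// it to a Mapper.Context. The arguments are whatever the surrounding task
// would normally supply.
public final class Hadoop2ContextSketch {

    public static <KI, VI, KO, VO> Mapper<KI, VI, KO, VO>.Context createMapContext(
            Configuration conf, TaskAttemptID taskId,
            RecordReader<KI, VI> reader, RecordWriter<KO, VO> writer,
            OutputCommitter committer, StatusReporter reporter, InputSplit split) {
        MapContext<KI, VI, KO, VO> impl = new MapContextImpl<KI, VI, KO, VO>(
                conf, taskId, reader, writer, committer, reporter, split);
        return new WrappedMapper<KI, VI, KO, VO>().getMapContext(impl);
    }
}
```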

The right answer for Hadoop 2 / CDH compatibility is probably to create another build profile. MRUnit (https://github.com/apache/mrunit/blob/trunk/pom.xml) does this particularly effectively. If there is interest in going this route, or other suggestions on how to create a build that works with both versions, I would be happy to volunteer my time to implement it.

For now, I've forked Faunus and implemented the fixes. The fork can be found at https://github.com/karkumar/faunus and the fix is in the cdh4-port branch.

Thanks again!

karkumar commented 11 years ago

Hey guys, so I just did the update to the 0.4.0 snapshot for Apache Hadoop 2 compatibility. The only change I had to make was to change instances of TaskAttemptContext to TaskAttemptContextImpl. Again, the fork can be found at https://github.com/karkumar/faunus and the fix is in the cdh4-port branch.

okram commented 11 years ago

The problem with that (correct me if I'm wrong) is that TaskAttemptContextImpl does NOT work with Hadoop 1.y.z. Hadoop 2 has not seen a stable release yet. Until Apache Hadoop goes 2.0-stable, we are going to stick with the 1.y.z API.

If you can figure out how to make it 2.0 AND 1.y.z compatible, I would definitely make that change immediately.

karkumar commented 11 years ago

Yup, that's correct. That said, there are only two classes that are really preventing Faunus from being Hadoop 2 compatible: MemoryMapper.MemoryMapContext and TaskAttemptContext. Really, the only change between Hadoop 1 and Hadoop 2 is that these classes became abstract and their implementations were moved to impl classes in alternate packages.

So you could just package your own version of MapContext and TaskAttemptContext in Faunus -- literally cut and paste them out of the Hadoop 1.y.z code base into Faunus -- and then you should be able to run against either 1.y.z or 2.0. The solution isn't elegant, but it will probably work.

You would probably also want to add a maven build profile that changes the Hadoop and MRUnit artifacts to the 2.0 artifacts.

If you'd like, I can experiment with this change in my fork. If not, the overhead of keeping my fork up to date is fairly minor, so I can keep doing that and updating this ticket.

Thanks for taking the time to think about my request, it's much appreciated.

okram commented 11 years ago

> Yup, that's correct. That said, there are only two classes that are really preventing Faunus from being Hadoop 2 compatible: MemoryMapper.MemoryMapContext and TaskAttemptContext. Really, the only change between Hadoop 1 and Hadoop 2 is that these classes became abstract and their implementations were moved to impl classes in alternate packages.

Gotcha.

> So you could just package your own version of MapContext and TaskAttemptContext in Faunus -- literally cut and paste them out of the Hadoop 1.y.z code base into Faunus -- and then you should be able to run against either 1.y.z or 2.0. The solution isn't elegant, but it will probably work.

That is an interesting idea... hmmm. I will think on that for Faunus 0.4.

> You would probably also want to add a maven build profile that changes the Hadoop and MRUnit artifacts to the 2.0 artifacts.

Ah. Yeah, that's a problem there.

> If you'd like, I can experiment with this change in my fork. If not, the overhead of keeping my fork up to date is fairly minor, so I can keep doing that and updating this ticket.

Please.

> Thanks for taking the time to think about my request, it's much appreciated.

Thank you for your interest.

karkumar commented 11 years ago

Hi,

I think I have found a reasonable solution to this incompatibility, which allows us to generate Hadoop 1-compatible and Hadoop 2-compatible binaries from the same code base.

There are three different parts to the fix:

  1. I created a factory (MemoryMapContextFactory) that wraps MemoryMapper.Context in a cglib proxy object. The proxy object checks whether Hadoop 1 or Hadoop 2 is being used and constructs the context object appropriately (a sketch of the kind of version check such a factory can use follows after this list). The proxy object routes function calls either to MemoryMapper.Context or to Mapper.Context (if Hadoop 1), or either to MemoryMapper.Context or to MapContextImpl (if Hadoop 2). The proxy object I wrote is not as efficient as it could be; I wanted to see if there was interest in this type of solution before spending any more time on the fix.
  2. I created a factory (TaskAttemptContextFactory) that checks to see if we are using Hadoop 1 or Hadoop 2 and creates the correct TaskAttemptContext. Since this change was simply to get the tests working when building against Hadoop 1 or Hadoop 2, I also changed all tests that directly instantiated a TaskAttemptContext to use the TaskAttemptContextFactory.
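
A sketch of the kind of runtime version check and reflective construction such factories can rely on (hypothetical class and method names; the cglib proxying itself is omitted):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.TaskAttemptID;

// Hypothetical sketch of the factory idea: detect which Hadoop generation is
// on the classpath, then instantiate the matching context class reflectively
// so the same jar can run against Hadoop 1.x and Hadoop 2.x.
public final class TaskAttemptContextFactorySketch {

    // On Hadoop 2.x, org.apache.hadoop.mapreduce.JobContext is an interface;
    // on Hadoop 1.x it is a concrete class.
    public static boolean isHadoop2() {
        try {
            return Class.forName("org.apache.hadoop.mapreduce.JobContext").isInterface();
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Hadoop is not on the classpath", e);
        }
    }

    public static TaskAttemptContext create(Configuration conf, TaskAttemptID id)
            throws ReflectiveOperationException {
        // Both classes expose a (Configuration, TaskAttemptID) constructor.
        String className = isHadoop2()
                ? "org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl"
                : "org.apache.hadoop.mapreduce.TaskAttemptContext";
        return (TaskAttemptContext) Class.forName(className)
                .getConstructor(Configuration.class, TaskAttemptID.class)
                .newInstance(conf, id);
    }
}
```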

If we stop with steps 1 and 2, we get Faunus jars that should be compatible with both Hadoop 1 and Hadoop 2. However, the distribution that is created is only compatible with Hadoop 1, because it includes the wrong Hadoop jars in the lib directory. So I created a build profile:

  3. I added hadoop1 and hadoop2 build profiles to the pom.xml. The default profile is hadoop1. If you wish to build against Hadoop 2, add -P hadoop2 to the maven command line. For example: mvn package -P hadoop2.

These changes can be found in the proxying-port branch of my fork : https://github.com/karkumar/faunus/tree/proxying-port

This solution isn't ideal because it imposes a small cost on every context.write. However, if we clean up the proxy object a bit that cost should be relatively minor.

Let me know if this is a viable option; I would be happy to spend some more time cleaning this up and testing.

Right now, you should be able to build and run all the tests against either build profile. All tests should pass. I've also tested the code against my Hadoop 2 cluster and it seems to work. I haven't tested against a Hadoop 1 cluster.

Again, thanks for taking the time to think about my request.

karkumar commented 11 years ago

I just wanted to revisit this issue in light of the recent release of 2.1.0. In 2.1.0 they claim that there is now source compatibility between jobs that use the Hadoop 1.x MapReduce APIs and Hadoop 2.0 (http://hadoop.apache.org/docs/r2.1.0-beta/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html).

Has anyone had a chance to try this out with Faunus? Is there any interest in creating multiple build profiles for Faunus, one that builds against Hadoop 1.0 jars and one that builds against Hadoop 2.1 jars?

Thanks

okram commented 11 years ago

Thanks for your work. I've used your MemoryMapper code and have created a Hadoop2 branch of Faunus: https://github.com/thinkaurelius/faunus/tree/hadoop2