Elephant Bird is Twitter's open source library of LZO-, Thrift-, and Protocol Buffer-related Hadoop InputFormats, OutputFormats, Writables, Pig LoadFuncs, Hive SerDes, HBase miscellanea, etc. The majority of these are in production at Twitter, running over data every day.
Join the conversation about Elephant-Bird on the developer mailing list.
git clone git://github.com/twitter/elephant-bird.git
mvn package
mvn javadoc:javadoc
Note: For any of the LZO-based code, make sure that the native LZO libraries are on your java.library.path. Generally this is done by setting JAVA_LIBRARY_PATH in pig-env.sh or hadoop-env.sh. You can also add lines like PIG_OPTS=-Djava.library.path=/path/to/my/libgplcompression/dir to pig-env.sh. See the instructions for Hadoop-LZO for more details.
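As a concrete sketch, assuming the native libraries were installed under /opt/hadoop-lzo/lib/native (a hypothetical path; use wherever libgplcompression lives on your machines), the relevant line in hadoop-env.sh might look like:

```shell
# hadoop-env.sh -- put the native LZO libraries on the JVM's library path.
# /opt/hadoop-lzo/lib/native is an example path, not a standard location.
export JAVA_LIBRARY_PATH=/opt/hadoop-lzo/lib/native:$JAVA_LIBRARY_PATH
```

The equivalent PIG_OPTS line shown above achieves the same thing for Pig by passing -Djava.library.path directly to the JVM.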
There are a few simple examples that use the input formats. Note how the Protocol Buffer and Thrift classes are passed to input formats through configuration.
Elephant Bird release artifacts are published to the Sonatype OSS releases repository and promoted from there to Maven Central. From time to time we may also deploy snapshot releases to the Sonatype OSS snapshots repository.
Elephant Bird can be built against a different Protocol Buffers version by overriding the protobuf.version Maven property (e.g. mvn package -Dprotobuf.version=2.3.0). Elephant-Bird defines the majority of its dependencies in Maven's provided scope. As a result, these dependencies are not pulled in transitively by Elephant-Bird modules. Please see the wiki page for more information.
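Because of the provided scope, a project depending on an Elephant-Bird module must declare the Hadoop (and other provided) dependencies in its own build. A sketch of what this might look like in a consumer's pom.xml; the version numbers here are purely illustrative, and you should check the repository for current artifact coordinates:

```xml
<!-- Your project's pom.xml (illustrative versions) -->
<dependencies>
  <!-- An Elephant-Bird module -->
  <dependency>
    <groupId>com.twitter.elephantbird</groupId>
    <artifactId>elephant-bird-core</artifactId>
    <version>4.17</version>
  </dependency>
  <!-- Declared "provided" by Elephant-Bird, so it does not come in
       transitively; your build must supply it explicitly. -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.3</version>
  </dependency>
</dependencies>
```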
Elephant-Bird provides input and output formats for working with a variety of plaintext formats stored in LZO compressed files.
Additionally, protocol buffers and thrift messages can be stored in a variety of file formats.
Hadoop provides two API implementations: the old-style org.apache.hadoop.mapred and new-style org.apache.hadoop.mapreduce packages. Elephant-Bird provides wrapper classes that allow unmodified usage of mapreduce input and output formats in contexts where the mapred interface is required. For more information, see DeprecatedInputFormatWrapper.java and DeprecatedOutputFormatWrapper.java.
Elephant-Bird's published packages are tested with both Hadoop 1.x and 2.x.
Loaders and storers are available for the input and output formats listed above. Additionally, pig-specific features include:
Elephant-Bird provides Hive support for reading thrift and protocol buffers. For more information, see How to use Elephant Bird with Hive.
Elephant-Bird provides Hadoop Input/Output Formats and Pig Load/Store Funcs for creating and searching Lucene indexes. See Elephant Bird Lucene.
Elephant Bird requires the Protocol Buffer compiler at build time, as generated classes are used internally (see, e.g., DynamicMessage and ProtobufBlockWriter). The Thrift compiler is required to generate classes used in tests. As these are native-code tools, they must be installed on the build machine; Java library dependencies are pulled from Maven repositories during the build.
We provide InputFormats, OutputFormats, Pig Load/Store functions, Hive SerDes, and Writables for working with Thrift and Google Protocol Buffers. We haven't written up the docs yet, but look at ProtobufMRExample.java, ThriftMRExample.java, people_phone_number_count.pig, and people_phone_number_count_thrift.pig under the examples directory for reflection-based dynamic usage. We also provide utilities for generating Protobuf-specific Loaders, Input/Output Formats, etc., if for some reason you want to avoid the dynamic bits.
Reading and writing Hadoop SequenceFiles with Pig is supported via classes SequenceFileLoader and SequenceFileStorage. These classes make use of a WritableConverter interface, allowing pluggable conversion of key and value instances to and from Pig data types.
Here's a short example: Suppose you have SequenceFile<Text, LongWritable> data sitting beneath path input. We can load that data with the following Pig script:
REGISTER '/path/to/elephant-bird.jar';
%declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
%declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
%declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter';
pairs = LOAD 'input' USING $SEQFILE_LOADER (
'-c $TEXT_CONVERTER', '-c $LONG_CONVERTER'
) AS (key: chararray, value: long);
To store {key: chararray, value: long} data as SequenceFile<Text, LongWritable>, the following may be used:
%declare SEQFILE_STORAGE 'com.twitter.elephantbird.pig.store.SequenceFileStorage';
STORE pairs INTO 'output' USING $SEQFILE_STORAGE (
'-c $TEXT_CONVERTER', '-c $LONG_CONVERTER'
);
For details, please see the Javadocs of SequenceFileLoader, SequenceFileStorage, and the WritableConverter implementations.
Bug fixes, features, and documentation improvements are welcome! Please fork the project and send us a pull request on GitHub.
Each new release since 2.1.3 has a tag. The latest version on master is what we are actively running on Twitter's Hadoop clusters daily, over hundreds of terabytes of data.
Major contributors are listed below. Lots of others have helped too, thanks to all of them! See git logs for credits.