mozilla / jydoop

Efficient Hadoop Map-Reduce in Python

Sequence file support #37

Closed mreid-moz closed 11 years ago

mreid-moz commented 11 years ago

This branch includes changes to support new Mapper types (besides HBase) including Text-based sequence files (useful for TestPilot data) and the PythonKey/PythonValue sequence files that are output by jydoop jobs.

The Mapper type is specified by creating a function in the jydoop job that returns either HBASE, TEXT, or JYDOOP, with HBASE being the default so that existing jobs do not need to be modified.
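As a rough illustration of the selection mechanism described above, the sketch below shows how a driver might look up the mapper type from a job and fall back to HBASE when the job does not declare one. The names `mappertype` and `resolve_mapper_type` are assumptions made for this example, as are the stand-in job classes; the actual function name and lookup code in this branch may differ.

```python
# Hypothetical sketch: `mappertype` and `resolve_mapper_type` are illustrative
# names, not confirmed jydoop identifiers.

VALID_MAPPER_TYPES = ("HBASE", "TEXT", "JYDOOP")
DEFAULT_MAPPER_TYPE = "HBASE"  # default, so existing jobs need no changes


def resolve_mapper_type(job):
    """Return the mapper type a job declares, or HBASE if it declares none."""
    declare = getattr(job, "mappertype", None)
    if declare is None:
        # Existing jobs that predate this feature fall through to the default.
        return DEFAULT_MAPPER_TYPE
    mapper_type = declare()
    if mapper_type not in VALID_MAPPER_TYPES:
        raise ValueError("Unknown mapper type: %r" % (mapper_type,))
    return mapper_type


class LegacyJob(object):
    """Stand-in for an existing job script with no mapper-type function."""


class TextJob(object):
    """Stand-in for a job that reads Text-based sequence files."""

    @staticmethod
    def mappertype():
        return "TEXT"


print(resolve_mapper_type(LegacyJob))
print(resolve_mapper_type(TextJob))
```

Rejecting unknown values up front is a design choice in this sketch: a typo in the job script then fails before the job is submitted rather than partway through.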

The HBaseDriver class is renamed to HadoopDriver to reflect its more general nature.

I also added an option to skip local output (and the corresponding deletion of data inside HDFS) to support the use of jydoop output as jydoop input. This allows one to set up data-processing pipelines with multiple stages of processing without having to download potentially large data sets between stages.

Sorry about the noisy intermediate commits; I ended up refactoring a few things as I went along.

Finally, I would like to reduce the amount of code in the Mapper classes, so if this seems like a sound approach, I can refactor them to inherit from a common ancestor.

tarasglek commented 11 years ago

This seems reasonable. Please expand the README to document new use-cases, etc.