pranab/chombo

Introduction

Hadoop-based ETL and various utility classes for Hadoop and Storm

Philosophy

Simple to use
Input and output in CSV format
Metadata defined in a simple JSON file
Extremely configurable with many configuration knobs

Solution

Various relational algebra operations, including Projection, Join etc.
Data extraction ETL to extract structured records from unstructured data
Data extraction ETL to extract structured records from JSON data
Data validation ETL with configurable rules and statistical parameters
Data profiling ETL with various techniques
Data transformation ETL with configurable transformation rules
Various statistical data exploration solutions
Data normalization
Seasonal data analysis
Various statistical parameter calculation
Various long-term statistical parameter calculation with incremental data
Bulk insert, update and delete of Hadoop data
Base classes for Storm Spout and Bolt
Utility classes for string, configuration
Utility classes for Storm and Redis
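
As a rough illustration of the Projection operation on CSV records listed above, here is a minimal sketch in plain Java. It is not chombo's actual code: the class name CsvProjection, the method project and the sample record are hypothetical, chosen only to show the idea of keeping selected fields of a delimited record. In a Hadoop job, logic like this would typically sit in the mapper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of chombo: projects selected fields of a delimited record.
public class CsvProjection {

    // Returns the fields at the given zero-based positions, in the order requested,
    // joined with the same delimiter.
    public static String project(String line, int[] fieldOrdinals, String delim) {
        // Quote the delimiter so characters like "|" are not treated as regex syntax;
        // the -1 limit keeps trailing empty fields.
        String[] fields = line.split(Pattern.quote(delim), -1);
        List<String> kept = new ArrayList<>();
        for (int ord : fieldOrdinals) {
            kept.add(fields[ord]);
        }
        return String.join(delim, kept);
    }

    public static void main(String[] args) {
        // Made-up sample record: customer id, date, product, quantity, amount
        String record = "c1001,2015-03-11,widget,3,29.97";
        System.out.println(project(record, new int[] {0, 2, 4}, ","));
        // prints "c1001,widget,29.97"
    }
}
```

The field positions are passed as an array so that the same method covers both dropping and reordering columns.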

Blogs

The following blogs of mine are a good source of details. They are the only source of detailed documentation. MapReduce jobs in this project are used in other projects, including sifarish, avenir etc. Blogs related to those projects are also relevant.

Build

For Hadoop 1

mvn clean install

For Hadoop 2 (non-YARN)

git checkout nuovo
mvn clean install

For Hadoop 2 (YARN)

git checkout nuovo
mvn clean install -P yarn

For Spark

Build chombo first in the master branch with
- mvn clean install
- sbt publishLocal
Then build chombo-spark in the chombo/spark directory with
- sbt clean package

Need help?

Please feel free to email me at pkghosh99@gmail.com

Contribution

Contributors are welcome. Please email me at pkghosh99@gmail.com
