is00hcw / tungsten-replicator

Automatically exported from code.google.com/p/tungsten-replicator

Enable replicator to load data into Hadoop from multiple replication services without conflicts #898

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
1. To which tool/application/daemon will this feature apply?

Tungsten Replicator

2. Describe the feature in general

Most Hadoop implementations receive data from more than one location.  The 
current hadoop.js script uses a single default location for all CSV data, which 
means that multiple replication services loading data may collide.  This is 
especially problematic when replicators run on different hosts and load into 
the same Hadoop cluster, since such conflicts would not be noticed until it is 
too late. 

The replicator will be extended to write staging data into a sub-directory that 
is qualified with the replication service name.  Hive schema generation will be 
extended accordingly to take into account this location. 

3. Describe the feature interface

The hadoop.js script will be extended to write CSV data into the following 
default location: 

   /user/tungsten/staging/<service name>

The ddlscan commands used to generate schema will be enhanced accordingly to 
insert the service name into external table locations, and base table 
generation will allow the service name to be added as a prefix. 

4. Give an idea (if applicable) of a possible implementation

See above. 

5. Describe pros and cons of this feature.

5a. Why the world will be a better place with this feature.

Fan-in topologies into Hadoop can be easily supported without specialized user 
configuration. 

5b. What hardship will the human race have to endure if this feature is
implemented.

Existing Hadoop implementations will have to bear a modest upgrade as staging 
CSV data will now go to a new location. 

6. Notes

Original issue reported on code.google.com by robert.h...@continuent.com on 6 May 2014 at 7:53

GoogleCodeExporter commented 9 years ago
This issue was closed by revision r2220.

Original comment by robert.h...@continuent.com on 9 May 2014 at 1:34

GoogleCodeExporter commented 9 years ago
This is now fixed.  

1. The hadoop.js script writes CSV files to a separate directory for each 
replication service.  For example, if replication service batch1 is writing 
data, it will by default write everything to /user/tungsten/staging/batch1. 

2. The Velocity templates are updated to add the service name to the file 
location for staging data as well as to prepend it to generated schema names.  
Here is an example invocation to generate schema definitions: 

```
/opt/continuent/tungsten/tungsten-replicator/bin/ddlscan \
  -template ddl-mysql-hive-0.10-staging.vm \
  -user tungsten -pass secret \
  -url jdbc:mysql:thin://logos1:3306/croc -db croc \
  -opt hdfsStagingDir /user/tungsten/staging/batch1 \
  -opt schemaPrefix batch1_
```
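The schemaPrefix option described above amounts to simple string prepending; the following JavaScript sketch shows the expected mapping. The function name is an assumption for illustration, not part of ddlscan:

```javascript
// Hedged sketch: how the schemaPrefix option maps a source schema to a
// generated Hive schema name. hiveSchemaName is an illustrative name only.
function hiveSchemaName(schemaPrefix, sourceSchema) {
  // e.g. prefix "batch1_" applied to source schema "croc" yields "batch1_croc"
  return schemaPrefix + sourceSchema;
}

// Matching the ddlscan invocation above (-opt schemaPrefix batch1_, -db croc):
var generated = hiveSchemaName("batch1_", "croc");
```

Prefixing the generated schema names this way keeps tables from different replication services distinct inside a single Hive metastore.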

3. Old loading behavior is still available in script hadoop_single.js.  It 
ignores the service name as hadoop.js used to. 

WARNING:  This will change the behavior of existing replicators.  Users should 
also get the latest version of the continuent-tools-hadoop utilities on Github, 
which is updated to accommodate this new behavior.  
(https://github.com/continuent/continuent-tools-hadoop)

To test this behavior, set up two Tungsten master replicator services that 
replicate into a single replicator process with a slave service for each 
master.  Put load on both source DBMS systems and observe that data loads 
correctly for each service.  You can use the Github tools to check data 
independently for each service and ensure that it is consistent. 

Original comment by robert.h...@continuent.com on 9 May 2014 at 1:46

GoogleCodeExporter commented 9 years ago
Verified with build 3.0.0-376. Installed a fan-in topology, applied load, and 
checked the data.

Original comment by csaba.si...@continuent.com on 22 Sep 2014 at 6:38

GoogleCodeExporter commented 9 years ago
The documentation has been updated to reflect this change.

Original comment by mc.br...@continuent.com on 13 Oct 2014 at 9:41