is00hcw / tungsten-replicator

Automatically exported from code.google.com/p/tungsten-replicator

Batch applier fails and corrupts CSV file if interval is too low #1080

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

1. Deploy a MySQL to Redshift topology with a very low setting for the block
commit interval: --svc-applier-block-commit-interval=10
2. Try running Croc.

What is the expected output?

The test passes.

What do you see instead?

Slave fails:

$ trepctl status
pendingError           : Stage task failed: stage=q-to-dbms seqno=2011 fragno=0
pendingErrorCode       : NONE
pendingErrorEventId    : mysql-bin.000002:0000000000740410;-1
pendingErrorSeqno      : 2011
pendingExceptionMessage: CSV loading failed: schema=croc_jenkins table=croc_deletevarcharnopkey CSV file=/tmp/staging/cookbook/staging0/croc_jenkins-croc_deletevarcharnopkey-2011.csv message=Wrapped org.postgresql.util.PSQLException: ERROR: syntax error at or near "AND"
                           Position: 142 (../../tungsten-replicator//samples/scripts/batch/redshift.js#190)

What is the possible cause?

Primary key information is not filled in by SimpleBatchApplier. Also, the
generated CSV file is corrupt (it contains a CR in the middle):

"I","2011","1","2015-01-05 
13:33:24.000","0","4.BnRyZxVJ4.BnRyZxVJ4.BnRyZxVJ4.BnRyZxVJ4.BnRyZxVJ4.BnRyZxVJ4
.BnRyZxVJ4.BnRyZxVJ4.BnRyZxVJ4.BnRyZxVJ"

What is the proposed solution?

Workaround: use an interval with an "s" suffix for seconds, e.g.
--svc-applier-block-commit-interval=10s rather than
--svc-applier-block-commit-interval=10

Additional information

The failure points to a possible race condition in the batch applier code.

Original issue reported on code.google.com by linas.vi...@continuent.com on 5 Jan 2015 at 4:03

GoogleCodeExporter commented 9 years ago
That's an interesting bug. I think you are perhaps seeing a concurrency bug
due to a lack of proper synchronization on the objects (i.e., Java class
instances) passed from the batch thread into the loading code. If the switch
is fast enough and there is no synchronization, the receiving threads may see
partially filled-out data, because instance data may still be resident in
machine registers or in L1-L3 caches on separate cores or CPUs. The usual fix
is to put synchronization on the methods belonging to shared objects. This
guarantees that data will be flushed and visible across threads.
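
To illustrate the kind of fix described above, here is a minimal sketch of a
synchronized hand-off between two threads. It is not taken from the Tungsten
code base: the class BatchHandoff, its fields, and the table and column names
are hypothetical stand-ins for whatever object the batch thread passes to the
loader. The point is only that both the writer and the reader go through
synchronized methods on the shared instance, so the reader cannot observe a
partially populated object.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for metadata handed from a batch thread to a loader thread. */
public class BatchHandoff {
    private String table;
    private List<String> keyColumns;
    private boolean ready;

    /** Writer side: fill in every field while holding the monitor, then wake readers. */
    public synchronized void publish(String table, List<String> keyColumns) {
        this.table = table;
        this.keyColumns = new ArrayList<>(keyColumns);
        this.ready = true;
        notifyAll();
    }

    /** Reader side: wait on the same monitor until the object is fully published. */
    public synchronized String describe() throws InterruptedException {
        while (!ready) {
            wait();
        }
        return table + " keys=" + keyColumns;
    }

    public static void main(String[] args) throws Exception {
        BatchHandoff handoff = new BatchHandoff();
        String[] seen = new String[1];

        // The loader thread may start before or after publish(); either order is safe.
        Thread loader = new Thread(() -> {
            try {
                seen[0] = handoff.describe();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        loader.start();

        // Hypothetical table and key column names, for illustration only.
        handoff.publish("example_table", Arrays.asList("id"));
        loader.join();
        System.out.println(seen[0]);
    }
}
```

Because publish() and describe() lock the same instance, the Java memory model
guarantees that everything written before the writer releases the monitor is
visible to a reader that acquires it afterwards. Whether this is the actual
cause of the failure in the batch applier would still need to be confirmed.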

Original comment by robert.h...@continuent.com on 5 Jan 2015 at 4:25

GoogleCodeExporter commented 9 years ago
I'm raising the priority for this, because it worries me that the CSV file has
been corrupted, which could potentially lead to data corruption if we're not
lucky enough to get an error.

Original comment by linas.vi...@continuent.com on 6 Jan 2015 at 5:36