is00hcw / tungsten-replicator

Automatically exported from code.google.com/p/tungsten-replicator

Using incorrect platform character set corrupts row data on extraction #282

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

1. Define a table with UTF8 as the default charset, as in the following 
example:  
CREATE TABLE `croc_insertvarcharutf8` (
  `id` int(11) NOT NULL,
  `f_data` varchar(100) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

2. Configure row replication. 
3. Install a Tungsten master with --mysql-use-bytes-for-string=false.  This 
suppresses binary transfer of string data. 
4. Insert rows containing UTF8 characters into the test table (at Continuent, 
the CROC InsertVarcharUtf8 test does this). 

What is the expected output?

Replicator should correctly extract UTF8 characters. 

What do you see instead?

Replicator corrupts data on extraction.  

What is the possible cause?

On most Linux hosts the Java platform charset defaults to ISO-8859-1 (Latin-1).  
The MySQLExtractor class consequently uses the wrong charset when it translates 
the UTF8 row data to Unicode strings. 
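
To illustrate the kind of corruption described here, the following is a minimal standalone sketch, not code from MySQLExtractor. It assumes the row data arrives as UTF-8 bytes and is decoded with ISO-8859-1 (Latin-1); the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it is not part of Tungsten Replicator.
class CharsetMismatchDemo {
    // Decode raw column bytes with an explicit charset, instead of relying
    // on the platform default as new String(byte[]) would.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // 4 characters, 5 bytes in UTF-8
        byte[] raw = original.getBytes(StandardCharsets.UTF_8);

        String right = decode(raw, StandardCharsets.UTF_8);
        String wrong = decode(raw, StandardCharsets.ISO_8859_1);

        // The correct charset round-trips the value.
        System.out.println(right.equals(original));
        // The wrong charset turns each UTF-8 byte into its own character,
        // so the two-byte sequence for U+00E9 becomes two Latin-1 characters.
        System.out.println(wrong.length() + " chars instead of " + original.length());
    }
}
```

The point of the sketch is that the bytes themselves are intact; the damage happens at the moment they are turned into a Java string with the wrong charset.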

What is the proposed solution?

Either the replicator should set the charset correctly in wrapper.conf at 
installation time, or the extractor should determine the character set from the 
binlog automatically.  Either way, there should be a safety check to avoid 
corrupting data in the event of misconfiguration. 
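
One possible shape for such a safety check is sketched below. This is only an assumption about how it could look, not the replicator's actual code: the class name, method names, and the idea of failing at startup are all hypothetical. It compares the JVM default charset with an expected charset name and refuses to continue on a mismatch.

```java
import java.nio.charset.Charset;

// Hypothetical startup guard; it is not part of Tungsten Replicator.
class PlatformCharsetCheck {
    // True if the actual charset is the one named by expectedName.
    // Charset.forName resolves aliases, so "UTF8" and "UTF-8" are equivalent.
    static boolean matches(Charset actual, String expectedName) {
        return actual.equals(Charset.forName(expectedName));
    }

    // Fail fast rather than extract data with the wrong platform charset.
    static void requireDefaultCharset(String expectedName) {
        Charset actual = Charset.defaultCharset();
        if (!matches(actual, expectedName)) {
            throw new IllegalStateException(
                "Platform charset is " + actual.name()
                + " but " + expectedName + " is required;"
                + " set -Dfile.encoding in wrapper.conf");
        }
    }

    public static void main(String[] args) {
        // Prints the charset the JVM actually started with, then checks it.
        System.out.println("Default charset: " + Charset.defaultCharset().name());
        requireDefaultCharset("UTF8");
    }
}
```

Running this with and without -Dfile.encoding=UTF8 is also a quick way to confirm that a wrapper.conf setting reached the JVM.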

Additional information

The current workaround for this problem is to configure wrapper.conf manually 
to set the platform charset.  Add the following line to wrapper.conf and 
restart the replicator: 

wrapper.java.additional.4=-Dfile.encoding=UTF8


Original issue reported on code.google.com by robert.h...@continuent.com on 8 Jan 2012 at 10:12

GoogleCodeExporter commented 9 years ago
You can now set the platform character set as follows.  I also added an option 
to set the timezone, since that is configured in wrapper.conf as well: 

tungsten-installer --direct -a \
  ...
  --java-file-encoding=UTF8 \
  --java-user-timezone=GMT-7 \
  --start-and-report

This sets the JVM timezone to GMT-7 (Pacific Daylight Time) and the platform 
file encoding to UTF8.  
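
As a hedged illustration of how the JVM interprets a timezone ID such as GMT-7, the sketch below uses only the standard java.util.TimeZone API. It is not installer or replicator code, and the class and method names are hypothetical.

```java
import java.util.TimeZone;

// Hypothetical demo class; it is not part of Tungsten Replicator.
class TimezoneOffsetDemo {
    // Raw offset from UTC, in whole hours, for the given timezone ID.
    static int rawOffsetHours(String id) {
        return TimeZone.getTimeZone(id).getRawOffset() / 3_600_000;
    }

    public static void main(String[] args) {
        // "GMT-7" is a custom fixed-offset ID: seven hours behind UTC.
        System.out.println(rawOffsetHours("GMT-7"));
        // The timezone the JVM is actually using, e.g. after -Duser.timezone.
        System.out.println(TimeZone.getDefault().getID());
    }
}
```

Two caveats apply. A fixed offset like GMT-7 does not follow daylight saving changes, and TimeZone.getTimeZone silently falls back to GMT when it does not recognize an ID, so it is worth checking the value the JVM reports.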

Original comment by robert.h...@continuent.com on 28 May 2012 at 10:17