hugowan / maatkit

Automatically exported from code.google.com/p/maatkit

mk-table-sync does not order sets of rows it is syncing #127

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
If this is a problem report, what steps will reproduce the problem?
I didn't try to reproduce the problem, since I don't have a non-critical
database whose data I can afford to lose again. But here is the command I
used when the problem happened:
sudo -H mk-table-sync --verbose --execute --synctomaster --replicate gprod.checksum /var/run/mysqld/mysqld.sock

What is the expected output? What do you see instead?
I expected it to connect to the master and sync to the slave the tables that
have a different checksum. Instead, I had data deleted on the master because
of DELETE queries that ran, followed by empty REPLACE queries like this:
REPLACE INTO `mydb`.`mytable`() VALUES ()/*!*/;

What information do you get from mk-<toolname> --version?
mk-table-sync  Ver 1.0.10 Distrib 2442 Changeset 2439

What is your MySQL, Perl, DBI, and DBD::mysql version?
MySQL: master: 5.0.51a-9-log, slave: 5.0.51a-15~bpo40+1-log
Perl: 5.8.8-7etch3
libdbi-perl: 1.53-1etch1
libdbd-mysql-perl 3.0008-1

What is your operating system/distribution?
Debian Etch 2.6.22-4-vserver-686

Please provide any additional information below.

Original issue reported on code.google.com by frero...@gmail.com on 6 Nov 2008 at 8:22

GoogleCodeExporter commented 9 years ago
We've got the same bug. Here is a fragment from the mk-table-sync debug output:

# ChangeHandler:1686 7319 DELETE where `id`=286
...
# ChangeHandler:1686 7319 INSERT where `id`=286
...
# ChangeHandler:1678 7319 Calling subroutines on DELETE FROM `twibo`.`ss_twibo_users` WHERE `id`=286 LIMIT 1
# TableSyncer:3257 7319 About to execute: DELETE FROM `twibo`.`ss_twibo_users` WHERE `id`=286 LIMIT 1
...
# ChangeHandler:1807 7319 Fetching data for UPDATE: SELECT * FROM `twibo`.`ss_twibo_users` WHERE `id`=286 LIMIT 1
# ChangeHandler:1678 7319 Calling subroutines on REPLACE INTO `twibo`.`ss_twibo_users`() VALUES ()
# TableSyncer:3257 7319 About to execute: REPLACE INTO `twibo`.`ss_twibo_users`() VALUES ()

RHEL 5
MySQL 5.0.45-log
Perl 5.8.8

Original comment by a...@skoosh.com on 6 Nov 2008 at 9:08

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Hello,

We have actually debugged this further, and it turns out the problem lies in
the source and destination queries returning results in different order:
mk-table-sync doesn't sort the results of the query that a plugin (algorithm)
creates (in our case that was Chunk).

So, because of the different ordering of the results, at some point the right
side got a very high ID from its query, while the left side kept working
through the smaller IDs for a long time. All that time, false INSERT
statements were being added to the queue, and the right side stayed on the
same row until the left side was exhausted. Then, for all the remaining rows
on the right side, (mostly) false DELETE statements were added to the queue.

It looked like this:

# RowDiff:1289 16430 id is numeric, left - $VAR1 = {id => '90',__crc => '2391288557'};, right - $VAR1 = {id => '90',__crc => '2391288557'};
# RowDiff:1251 16430 Key comparison on left and right: 0
# RowDiff:1256 16430 Left and right have the same key
# RowDiff:1240 16430 Fetching row from left
# RowDiff:1244 16430 Fetching row from right
# TableSyncChunk:2775 16430 State 2, key cols id
# RowDiff:1278 16430 Comparing keys using columns id
# RowDiff:1289 16430 id is numeric, left - $VAR1 = {id => '91',__crc => '1543430793'};, right - $VAR1 = {id => '91',__crc => '1543430793'};
# RowDiff:1251 16430 Key comparison on left and right: 0
# RowDiff:1256 16430 Left and right have the same key
# RowDiff:1240 16430 Fetching row from left
# RowDiff:1244 16430 Fetching row from right
# TableSyncChunk:2775 16430 State 2, key cols id
# RowDiff:1278 16430 Comparing keys using columns id
# RowDiff:1289 16430 id is numeric, left - $VAR1 = {id => '367',__crc => '2011013830'};, right - $VAR1 = {id => '542',__crc => '1228036341'};
# RowDiff:1251 16430 Key comparison on left and right: -1
# RowDiff:1261 16430 Left is not in right
# TableSyncChunk:2775 16430 State 2, key cols id
# ChangeHandler:1688 16430 INSERT where `id`=367
# ChangeHandler:1714 16430 Queueing change for later
# RowDiff:1240 16430 Fetching row from left
# TableSyncChunk:2775 16430 State 2, key cols id
# RowDiff:1278 16430 Comparing keys using columns id
# RowDiff:1289 16430 id is numeric, left - $VAR1 = {id => '93',__crc => '357568227'};, right - $VAR1 = {id => '542',__crc => '1228036341'};
# RowDiff:1251 16430 Key comparison on left and right: -1
# RowDiff:1261 16430 Left is not in right
# TableSyncChunk:2775 16430 State 2, key cols id
# ChangeHandler:1688 16430 INSERT where `id`=93
# ChangeHandler:1714 16430 Queueing change for later

And so on: id 542 on the right side kept repeating over and over while the
left side went through the rest of its results.
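
For illustration, here is a minimal standalone sketch (not the actual RowDiff
code; the data at the end is made up) of the merge-style comparison that
produces those INSERT/DELETE decisions. It is only correct when both inputs
arrive sorted by the key; feed it an unsorted right-hand side and it emits
exactly the kind of false INSERTs and DELETEs shown above.

#!/usr/bin/perl
# Minimal standalone sketch of a merge-style row diff keyed by `id`
# (not the actual RowDiff code).  It is only correct when both row
# lists are sorted by the key.
use strict;
use warnings;

sub diff_rows {
   my ( $left, $right ) = @_;   # array refs of { id => ..., __crc => ... }
   my ( $i, $j ) = ( 0, 0 );
   my @changes;
   while ( $i < @$left && $j < @$right ) {
      my $cmp = $left->[$i]{id} <=> $right->[$j]{id};
      if ( $cmp < 0 ) {         # key seen only on the left => INSERT
         push @changes, "INSERT where id=$left->[$i]{id}";
         $i++;
      }
      elsif ( $cmp > 0 ) {      # key seen only on the right => DELETE
         push @changes, "DELETE where id=$right->[$j]{id}";
         $j++;
      }
      else {                    # same key: compare checksums
         push @changes, "UPDATE where id=$left->[$i]{id}"
            if $left->[$i]{__crc} ne $right->[$j]{__crc};
         $i++; $j++;
      }
   }
   push @changes, "INSERT where id=$left->[$_]{id}"  for $i .. $#{$left};
   push @changes, "DELETE where id=$right->[$_]{id}" for $j .. $#{$right};
   return @changes;
}

# Identical data, but the right side arrives unsorted: the diff reports
# spurious INSERTs and DELETEs even though nothing actually differs.
my @rows     = map { { id => $_, __crc => "crc$_" } } 1 .. 5;
my @unsorted = @rows[ 4, 0, 1, 2, 3 ];   # id 5 arrives first
print "$_\n" for diff_rows( \@rows, \@unsorted );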

We made a quick and dirty hack to make this work correctly, at least for our
table. After this block:

      if ( $args{transaction} ) {
         if ( $change_dbh && $change_dbh eq $args{src_dbh} ) {
            $src_sql .= ' FOR UPDATE';
            $dst_sql .= ' LOCK IN SHARE MODE';
         }
         elsif ( $change_dbh ) {
            $src_sql .= ' LOCK IN SHARE MODE';
            $dst_sql .= ' FOR UPDATE';
         }
         else {
            $src_sql .= ' LOCK IN SHARE MODE';
            $dst_sql .= ' LOCK IN SHARE MODE';
         }
      }

We've added sorting like this:

      my $order_by = ' ORDER BY '
         . join(',', map { $plugin->{quoter}->quote($_) . ' ASC' } @{$plugin->key_cols});
      $src_sql .= $order_by;
      $dst_sql .= $order_by;
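
For what it's worth, a self-contained sketch of what that hack builds; the
quoter and key-column list below are stand-ins for the real plugin's objects,
and the base query is abbreviated:

#!/usr/bin/perl
# Standalone sketch of the ORDER BY construction above.  The quoter and
# key-column list are stand-ins for the real plugin's objects.
use strict;
use warnings;

my $quote    = sub { my ($col) = @_; return "`$col`" };
my @key_cols = ('id');

my $src_sql = 'SELECT `id`, CRC32(...) AS __crc FROM `twibo`.`ss_twibo_users` '
            . 'USE INDEX (`PRIMARY`) WHERE (1=1) AND (1=1)';
my $dst_sql = $src_sql;

my $order_by = ' ORDER BY ' . join( ',', map { $quote->($_) . ' ASC' } @key_cols );
$src_sql .= $order_by;
$dst_sql .= $order_by;

print "$src_sql\n";   # ... WHERE (1=1) AND (1=1) ORDER BY `id` ASC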

So, it definitely looks like sorting is missing from the algorithms in
mk-table-sync. We are not sure how this was supposed to work correctly
without sorting.

We understand that the provided hack is probably not suited to all situations
and tables, so we would greatly appreciate it if this problem could be fixed
in the next official release of Maatkit.

Many thanks.

Original comment by eyc...@gmail.com on 6 Nov 2008 at 11:50

GoogleCodeExporter commented 9 years ago
Indeed the algorithms assume that the data will be read in index order and
that will be good enough.  If an ORDER BY is specified, an external sort
("Using filesort") may be necessary, and on large data sets that could be
extremely expensive.

I think perhaps the best solution is to EXPLAIN the SELECT with the ORDER BY
and make sure there is no filesort.  Actually, a filesort should be allowed
for data sets whose estimated size will be smaller than some default value,
like 10MB.  If there is a filesort, and the 'rows' column from EXPLAIN times
the average row size provided by SHOW TABLE STATUS is larger than the limit,
die with an error message like "filesort of an estimated XYZ bytes would be
needed; increase --filesortlimit to sync this table."

If you can provide the output of SHOW CREATE TABLE and the output of EXPLAIN,
with and without the ORDER BY, that would be a big help.  Please use the \G
query terminator so it doesn't turn into an exercise in unwrapping long lines
in the bug report :-)

Original comment by baron.schwartz on 7 Nov 2008 at 1:05

GoogleCodeExporter commented 9 years ago
Hello again.

The information you requested is below. The results of EXPLAIN are identical
on the master and the slave that we were syncing; I specifically checked that.

I also wanted to mention why we wanted to sync the tables: some of the tables
in the slave database got corrupted by MySQL itself (they are all MyISAM), and
after the tables were REPAIRed, MySQL reported that some rows were now missing
from them. Maybe that is why, in our case, the order of the records in the
query was not what you were initially expecting? In any case, this problem
leads to data being lost on both the master and the slave when it happens
(because of the replication mechanism Maatkit uses to deliver the changes to
the slaves), so it's pretty dangerous and it would be great if it were
addressed.

Even if the only solution is to use ORDER BY, which would be slow on big
tables, we would probably still prefer it, because our only alternative right
now, when we want to sync the data on the slaves, is to stop our whole system,
lock the master, copy all 7 GB of data, and then deliver it to all the slaves.
Your way of syncing is certainly much better, because our system keeps working
while the sync is being done, and only the tables whose data differs get
synced.

mysql> show create table ss_twibo_users\G
*************************** 1. row ***************************
       Table: ss_twibo_users
Create Table: CREATE TABLE `ss_twibo_users` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `group_id` tinyint(3) NOT NULL default '0',
  `first_name` varchar(50) NOT NULL default '',
  `last_name` varchar(50) NOT NULL default '',
  `email` varchar(100) NOT NULL default '',
  `timezone` float NOT NULL default '0',
  `p_full` tinyint(1) unsigned default NULL,
  `p_only_hotels` tinyint(1) unsigned default NULL,
  `p_only_affs` tinyint(1) unsigned default NULL,
  `s_email_changes` tinyint(1) unsigned default NULL,
  `position` varchar(50) default NULL,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=590 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT `id`, CRC32(CONCAT_WS('#', `id`, `group_id`, `first_name`,
`last_name`, `email`, `timezone`, `p_full`, `p_only_hotels`, `p_only_affs`,
`s_email_changes`, `position`, CONCAT(ISNULL(`p_full`), ISNULL(`p_only_hotels`),
ISNULL(`p_only_affs`), ISNULL(`s_email_changes`), ISNULL(`position`)))) AS __crc
FROM `twibo`.`ss_twibo_users` USE INDEX (`PRIMARY`) WHERE (1=1) AND (1=1)\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ss_twibo_users
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 306
        Extra:
1 row in set (0.01 sec)

mysql> EXPLAIN SELECT `id`, CRC32(CONCAT_WS('#', `id`, `group_id`, `first_name`,
`last_name`, `email`, `timezone`, `p_full`, `p_only_hotels`, `p_only_affs`,
`s_email_changes`, `position`, CONCAT(ISNULL(`p_full`), ISNULL(`p_only_hotels`),
ISNULL(`p_only_affs`), ISNULL(`s_email_changes`), ISNULL(`position`)))) AS __crc
FROM `twibo`.`ss_twibo_users` USE INDEX (`PRIMARY`) WHERE (1=1) AND (1=1)
ORDER BY `id` ASC\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ss_twibo_users
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 306
        Extra: Using filesort
1 row in set (0.00 sec)

Original comment by eyc...@gmail.com on 7 Nov 2008 at 12:53

GoogleCodeExporter commented 9 years ago
I had a similar problem. At the time it happened I did not have the data to
show.

My result information was different than what is described here, but the
symptoms and results sound the same. I will wait for the fix on this, and
retry it on my database to see if it solves my problem.

Original comment by tim...@gmail.com on 10 Dec 2008 at 7:21

GoogleCodeExporter commented 9 years ago
This issue is really the same as issue 71, so I'm merging it in and closing it
as a duplicate.  I would also note that I don't see us having time to fix this
for free anytime soon, so if it's important to you, you should consider either
sponsoring us to work on it for pay, or submitting a well-tested patch
yourself.

Original comment by baron.schwartz on 25 Dec 2008 at 10:09

GoogleCodeExporter commented 9 years ago
Is anyone else interested in sponsoring this bug? Please email me at
timhon [ at ] g mail [ dot ] com.

Original comment by tim...@gmail.com on 27 Dec 2008 at 8:10