Support for parallel copy (FR)

aadant commented 8 years ago

For big tables, it should be possible to copy the rows in parallel. It is even more obvious for partitioned tables + replication is now parallel. I propose this syntax : --copy-threads

shlomi-noach commented 8 years ago

gh-ost was in particular designed to to serialized copy to solve the locking problems introduced by the parallel copy.

I need a very strong incentive to parallelize again.

I'm unsure whether at this time partitioned tables actually allow concurrent writes to different partitions (didn't use to allow that).

For non-partitioned tables this is an absolute NO-GO. Parallel INSERTs into same table are again and again verified to be slower than sequential INSERTs. Potentially some UPDATEs might gain from this but that's not what this Issue is about.

Unless I see a strong case I'm going to label as WontFix.

aadant commented 8 years ago

pt-online-schema-change is also single-threaded. Single threaded performance is known to decrease with MySQL releases ! Multi-threaded performance does not. It is a pity that we have more and more hardware and we can not use it ! If InnoDB can not scale writes in parallel even on the same table, then there is an issue with InnoDB :-)

Partitioned tables do allow parallel writes for sure. There are hints for partitions for DML http://dev.mysql.com/doc/refman/5.6/en/partitioning-selection.html and most of the metadata required is in information_schema.partitions, similar estimates to tables. Partitioning clearly allows for divide and conquer copy.

shlomi-noach commented 8 years ago

pt-online-schema-change is also single-threaded.

On rowcopy, yes. However the triggers create parallelism, which cause locking: https://github.com/github/gh-ost/blob/master/doc/why-triggerless.md

Single threaded performance is known to decrease with MySQL releases !

Not on single table access; it's still faster than concurrent (I'm talking about bulk loads workload)

Partitioned tables do allow parallel writes for sure. There are hints for partitions for DML http://dev.mysql.com/doc/refman/5.6/en/partitioning-selection.html and most of the metadata required is in information_schema.partitions, similar estimates to tables. Partitioning clearly allows for divide and conquer copy.

This does not depict the full picture. At least on 5.5 (and as I recall also on 5.6, early releases), writing to a partition, even InnoDB one, would require traversing the entire partition set to attempt to acquire lock, which would not be acquired after all, because there would be no need to. But the traversal is there, and it is a serializing mechanism. This is not as severe as "only write to a single partition at any given time" but also not "no contention between writes to different partitions".

I'm happy to see benchmarks that show write throughout serial vs. parallel on partitioned tables. If these show significant improvement with parallelism, then I'm happy to pursue this.

dveeden commented 8 years ago

I agree that it would be very useful if it would actually works. But I also agree that this could make things more complicated (locking issues) and that it might not work as expected.

Maybe applying changes which are read from the binlog stream can be done in parallel or maybe grouped into one transaction (like group commit). Is there any group info available from the binlogs?

Another thing we could (but probably should not) do is to prefetch pages by selecting it. This would only speedup the writes on the master because the pages are already in the buffer pool.

We should not focus only on InnoDB. Running gh-ost against MyRocks, TokuDB or NDB might work and might be able to benefit from parallelism. (NDB might not be the best example here..) On the other hand: Getting gh-ost to work in a reliable and easy way on single threaded InnoDB should be the first priority...

For copying the table: Multiple INSERT INTO x SELECT FROM y... processes might not be faster than a single one. Both for partitioned and non-partitioned tables. Also the order here is quite strict and because it is in the same schema the 5.6 parallel replication won't work. I'm not sure if the 5.7 logical clock stuff will work. The MariaDB implementation might work with slave_parallel_mode=(optimistic|aggressive). But there is another trick which might work: Create individual tables for each partition and copy to these and then doing ALTER TABLE t1 EXCHANGE PARTITION p1 WITH TABLE _t1_p1_gho. Then the writes to to multiple tables, which might allow better performance for parallel replication and InnoDB. This won't work for some partitioning types.

shlomi-noach commented 8 years ago

But there is another trick which might work: Create individual tables for each partition and copy to these and then doing ALTER TABLE t1 EXCHANGE PARTITION p1 WITH TABLE _t1_p1_gho. Then the writes to to multiple tables, which might allow better performance for parallel replication and InnoDB. This won't work for some partitioning types.

This is certainly cool; it would leave the server in weird state if gh-ost crashes. I'd label this as "far away into the future" though, unless anyone wants to do that. And still I'd rather not-yet.

aadant commented 8 years ago

I tested the exchange on a big partitioned table (3 billion rows, 500Gb, 90 partitions) with conventional online alter and it went twice faster in 5.6 for full rebuild. One problem was exchanging back because the full table is fully read to see if it matches the partitioning rules. 5.7 has exchange without validation which makes the exchange nearly immediate though serialized due to MDL. If the exchange is done without validation, the gain is a factor 3-4. Still interesting, especially for testing.

About the "would leave the server in weird state if gh-ost crashes", it would be the case anyway ! You would have to drop everything that was in progress. There would be some cleanup if the server crashes just like for pt-online-schema-change,

I'd label this as "far away into the future" though, unless anyone wants to do that. And still I'd rather >not-yet.

Sure, this is certainly something that the community can contribute to if it is not on your road map ! I agree there are more important things to do.

shlomi-noach commented 8 years ago

I tested the exchange on a big partitioned table (3 billion rows, 500Gb, 90 partitions) with conventional online alter and it went twice faster in 5.6 for full rebuild. One problem was exchanging back because the full table is fully read to see if it matches the partitioning rules. 5.7 has exchange without validation which makes the exchange nearly immediate though serialized due to MDL. If the exchange is done without validation, the gain is a factor 3-4. Still interesting, especially for testing.

Sorry, I didn't follow through. What was the experiment exactly?

A factor of 3-4 reduction in runtime is very interesting and justifies such a development.

About the "would leave the server in weird state if gh-ost crashes", it would be the case anyway ! You would have to drop everything that was in progress. There would be some cleanup if the server crashes just like for pt-online-schema-change,

True.

Sure, this is certainly something that the community can contribute to if it is not on your road map !

It's just that it's a relatively big change right now to work on.

aadant commented 8 years ago

Sorry, I didn't follow through. What was the experiment exactly?

The parallel rebuild of a big partitioned table using exchange partitions :

if P is the number of partitions

exchange partitions with P empty tables
alter the P tables in parallel
alter the partitioned table (fast since empty)
exchange partitions with P tables

you can parallelize 2.

can't be but is immediate in 5.7 using the "without validation" option.

I am not sure we could apply this to gh-ost though but it works very well using online alters if you do not crash in the middle. So good for development.

shlomi-noach commented 8 years ago

Thanks for explaining. Very interesting.

valeriikravchuk commented 8 years ago

I'm happy to see benchmarks that show write throughout serial vs. parallel on partitioned tables. If these show significant improvement with parallelism, then I'm happy to pursue this.

I've set up the following primitive benchmark:

1. I've created a range-partitioned tables with 4 partitions and filled it with 400000 rows of data, equally spread over partitions:

mysql> create table trange2(id int auto_increment primary key,
-> c1 char(200)) engine=InnoDB
-> partition by range(id) (
-> partition p1 values less than (100000),
-> partition p2 values less than (200000),
-> partition p3 values less than (300000),
-> partition p4 values less than maxvalue);
Query OK, 0 rows affected (1.34 sec)

mysql> insert into trange2 values (1, repeat('a', 200));
Query OK, 1 row affected (0.09 sec)

mysql> insert into trange2(c1) select repeat('b', 200) from trange2;
Query OK, 1 row affected (0.09 sec)
Records: 1 Duplicates: 0 Warnings: 0

mysql> insert into trange2(c1) select repeat('b', 200) from trange2;
Query OK, 2 rows affected (0.08 sec)
Records: 2 Duplicates: 0 Warnings: 0
...

mysql> insert into trange2(c1) select repeat('b', 200) from trange2;
Query OK, 262144 rows affected (14.48 sec)
Records: 262144 Duplicates: 0 Warnings: 0

mysql> select count(*) from trange2;
+----------+
| count(*) |
+----------+
| 524288 |
+----------+
1 row in set (2.08 sec)

mysql> select count(*) from trange2;
+----------+
| count(*) |
+----------+
| 400000 |
+----------+
1 row in set (2.69 sec)

mysql> select count(*) from trange2 partition (p1);
+----------+
| count(*) |
+----------+
| 99999 |
+----------+
1 row in set (1.27 sec)

mysql> select count(*) from trange2 partition (p2);
+----------+
| count(*) |
+----------+
| 100000 |
+----------+
1 row in set (0.48 sec)

mysql> select count(*) from trange2 partition (p3);
+----------+
| count(*) |
+----------+
| 100000 |
+----------+
1 row in set (0.53 sec)

mysql> select count(*) from trange2 partition (p4);
+----------+
| count(*) |
+----------+
| 100001 |
+----------+
1 row in set (0.04 sec)

2. For each follow test of data copying approach I drop and re-create similarly partitioned table:

mysql> drop table if exists trange3;
Query OK, 0 rows affected (1.32 sec)

mysql> create table trange3 like trange2;
Query OK, 0 rows affected (1.55 sec)

mysql> show create table trange3\G
*************************** 1. row ***************************
Table: trange3
Create Table: CREATE TABLE `trange3` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`c1` char(200) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (id)
(PARTITION p1 VALUES LESS THAN (100000) ENGINE = InnoDB,
PARTITION p2 VALUES LESS THAN (200000) ENGINE = InnoDB,
PARTITION p3 VALUES LESS THAN (300000) ENGINE = InnoDB,
PARTITION p4 VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */
1 row in set (0.00 sec)

3. Then I tested INSERT ... SELECT single thread approach:

openxs@ao756:~/dbs/maria10.2$ time (mysql -uroot -proot -e'insert into trange3 select * from trange2' test)
mysql: [Warning] Using a password on the command line interface can be insecure.

real 0m31.986s
user 0m0.005s
sys 0m0.010s

4. Now, the same action with concurrent partitions access:

openxs@ao756:~/dbs/maria10.2$ time (for i in 1 2 3 4; do mysql -uroot -proot -e"insert into trange3 select * from trange2 partition (p$i)" test & done; wait)
mysql: [Warning] Using a password on the command line interface can be insecure.
mysql: mysql: mysql: [Warning] Using a password on the command line interface can be insecure.
[Warning] Using a password on the command line interface can be insecure.
[Warning] Using a password on the command line interface can be insecure.

real 0m20.078s
user 0m0.032s
sys 0m0.017s

It works 1.5 times faster! This is on just 2 cores and HDD, but we had not been entirely I/O bound (from iostat):

avg-cpu: %user %nice %system %iowait %steal %idle
5,10 0,00 9,69 69,90 0,00 15,31

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,00 0,00 78,00 158,00 1248,00 10524,00 99,76 3,30 13,86 24,62 8,56 4,17 98,40
dm-0 0,00 0,00 82,00 152,00 1312,00 7452,00 74,91 3,69 15,95 24,78 11,18 4,26 99,60
dm-1 0,00 0,00 82,00 138,00 1312,00 7448,00 79,64 3,69 16,96 24,78 12,32 4,53 99,60
dm-2 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00

This 1.5 times difference is repeatable:

openxs@ao756:~/dbs/maria10.2$ time (mysql -uroot -proot -e'insert into trange3 select * from trange2' test)
mysql: [Warning] Using a password on the command line interface can be insecure.

real 0m30.781s
user 0m0.017s
sys 0m0.001s

openxs@ao756:~/dbs/maria10.2$ time (for i in 1 2 3 4; do mysql -uroot -proot -e"insert into trange3 select * from trange2 partition (p$i)" test & done; wait)
mysql: mysql: [Warning] Using a password on the command line interface can be insecure.
mysql: [Warning] Using a password on the command line interface can be insecure.
[Warning] Using a password on the command line interface can be insecure.
mysql: [Warning] Using a password on the command line interface can be insecure.

real 0m20.281s
user 0m0.032s
sys 0m0.014s

Test was performed on  Percona Server 5.7.14-7 with binary logging enabled, running on Ubuntu 14.04 on my old primitive netbook with 2 cores:

openxs@ao756:~/dbs/maria10.2$ pt-summary
# Percona Toolkit System Summary Report ######################
Date | 2016-08-26 15:56:06 UTC (local TZ: EEST +0300)
Hostname | ao756
Uptime | 4 days, 8:32, 4 users, load average: 0,08, 0,15, 0,39
Platform | Linux
Release | Ubuntu 14.04.5 LTS (trusty)
Kernel | 3.13.0-92-generic
Architecture | CPU = 64-bit, OS = 64-bit
Threading | NPTL 2.19
SELinux | No SELinux detected
Virtualized | No virtualization detected
# Processor ##################################################
Processors | physical = 1, cores = 2, virtual = 2, hyperthreading = no
Speeds | 2x1500.000
Models | 2xIntel(R) Pentium(R) CPU 987 @ 1.50GHz
Caches | 2x2048 KB
...

shlomi-noach commented 8 years ago

@valeriikravchuk thank you for the benchmark!

Looking somewhat deeper into this, the code change would be non-trivial, and would consume some time. This feature is not on my immediate roadmap and will be pushed till a later time, unless someone wants to do it (and in which case, please consult with me beforehand -- thank you!)

pratik60 commented 7 years ago

@shlomi-noach 👋

Quick question regarding the statement below

For non-partitioned tables this is an absolute NO-GO. Parallel INSERTs into same table are again and again verified to be slower than sequential INSERTs.

Do you have any links which actually benchmark parallel inserts into the same table? We have a table with > 6 billion rows, and I estimate gh-ost to take around 3 weeks to run a migration on this table. Except sharding (which we are working towards), is there any way to speed up a gh-ost migration on such tables?

shlomi-noach commented 7 years ago

Do you have any links which actually benchmark parallel inserts into the same table?

I do not have public ones (I haven't searched). I've had multiple attempts in my past observing this.

We have a table with > 6 billion rows, and I estimate gh-ost to take around 3 weeks to run a migration on this table.

We just came out of a 15 day migration. By far, what pushed this migration's progress back was replication capacity. Not write optimizations on the master. Well, that's on our setup. If yours is anything like ours, I'd advise making replicas go faster...

The usual innodb_flush_logs_at_trx_commit=2, sync_binlog=0 would do the trick, while taking risk of lost data or lost replication coordinates on crash.

pratik60 commented 7 years ago

@shlomi-noach I ran this on aurora to test it out, in an idle db of db.r3.8xlarge

Single Inserts - 188 minutes for 100M rows Parallel (2x) Inserts - 109 minutes for 100M rows

In one parallel thread, I started from 0 million to 50 million, while in the other I did from 50 million, to 100 million. Now this could be because of how aws aurora scales which is a black box.

Just wanted to share results, for those curious.

shlomi-noach commented 7 years ago

@pratik60 thank you, very interesting!

druud62 commented 6 years ago

If the PK-ranges don't overlap, they will not split each other's pages.

github / gh-ost

Support for parallel copy (FR) #193