pingcap / tidb-lightning

This repository has been moved to https://github.com/pingcap/br
Apache License 2.0

restore: Try to create tables in parallel #502

Closed hidehalo closed 3 years ago

hidehalo commented 3 years ago

What problem does this PR solve?

Issue Number: close #434

What is changed and how it works?

Benchmark

Time costs of tests/restore/run.sh with TABLE_COUNT=300 are reported below:

Before


________________________________________________________
Executed in  211.51 secs   fish           external 
   usr time   76.28 secs  187.00 micros   76.28 secs 
   sys time   44.62 secs  617.00 micros   44.62 secs 

[2020/12/08 17:06:33.389 +08:00] [INFO] [restore.go:964] ["restore all tables data completed"] [takeTime=1m9.093660687s] []
[2020/12/08 17:06:33.389 +08:00] [INFO] [restore.go:745] ["everything imported, stopping periodic actions"]
[2020/12/08 17:06:33.389 +08:00] [INFO] [restore.go:1409] ["skip full compaction"]
[2020/12/08 17:06:33.411 +08:00] [INFO] [restore.go:294] ["the whole procedure completed"] [takeTime=1m45.325956477s] []
[2020/12/08 17:06:33.411 +08:00] [INFO] [main.go:95] ["tidb lightning exit"]
[2020/12/08 17:06:33.411 +08:00] [INFO] [checksum.go:425] ["service safe point keeper exited"]

After


________________________________________________________
Executed in  213.24 secs   fish           external 
   usr time   78.08 secs  140.00 micros   78.08 secs 
   sys time   44.92 secs  475.00 micros   44.92 secs 

[2020/12/08 16:55:15.821 +08:00] [INFO] [restore.go:820] ["restore all tables data completed"] [takeTime=1m9.754043571s] []
[2020/12/08 16:55:15.821 +08:00] [INFO] [restore.go:601] ["everything imported, stopping periodic actions"]
[2020/12/08 16:55:15.821 +08:00] [INFO] [restore.go:1265] ["skip full compaction"]
[2020/12/08 16:55:15.840 +08:00] [INFO] [restore.go:293] ["the whole procedure completed"] [takeTime=1m42.140242288s] []
[2020/12/08 16:55:15.840 +08:00] [INFO] [main.go:95] ["tidb lightning exit"]
[2020/12/08 16:55:15.840 +08:00] [INFO] [checksum.go:425] ["service safe point keeper exited"]

PS: this benchmark ran against a single-node (non-cluster) TiDB, which may mean the single TiDB node, acting as both DDL owner and executor, blocked the whole pipeline. We should benchmark again on a TiDB cluster (multiple DDL non-owner nodes).

-------- Update ---------

Benchmark on a 1 PD / 3 TiDB / 4 TiKV cluster (single machine)

preset:

mysql> set @@global.tidb_scatter_region = "1";

Concurrent

[2020/12/29 14:55:09.523 +08:00] [INFO] [restore.go:503] ["restore schema completed"] [takeTime=2m15.052150251s] []

Serial

[2020/12/29 15:04:52.746 +08:00] [INFO] [restore.go:357] ["restore schema completed"] [takeTime=2m47.520433308s] []

Check List

Tests

Side effects

Related changes

CLAassistant commented 3 years ago

CLA assistant check
All committers have signed the CLA.

lance6716 commented 3 years ago

I think sql.DB has a pool; could we just use its method?

https://golang.org/pkg/database/sql/#DB.SetMaxIdleConns

hidehalo commented 3 years ago

I think sql.DB has a pool; could we just use its method?

https://golang.org/pkg/database/sql/#DB.SetMaxIdleConns

@lance6716 yes, that will be super!

glorv commented 3 years ago

@hidehalo Some integration tests failed with:

[2020-12-03T07:53:43.121Z] [2020/12/03 15:53:40.521 +08:00] [ERROR] [main.go:91] ["tidb lightning encountered error stack info"] [error="restore view schema db0 failed: create table failed: write tcp 127.0.0.1:23883->127.0.0.1:4000: use of closed network connection"] [errorVerbose="write tcp 127.0.0.1:23883->127.0.0.1:4000: use of closed network connection\ngithub.com/pingcap/errors.AddStack\n\t/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201029093017-5a7df2af2ac7/errors.go:174\ngithub.com/pingcap/errors.Trace\n\t/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201029093017-5a7df2af2ac7/juju_adaptor.go:15\ngithub.com/pingcap/tidb-lightning/lightning/common.SQLWithRetry.Exec.func1\n\t/home/jenkins/agent/workspace/lightning_ghpr_test/go/src/github.com/pingcap/tidb-lightning/lightning/common/util.go:187\ngithub.com/pingcap/tidb-lightning/lightning/common.Retry\n\t/home/jenkins/agent/workspace/lightning_ghpr_test/go/src/github.com/pingcap/tidb-lightning/lightning/common/util.go:125\ngithub.com/pingcap/tidb-lightning/lightning/common.SQLWithRetry.perform\n\t/home/jenkins/agent/workspace/lightning_ghpr_test/go/src/github.com/pingcap/tidb-lightning/lightning/common/util.go:110\ngithub.com/pingcap/tidb-lightning/lightning/common.SQLWithRetry.Exec\n\t/home/jenkins/agent/workspace/lightning_ghpr_test/go/src/github.com/pingcap/tidb-lightning/lightning/common/util.go:185\ngithub.com/pingcap/tidb-lightning/lightning/glue.(*ExternalTiDBGlue).ExecuteWithLog\n\t/home/jenkins/agent/workspace/lightning_ghpr_test/go/src/github.com/pingcap/tidb-lightning/lightning/glue/glue.go:71\ngithub.com/pingcap/tidb-lightning/lightning/restore.InitSchema.func1\n\t/home/jenkins/agent/workspace/lightning_ghpr_test/go/src/github.com/pingcap/tidb-lightning/lightning/restore/tidb.go:151\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357\ncreate table failed\nrestore view schema db0 failed"]

You should properly handle the task exit order to avoid executing SQL after the sql.DB is closed.
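One way to avoid this class of bug, as a sketch rather than the PR's actual code (the store type simulates a shared handle like *sql.DB, and jobCh/wg mirror the pattern under discussion): stop producing jobs, wait for every worker to drain, and only then close the shared handle:

```go
package main

import (
	"fmt"
	"sync"
)

// store simulates a shared handle like *sql.DB that must not be used after
// Close; the real fix has the same shape with database/sql.
type store struct {
	mu     sync.Mutex
	closed bool
}

func (s *store) Exec(stmt string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.closed {
		return fmt.Errorf("exec %q after close", stmt)
	}
	return nil
}

func (s *store) Close() {
	s.mu.Lock()
	s.closed = true
	s.mu.Unlock()
}

func main() {
	db := &store{}
	jobCh := make(chan string)
	var wg sync.WaitGroup

	for i := 0; i < 4; i++ { // workers
		wg.Add(1)
		go func() {
			defer wg.Done()
			for stmt := range jobCh {
				if err := db.Exec(stmt); err != nil {
					fmt.Println("error:", err)
				}
			}
		}()
	}

	for i := 0; i < 10; i++ {
		jobCh <- fmt.Sprintf("CREATE TABLE t%d (id int)", i)
	}
	close(jobCh) // 1) no more jobs
	wg.Wait()    // 2) workers fully drained
	db.Close()   // 3) only now is it safe to close
	fmt.Println("clean shutdown")
}
```

Reversing steps 2 and 3 is exactly what produces the "use of closed network connection" error above.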

hidehalo commented 3 years ago

@kennytm What's next?

lance6716 commented 3 years ago

/run-all-tests

lance6716 commented 3 years ago

there's a failing CI https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/lightning_ghpr_test/detail/lightning_ghpr_test/3082/pipeline/52, you could try to debug it

glorv commented 3 years ago

there's a failing CI https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/lightning_ghpr_test/detail/lightning_ghpr_test/3082/pipeline/52, you could try to debug it

The error log is:

[2020-12-11T02:25:13.453Z] [2020/12/11 10:25:11.057 +08:00] [ERROR] [main.go:91] ["tidb lightning encountered error stack info"] [error="restore view schema v failed: create view failed: Error 1273: Unknown collation: ''"] [errorVerbose="Error 1273: Unknown collation: ''\ngithub.com/pingcap/errors.AddStack\n\t/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201029093017-5a7df2af2ac7/errors.go:174\ngithub.com/pingcap/errors.Trace\n\t/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201029093017-5a7df2af2ac7/juju_adaptor.go:15\ngithub.com/pingcap/tidb-lightning/lightning/common.SQLWithRetry.Exec.func1\n\t/home/jenkins/agent/workspace/lightning_ghpr_test/go/src/github.com/pingcap/tidb-lightning/lightning/common/util.go:187\ngithub.com/pingcap/tidb-lightning/lightning/common.Retry\n\t/home/jenkins/agent/workspace/lightning_ghpr_test/go/src/github.com/pingcap/tidb-lightning/lightning/common/util.go:125\ngithub.com/pingcap/tidb-lightning/lightning/common.SQLWithRetry.perform\n\t/home/jenkins/agent/workspace/lightning_ghpr_test/go/src/github.com/pingcap/tidb-lightning/lightning/common/util.go:110\ngithub.com/pingcap/tidb-lightning/lightning/common.SQLWithRetry.Exec\n\t/home/jenkins/agent/workspace/lightning_ghpr_test/go/src/github.com/pingcap/tidb-lightning/lightning/common/util.go:185\ngithub.com/pingcap/tidb-lightning/lightning/glue.(*ExternalTiDBGlue).ExecuteWithLog\n\t/home/jenkins/agent/workspace/lightning_ghpr_test/go/src/github.com/pingcap/tidb-lightning/lightning/glue/glue.go:72\ngithub.com/pingcap/tidb-lightning/lightning/restore.(*restoreSchemaWorker).run.func1\n\t/home/jenkins/agent/workspace/lightning_ghpr_test/go/src/github.com/pingcap/tidb-lightning/lightning/restore/restore.go:414\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357\ncreate view failed\nrestore view schema v failed"]

I think this is caused by the fact that if a create table/view file contains multiple statements, they must be executed serially in the same session.
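A sketch of that constraint (the session callback here is hypothetical; with database/sql the equivalent is pinning one sql.Conn per file): statements from one file run serially on one session, so session-scoped state such as a previously set collation stays visible, while different files may still run in parallel:

```go
package main

import (
	"fmt"
	"sync"
)

// execFile runs all statements of one schema file serially on one session,
// so session-scoped state (e.g. a collation set by an earlier statement)
// is visible to the later statements of the same file.
func execFile(session func(string), stmts []string) {
	for _, s := range stmts {
		session(s)
	}
}

func main() {
	files := map[string][]string{
		"db0-schema-view.sql": {"SET collation_connection = utf8mb4_bin", "CREATE VIEW ..."},
		"db1-schema-view.sql": {"SET collation_connection = utf8mb4_bin", "CREATE VIEW ..."},
	}

	var mu sync.Mutex
	executed := make(map[string][]string)

	var wg sync.WaitGroup
	for name, stmts := range files {
		wg.Add(1)
		go func(name string, stmts []string) { // one session per file
			defer wg.Done()
			session := func(stmt string) {
				mu.Lock()
				executed[name] = append(executed[name], stmt)
				mu.Unlock()
			}
			execFile(session, stmts)
		}(name, stmts)
	}
	wg.Wait()
	fmt.Println("files executed:", len(executed))
}
```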

hidehalo commented 3 years ago

@glorv @lance6716 thx guys!

hidehalo commented 3 years ago

/run-all-tests

hidehalo commented 3 years ago

@glorv @lance6716 PTAL, the CI test still fails but my local integration test passed, and there is no error message in lightning.log. What could the reason be?

lance6716 commented 3 years ago

@glorv @lance6716 PTAL, the CI test still fails but my local integration test passed, and there is no error message in lightning.log. What could the reason be?

output failed on test various_types because OUTPUT DOES NOT CONTAIN 'd: 18446744073709551616.0', where output is

[2020-12-13T11:18:02.436Z] [Sun Dec 13 19:18:02 CST 2020] Executing SQL: SELECT a, b, c, d FROM vt.precise_types
[2020-12-13T11:18:02.436Z] *************************** 1. row ***************************
[2020-12-13T11:18:02.436Z] a: 18446744073709551614
[2020-12-13T11:18:02.436Z] b: -9223372036854775806
[2020-12-13T11:18:02.436Z] c: 99999999999999999999.0
[2020-12-13T11:18:02.436Z] d: 1.8e19

hidehalo commented 3 years ago

@glorv @lance6716 PTAL, the CI test still fails but my local integration test passed, and there is no error message in lightning.log. What could the reason be?

output failed on test various_types because OUTPUT DOES NOT CONTAIN 'd: 18446744073709551616.0', where output is

[2020-12-13T11:18:02.436Z] [Sun Dec 13 19:18:02 CST 2020] Executing SQL: SELECT a, b, c, d FROM vt.precise_types
[2020-12-13T11:18:02.436Z] *************************** 1. row ***************************
[2020-12-13T11:18:02.436Z] a: 18446744073709551614
[2020-12-13T11:18:02.436Z] b: -9223372036854775806
[2020-12-13T11:18:02.436Z] c: 99999999999999999999.0
[2020-12-13T11:18:02.436Z] d: 1.8e19

But I cannot reproduce it in my local integration test; I am very confused... Any ideas?

lance6716 commented 3 years ago

@glorv @lance6716 PTAL, the CI test still fails but my local integration test passed, and there is no error message in lightning.log. What could the reason be?

output failed on test various_types because OUTPUT DOES NOT CONTAIN 'd: 18446744073709551616.0', where output is

[2020-12-13T11:18:02.436Z] [Sun Dec 13 19:18:02 CST 2020] Executing SQL: SELECT a, b, c, d FROM vt.precise_types
[2020-12-13T11:18:02.436Z] *************************** 1. row ***************************
[2020-12-13T11:18:02.436Z] a: 18446744073709551614
[2020-12-13T11:18:02.436Z] b: -9223372036854775806
[2020-12-13T11:18:02.436Z] c: 99999999999999999999.0
[2020-12-13T11:18:02.436Z] d: 1.8e19

But I cannot reproduce it in my local integration test; I am very confused... Any ideas?

I'm checking whether it's a bug in TiDB at that version.

lance6716 commented 3 years ago

Sorry, it's a bug in TiDB at git hash 06cd92e05f0dfff1a139c1e5baca2ee24fb387b2; we might need to wait for TiDB to fix it.

hidehalo commented 3 years ago

Sorry, it's a bug in TiDB at git hash 06cd92e05f0dfff1a139c1e5baca2ee24fb387b2; we might need to wait for TiDB to fix it.

Great job! I am fixing a log message error (a small fix). If you have time, please help me review the code, thx! xD

hidehalo commented 3 years ago

@lance6716 new CI error below:

[2020-12-15T10:17:15.144Z] [2020/12/15 18:17:12.472 +08:00] [ERROR] [restore.go:432] ["execute SQL: CREATE ALGORITHM = UNDEFINED DEFINER = `root`@`192.168.198.178` SQL SECURITY DEFINER VIEW `db0`.`v2` (`s`) AS SELECT `s` FROM `db1`.`v1` WHERE `i`<2; failed"] [table=`db0`.`v2`] [takeTime=744.654µs] [error="create view failed: Error 1146: Table 'db1.v1' doesn't exist"]

It happened when restoring view schemas from the data source tests/view/data. I believe the reason is that the restore job of view db0.v2 does not wait for view db1.v1 to be restored. Maybe the restore jobs of views should not run concurrently 😕?

BTW, I cannot reproduce the error in my local integration test (TEST_NAME=view).

lance6716 commented 3 years ago

@lance6716 new CI error below:

[2020-12-15T10:17:15.144Z] [2020/12/15 18:17:12.472 +08:00] [ERROR] [restore.go:432] ["execute SQL: CREATE ALGORITHM = UNDEFINED DEFINER = `root`@`192.168.198.178` SQL SECURITY DEFINER VIEW `db0`.`v2` (`s`) AS SELECT `s` FROM `db1`.`v1` WHERE `i`<2; failed"] [table=`db0`.`v2`] [takeTime=744.654µs] [error="create view failed: Error 1146: Table 'db1.v1' doesn't exist"]

It happened when restoring view schemas from the data source tests/view/data. I believe the reason is that the restore job of view db0.v2 does not wait for view db1.v1 to be restored. Maybe the restore jobs of views should not run concurrently 😕?

BTW, I cannot reproduce the error in my local integration test (TEST_NAME=view).

https://github.com/pingcap/tidb-lightning/pull/502#discussion_r534814710

Please follow the comment above: there are 3 stages that support internal concurrency, but each stage should be fully finished before entering the next one.

hidehalo commented 3 years ago

@lance6716 new CI error below:

[2020-12-15T10:17:15.144Z] [2020/12/15 18:17:12.472 +08:00] [ERROR] [restore.go:432] ["execute SQL: CREATE ALGORITHM = UNDEFINED DEFINER = `root`@`192.168.198.178` SQL SECURITY DEFINER VIEW `db0`.`v2` (`s`) AS SELECT `s` FROM `db1`.`v1` WHERE `i`<2; failed"] [table=`db0`.`v2`] [takeTime=744.654µs] [error="create view failed: Error 1146: Table 'db1.v1' doesn't exist"]

It happened when restoring view schemas from the data source tests/view/data. I believe the reason is that the restore job of view db0.v2 does not wait for view db1.v1 to be restored. Maybe the restore jobs of views should not run concurrently 😕? BTW, I cannot reproduce the error in my local integration test (TEST_NAME=view).

#502 (comment)

Please follow the comment above: there are 3 stages that support internal concurrency, but each stage should be fully finished before entering the next one.

Yes, that is what I tried first, and we found https://github.com/pingcap/tidb-lightning/pull/502#issuecomment-742934293. Then, following that comment, I held all statements of one dbMeta together, which led to a new discussion: https://github.com/pingcap/tidb-lightning/pull/502#discussion_r542218427. Now the latest implementation is: after the database and table schema restore jobs are done, the view schema restore jobs execute concurrently (PS: the statements of one view schema restore job are executed serially). I think we should not go back to step one for now; what do you think? @glorv please join us when you have time.

hidehalo commented 3 years ago

@lance6716 there is a new CI error; it seems not to be a restore-schema error? It needs to be checked.

glorv commented 3 years ago

/run-all-tests

glorv commented 3 years ago

@lance6716 there is a new CI error; it seems not to be a restore-schema error? It needs to be checked.

Seems it's an unstable integration test; I'll take a look at it.

lance6716 commented 3 years ago

/reward 600

ti-challenge-bot[bot] commented 3 years ago

This PR does not have any linked issue.

Details **Tip**: You need to ensure that the link description follows this template: `Issue Number: #xxx` or `Issue Number: close #xxx`. About issue links, there is a trace [issue](https://github.com/tidb-community-bots/ti-challenge-bot/issues/178). **Warning**: None

lance6716 commented 3 years ago

/reward 600

ti-challenge-bot[bot] commented 3 years ago

You are not the mentor for the linked issue.

lance6716 commented 3 years ago

Waiting for the first mentor @glorv to /reward this PR, so the bot stops marking it as given up every 7 days.

glorv commented 3 years ago

/reward 600

ti-challenge-bot[bot] commented 3 years ago

Reward success.

hidehalo commented 3 years ago

/run-all-tests

lance6716 commented 3 years ago

almost LGTM.

  1. the worker could rely on a single exit path (considering only errors caused by itself, not the parent context being done): either the context's cancel function or closing jobCh (I prefer the latter)
  2. errCh could be replaced by an atomic variable to reduce complexity, like https://github.com/pingcap/tidb-lightning/blob/b23840d3fc3c61a2517e2cf97ed03b5fc302faf7/lightning/restore/restore.go#L839

OK to not change it if you reasonably explain the current implementation.

hidehalo commented 3 years ago

almost LGTM.

  1. the worker could rely on a single exit path (considering only errors caused by itself, not the parent context being done): either the context's cancel function or closing jobCh (I prefer the latter)
  2. errCh could be replaced by an atomic variable to reduce complexity, like https://github.com/pingcap/tidb-lightning/blob/b23840d3fc3c61a2517e2cf97ed03b5fc302faf7/lightning/restore/restore.go#L839

OK to not change it if you reasonably explain the current implementation.

The core pattern of this implementation is producer/consumer. Besides using jobCh as the queue that delivers job messages, I also use errCh to deliver error messages that notify the other goroutines to exit, and the context.Done() channel is monitored so the program can exit properly. Therefore, when either errCh or context.Done() receives a message, the whole program should exit.

In addition to delivering job messages, jobCh also serves as the exit signal for the consumer goroutines: when all messages in jobCh have been consumed and the jobs have been marked as completed via wg.Done(), the program exits normally by closing jobCh. However, the consumer goroutines may also be terminated by an error or by context cancellation/timeout. At that moment, jobCh may still hold unconsumed messages that were already marked as pending via wg.Add(), which is why the wait() method adds a goroutine monitoring waitCh, to prevent blocking forever. Correspondingly, in the consumer procedure doJob(), an error or context cancellation/timeout only interrupts the consumption of jobCh messages and the job-execution loop, and all jobs marked as pending are eventually marked as completed.

When all job messages are consumed correctly and execution finishes, the producer procedure makeJobs() returns its err value and jobCh is closed, so all consumer goroutines exit correctly. I think the current implementation is good enough, with enough comments; there is no need to change the concurrency pattern.

lance6716 commented 3 years ago

LGTM

hidehalo commented 3 years ago

a few trivial improvements, rest LGTM

Add unit tests, PTAL.

glorv commented 3 years ago

LGTM

ti-challenge-bot[bot] commented 3 years ago

@hidehalo, congratulations, you got 600 in this PR, and your total score is 600 in the high-performance challenge program.