Closed hidehalo closed 3 years ago
I think `sql.DB` has a pool; could we just use its methods?
@lance6716 yes, that will be super!
@hidehalo Some integration tests failed with:
[2020-12-03T07:53:43.121Z] [2020/12/03 15:53:40.521 +08:00] [ERROR] [main.go:91] ["tidb lightning encountered error stack info"] [error="restore view schema db0 failed: create table failed: write tcp 127.0.0.1:23883->127.0.0.1:4000: use of closed network connection"] [errorVerbose="write tcp 127.0.0.1:23883->127.0.0.1:4000: use of closed network connection\ngithub.com/pingcap/errors.AddStack\n\t/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201029093017-5a7df2af2ac7/errors.go:174\ngithub.com/pingcap/errors.Trace\n\t/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201029093017-5a7df2af2ac7/juju_adaptor.go:15\ngithub.com/pingcap/tidb-lightning/lightning/common.SQLWithRetry.Exec.func1\n\t/home/jenkins/agent/workspace/lightning_ghpr_test/go/src/github.com/pingcap/tidb-lightning/lightning/common/util.go:187\ngithub.com/pingcap/tidb-lightning/lightning/common.Retry\n\t/home/jenkins/agent/workspace/lightning_ghpr_test/go/src/github.com/pingcap/tidb-lightning/lightning/common/util.go:125\ngithub.com/pingcap/tidb-lightning/lightning/common.SQLWithRetry.perform\n\t/home/jenkins/agent/workspace/lightning_ghpr_test/go/src/github.com/pingcap/tidb-lightning/lightning/common/util.go:110\ngithub.com/pingcap/tidb-lightning/lightning/common.SQLWithRetry.Exec\n\t/home/jenkins/agent/workspace/lightning_ghpr_test/go/src/github.com/pingcap/tidb-lightning/lightning/common/util.go:185\ngithub.com/pingcap/tidb-lightning/lightning/glue.(*ExternalTiDBGlue).ExecuteWithLog\n\t/home/jenkins/agent/workspace/lightning_ghpr_test/go/src/github.com/pingcap/tidb-lightning/lightning/glue/glue.go:71\ngithub.com/pingcap/tidb-lightning/lightning/restore.InitSchema.func1\n\t/home/jenkins/agent/workspace/lightning_ghpr_test/go/src/github.com/pingcap/tidb-lightning/lightning/restore/tidb.go:151\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357\ncreate table failed\nrestore view schema db0 failed"]
You should properly handle the task exit order to avoid executing SQL after the `sql.DB` is closed.
@kennytm What's next?
/run-all-tests
There's a failing CI run at https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/lightning_ghpr_test/detail/lightning_ghpr_test/3082/pipeline/52; you could try to debug it.
The error log is:
[2020-12-11T02:25:13.453Z] [2020/12/11 10:25:11.057 +08:00] [ERROR] [main.go:91] ["tidb lightning encountered error stack info"] [error="restore view schema v failed: create view failed: Error 1273: Unknown collation: ''"] [errorVerbose="Error 1273: Unknown collation: ''\ngithub.com/pingcap/errors.AddStack\n\t/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201029093017-5a7df2af2ac7/errors.go:174\ngithub.com/pingcap/errors.Trace\n\t/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201029093017-5a7df2af2ac7/juju_adaptor.go:15\ngithub.com/pingcap/tidb-lightning/lightning/common.SQLWithRetry.Exec.func1\n\t/home/jenkins/agent/workspace/lightning_ghpr_test/go/src/github.com/pingcap/tidb-lightning/lightning/common/util.go:187\ngithub.com/pingcap/tidb-lightning/lightning/common.Retry\n\t/home/jenkins/agent/workspace/lightning_ghpr_test/go/src/github.com/pingcap/tidb-lightning/lightning/common/util.go:125\ngithub.com/pingcap/tidb-lightning/lightning/common.SQLWithRetry.perform\n\t/home/jenkins/agent/workspace/lightning_ghpr_test/go/src/github.com/pingcap/tidb-lightning/lightning/common/util.go:110\ngithub.com/pingcap/tidb-lightning/lightning/common.SQLWithRetry.Exec\n\t/home/jenkins/agent/workspace/lightning_ghpr_test/go/src/github.com/pingcap/tidb-lightning/lightning/common/util.go:185\ngithub.com/pingcap/tidb-lightning/lightning/glue.(*ExternalTiDBGlue).ExecuteWithLog\n\t/home/jenkins/agent/workspace/lightning_ghpr_test/go/src/github.com/pingcap/tidb-lightning/lightning/glue/glue.go:72\ngithub.com/pingcap/tidb-lightning/lightning/restore.(*restoreSchemaWorker).run.func1\n\t/home/jenkins/agent/workspace/lightning_ghpr_test/go/src/github.com/pingcap/tidb-lightning/lightning/restore/restore.go:414\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357\ncreate view failed\nrestore view schema v failed"]
I think this is caused by a create table/view file containing multiple statements; they should be executed serially in the same session.
@glorv @lance6716 thx guys!
/run-all-tests
@glorv @lance6716 PTAL, the CI test still fails, but my local integration test passed and there is no error message in lightning.log. What is the reason?
output failed on test `various_types` because OUTPUT DOES NOT CONTAIN 'd: 18446744073709551616.0', where the output is:
[2020-12-13T11:18:02.436Z] [Sun Dec 13 19:18:02 CST 2020] Executing SQL: SELECT a, b, c, d FROM vt.precise_types
[2020-12-13T11:18:02.436Z] *************************** 1. row ***************************
[2020-12-13T11:18:02.436Z] a: 18446744073709551614
[2020-12-13T11:18:02.436Z] b: -9223372036854775806
[2020-12-13T11:18:02.436Z] c: 99999999999999999999.0
[2020-12-13T11:18:02.436Z] d: 1.8e19
But I cannot reproduce it in my local integration test; I am very confused... Any ideas?
I'm checking whether it's a bug in TiDB at that version.
Sorry, it's a bug in TiDB at git hash 06cd92e05f0dfff1a139c1e5baca2ee24fb387b2; we might wait for TiDB to fix it.
Great job! I am fixing a log message error (a small fix). If you have time, please help me review the code, thx! xD
@lance6716 new CI error below:
[2020-12-15T10:17:15.144Z] [2020/12/15 18:17:12.472 +08:00] [ERROR] [restore.go:432] ["execute SQL: CREATE ALGORITHM = UNDEFINED DEFINER = `root`@`192.168.198.178` SQL SECURITY DEFINER VIEW `db0`.`v2` (`s`) AS SELECT `s` FROM `db1`.`v1` WHERE `i`<2; failed"] [table=`db0`.`v2`] [takeTime=744.654µs] [error="create view failed: Error 1146: Table 'db1.v1' doesn't exist"]
It happened when restoring view schemas from the data source tests/view/data. I believe the reason is that the restore job of view `db0`.`v2` does not wait for view `db1`.`v1` to be restored. Maybe the restore jobs of views should not run concurrently 😕?
BTW, I cannot reproduce the error in my local integration test (TEST_NAME=view).
https://github.com/pingcap/tidb-lightning/pull/502#discussion_r534814710
Please follow the comment above: there are 3 stages that support internal concurrency, but each whole stage should be finished before entering the next one.
Yes, I tried that first. We found https://github.com/pingcap/tidb-lightning/pull/502#issuecomment-742934293. Then I followed that comment to hold the whole statements of one `dbMeta`, and we had a new discussion at https://github.com/pingcap/tidb-lightning/pull/502#discussion_r542218427. Now the latest implementation is: after the db & table restore-schema jobs are done, the restore-view-schema jobs execute concurrently. PS: the statements of one restore-view-schema job are executed serially. I think we should not go back to step one for now; what do you think?
@glorv Please join us when you have time.
@lance6716 new CI error; it seems not a restore-schema error? It needs checking.
/run-all-tests
Seems it's an unstable integration test; I'll take a look at it.
/reward 600
This PR does not have any linked issue.
/reward 600
You are not the mentor for the linked issue.
Wait for the first mentor to /reward this PR, @glorv, to let the bot stop marking give-up every 7 days.
/reward 600
Reward success.
/run-all-tests
almost LGTM.

- worker may depend on only one way of exit (only considering errors caused by itself, not including parent context done): the context's cancel function, or closing `jobCh` (I prefer the latter)
- `errCh` could be replaced by an atomic variable to reduce complexity, like https://github.com/pingcap/tidb-lightning/blob/b23840d3fc3c61a2517e2cf97ed03b5fc302faf7/lightning/restore/restore.go#L839

OK to not change if you reasonably explain the current implementation.
For this implementation, the core pattern is producer/consumer. The difference is that besides using `jobCh` as a queue to deliver job messages, I also use `errCh` to deliver exception messages that notify the other goroutines to exit, and the `context.Done()` signal is monitored so the program can exit properly. Therefore, when `errCh` or the `context.Done()` channel receives a message, the entire program should exit. In addition to delivering job messages, `jobCh` also serves as the exit signal for the consumer goroutines: when all messages in `jobCh` are exhausted and the jobs are marked as completed by `wg.Done()`, the program exits normally by closing the `jobCh` channel. Unfortunately, the consumer goroutines may also be terminated by exceptions or by context cancellation/timeout; at that moment there may still be unconsumed messages in `jobCh` which have been marked as pending by `wg.Add()`, so in the `wait()` method a goroutine monitoring `waitCh` is added to prevent blocking forever. Correspondingly, in the consumer procedure `doJob()`, an exception or context cancellation/timeout only interrupts the consumption of `jobCh` messages and the spin loop of job execution; finally, all jobs marked as pending are marked as completed. When all job messages are correctly consumed and execution finishes, the producer procedure `makeJobs()` returns the `err` value and `jobCh` is closed, then all consumer goroutines exit correctly. I think the current implementation is good enough with sufficient comments; there is no need to change the concurrency pattern.
LGTM
A few trivial improvements; the rest LGTM.
Added unit tests, PTAL.
LGTM
@hidehalo, Congratulations, you get 600 in this PR, and your total score is 600 in high-performance challenge program.
What problem does this PR solve?
Issue Number: close #434
What is changed and how it works?
- `schemaStmt` holds one statement (create db|table|view)
- `schemaJob` holds the whole statements of one restore schema job
- `restoreSchemaWorker` produces an async goroutine to create restore schema jobs (as producer), with "hardcoded" concurrency (16 goroutines) when `restoreSchema#doJob` is called (as consumer)

Benchmark
(tests/restore/run.sh, $TABLE_COUNT=300) time costs are reported below:
Before
After
PS: the benchmark was run against a non-clustered TiDB; it may mean that the single TiDB node, acting as both DDL owner and non-owner, stalls the whole pipeline. We should benchmark again on a TiDB cluster (multiple DDL owners/non-owners).
-------- Update ---------
Benchmark: 1 PD | 3 TiDB | 4 × TiKV cluster (single machine), preset:
Concurrency
Tests
Side effects
Related changes