pingcap / tispark

TiSpark is built for running Apache Spark on top of TiDB/TiKV
Apache License 2.0
880 stars 244 forks source link

Do not reload table schema when update catalog cache #2667

Closed shiyuhang0 closed 1 year ago

shiyuhang0 commented 1 year ago

What problem does this PR solve?

TiSpark will reload all table schemas when init catalogCache. It will also reload all table schemas once we call listTable or getTable with the invalid cache.

One user reports that TiSpark hangs with more than 10000+ table schemas.

What is changed and how it works?

Add a configuration spark.tispark.load_tables , default is true. when set it to false, we will not load all tables when we reload the cache. table cache will only be loaded when the table is used.

Check List

Tests

ti-chi-bot commented 1 year ago

[REVIEW NOTIFICATION]

This pull request has been approved by:

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment. After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review. Reviewer can cancel approval by submitting a request changes review.
shiyuhang0 commented 1 year ago

@birdstorm PTAL It is a simple PR.

I think we can just reload db cache when the schema version has changed. GetTable or ListTable method will find the table cache is empty, then just load all tables in its' own db rather than all dbs. These will reduce the pressure on users with a large number of tables.

I find you use relaod(true) which will reload all table scehmas in the following PR. I am afraid I miss something which needs it. https://github.com/pingcap/tispark/pull/1623 https://github.com/pingcap/tispark/pull/2106

could you please take a look at this PR:

  1. Is there anything I miss? Can I just reload without reloading the table cache?
  2. Do you think we need a configuration to determine whether load tables cache?
birdstorm commented 1 year ago

@birdstorm It is a simple PR.

I think we can just reload db cache when the schema version has changed. GetTable or ListTable method will find the table cache is empty, then just load all tables in its' own db rather than all dbs. These will reduce the pressure on users with a large number of tables.

I find you use relaod(true) which will reload all table scehmas in the following PR. I am afraid I miss something which needs it. https://github.com/pingcap/tispark/pull/1623 https://github.com/pingcap/tispark/pull/2106

could you please take a look at this PR:

  1. Is there anything I miss? Can I just reload without reloading the table cache?
  2. Do you think we need a configuration to determine whether load tables cache?

@shiyuhang0 have you tested the case where there is rapid change in the schema version? e.g., Many tables are created and deleted at the same time, while the queries randomly occur on those tables? We added the function to reload catalog on tables because when we tested previously, it is possible for some of the queries to return “table not exist“ error even if the tables exist.

birdstorm commented 1 year ago

In theory, we do not need to reload tables every time because we should be able to resolve the ”table not exist” error in our error handler. It is just a reminder that the mentioned case should be tested and I will approve this PR considering that.

ti-chi-bot commented 1 year ago

@birdstorm: Thanks for your review. The bot only counts approvals from reviewers and higher roles in list, but you're still welcome to leave your comments.

In response to [this](https://github.com/pingcap/tispark/pull/2667#pullrequestreview-1354277443): >LGTM Instructions for interacting with me using PR comments are available [here](https://prow.tidb.net/command-help). If you have questions or suggestions related to my behavior, please file an issue against the [ti-community-infra/tichi](https://github.com/ti-community-infra/tichi/issues/new?title=Prow%20issue:) repository.
shiyuhang0 commented 1 year ago

@birdstorm It is a simple PR. I think we can just reload db cache when the schema version has changed. GetTable or ListTable method will find the table cache is empty, then just load all tables in its' own db rather than all dbs. These will reduce the pressure on users with a large number of tables. I find you use relaod(true) which will reload all table scehmas in the following PR. I am afraid I miss something which needs it.

1623

2106

could you please take a look at this PR:

  1. Is there anything I miss? Can I just reload without reloading the table cache?
  2. Do you think we need a configuration to determine whether load tables cache?

@shiyuhang0 have you tested the case where there is rapid change in the schema version? e.g., Many tables are created and deleted at the same time, while the queries randomly occur on those tables? We added the function to reload catalog on tables because when we tested previously, it is possible for some of the queries to return “table not exist“ error even if the tables exist.

So, these PRs is aiming to fix the table cache miss (mostly because of the race I think). But, I can not find where is the race in the code :( I think it is necessary to have a test. Is the “table not exist“ easy to reproduce? I need to know if the race will not happen anymore or it is just not be reproduced

birdstorm commented 1 year ago

@birdstorm It is a simple PR. I think we can just reload db cache when the schema version has changed. GetTable or ListTable method will find the table cache is empty, then just load all tables in its' own db rather than all dbs. These will reduce the pressure on users with a large number of tables. I find you use relaod(true) which will reload all table scehmas in the following PR. I am afraid I miss something which needs it.

1623

2106

could you please take a look at this PR:

  1. Is there anything I miss? Can I just reload without reloading the table cache?
  2. Do you think we need a configuration to determine whether load tables cache?

@shiyuhang0 have you tested the case where there is rapid change in the schema version? e.g., Many tables are created and deleted at the same time, while the queries randomly occur on those tables? We added the function to reload catalog on tables because when we tested previously, it is possible for some of the queries to return “table not exist“ error even if the tables exist.

So, these PRs is aiming to fix the table cache miss. But, I can not find where is the race in the code :( I think it is necessary to have a test. Is the “table not exist“ easy to reproduce? I need to know if the race will not happen anymore or it is just not be reproduced

It should be easy to reproduce if the race still exists. After we have fixed the region cache issue, it is possible that the problem has been resolved since then. In either case, I believe the case should be re-tested in this PR to ensure that.

shiyuhang0 commented 1 year ago

/run-all-tests

shiyuhang0 commented 1 year ago

/run-all-tests

shiyuhang0 commented 1 year ago

/run-all-tests

shiyuhang0 commented 1 year ago

/run-all-tests

shiyuhang0 commented 1 year ago

/run-all-tests

shiyuhang0 commented 1 year ago

/merge

ti-chi-bot commented 1 year ago

This pull request has been accepted and is ready to merge.

Commit hash: d2b72970eb4225450c23f957b2bc921d9f9d6357

ti-chi-bot commented 1 year ago

In response to a cherrypick label: new pull request created to branch release-3.1: #2676.

ti-chi-bot commented 1 year ago

In response to a cherrypick label: new pull request created to branch release-3.2: #2677.

ti-chi-bot commented 1 year ago

In response to a cherrypick label: new pull request created to branch release-3.0: #2678.