pingcap / tispark

TiSpark is built for running Apache Spark on top of TiDB/TiKV
Apache License 2.0
880 stars 243 forks

High `getregion` operations may cause TiSpark jobs to fail with EpochNotMatch #2760

Open King-Dylan opened 10 months ago


Enhancement

For a cluster with a huge number of regions, a high rate of `getregion` operations can prevent the region cache from being updated in time. TiSpark may then fail with an EpochNotMatch error when it needs to access a newly split region.

The high `getregion` rate also drives up PD CPU usage. Because of the high CPU usage and contention on the lock taken by `getregion`, PD cannot process region heartbeats in time. If a region is split or merged during this window, an error like the following may occur:

"EpochNotMatch current epoch of region xxxxx is conf_ver: 362996 version: 168541, but you sent conf_ver: 362996 version: 168540"
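The failure mode above can be modeled with a small Python sketch. It is a simplified model under stated assumptions, not TiSpark, TiKV, or PD code: `RegionEpoch`, `check_epoch`, `RegionCache`, and `read_with_retry` are hypothetical names; the store-side check compares only `version` (the epoch field bumped by split and merge, while `conf_ver` tracks peer membership changes); and "PD" is just a lookup that may lag behind the store when heartbeats are not processed in time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionEpoch:
    conf_ver: int  # bumped on peer membership (conf) changes
    version: int   # bumped on region split or merge


class EpochNotMatch(Exception):
    pass


def check_epoch(current: RegionEpoch, sent: RegionEpoch, region_id: int) -> None:
    """Hypothetical store-side check: reject a request carrying a stale version."""
    if sent.version != current.version:
        raise EpochNotMatch(
            f"current epoch of region {region_id} is conf_ver: {current.conf_ver} "
            f"version: {current.version}, but you sent conf_ver: {sent.conf_ver} "
            f"version: {sent.version}"
        )


class RegionCache:
    """Hypothetical client-side cache; pd_lookup stands in for a PD GetRegion call."""

    def __init__(self, pd_lookup):
        self._pd_lookup = pd_lookup
        self._cache = {}
        self.pd_calls = 0  # counts how much load the client puts on "PD"

    def get(self, region_id):
        if region_id not in self._cache:
            self.pd_calls += 1
            self._cache[region_id] = self._pd_lookup(region_id)
        return self._cache[region_id]

    def invalidate(self, region_id):
        self._cache.pop(region_id, None)


def read_with_retry(cache, store_epochs, region_id, max_retries=3):
    """Send a request; on EpochNotMatch drop the cached entry and ask PD again."""
    last_err = None
    for _ in range(max_retries):
        sent = cache.get(region_id)
        try:
            check_epoch(store_epochs[region_id], sent, region_id)
            return sent
        except EpochNotMatch as err:
            last_err = err
            cache.invalidate(region_id)  # forces a fresh PD lookup on the next attempt
    raise last_err


if __name__ == "__main__":
    store = {42: RegionEpoch(362996, 168541)}  # region already split on the store
    pd = {42: RegionEpoch(362996, 168540)}     # PD has not processed the heartbeat yet
    cache = RegionCache(lambda rid: pd[rid])
    try:
        read_with_retry(cache, store, 42)
    except EpochNotMatch as err:
        print("job fails:", err)
    print("PD lookups issued:", cache.pd_calls)
```

In this model, while PD lags behind the store, every retry re-fetches the same stale epoch and issues one more lookup, so blind retries add load to PD without resolving the error; the request only succeeds once PD's view catches up with the store.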
