Lily2025 opened this issue 1 year ago
/remove-area dm
/area ticdc
/type question
@Lily2025: These labels are not set on the issue: type/bug
From @asddongmen: this is by design. First of all, after the error injection, the connection between cdc and PD is broken and the session times out, so the capture fails. The capture thread then automatically tries to rebuild itself. During the rebuild it connects to PD several times, and some of those calls use timeouts set inside cdc; when one of these cdc-side timeouts fires, it only causes the capture thread to retry again. The restart happens because, while reconnecting to PD, the capture goes through several operations and eventually reaches the PDClient.GetTs interface. That interface sets its own timeout and returns an error that cdc does not handle, so the error is thrown outside the capture thread and triggers a restart at the server process level.
Why I'm not going to add this error to the capture thread's internal retry list: it is actually hard to reach this step, and the error also indicates a connectivity problem between CDC and PD. In that situation, restarting the process directly and refreshing the in-memory state may recover from the error faster. (The logs also show that the server recovered immediately after the restart without any worse symptoms, so I think keeping the status quo is a good choice.)
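To make that restart path concrete, here is a minimal, self-contained Go sketch (not the actual TiCDC code; `runCapture`, `pdGetTS`, and `errSessionDone` are hypothetical names) of how an error from a PD GetTs call that enforces its own timeout can bypass the capture loop's internal retry list and surface as a process-level exit:

```go
// Sketch only: illustrates the control flow described above, not TiCDC's real code.
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
	"time"
)

// errSessionDone stands in for the session-timeout errors that the capture
// loop is willing to retry internally (hypothetical name).
var errSessionDone = errors.New("capture session expired")

// pdGetTS stands in for a PD GetTs call: the client applies its own timeout
// and returns an error the capture loop does not classify as retryable.
func pdGetTS(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	<-ctx.Done() // under a network partition the request never completes
	return fmt.Errorf("get timestamp failed: %w", ctx.Err())
}

func runCapture(ctx context.Context) error {
	// Rebuilding the capture requires a fresh timestamp from PD.
	if err := pdGetTS(ctx); err != nil {
		return err
	}
	return nil
}

func main() {
	ctx := context.Background()
	for {
		err := runCapture(ctx)
		if errors.Is(err, errSessionDone) {
			// Known error: rebuild the capture thread and try again.
			continue
		}
		// Unrecognized error (e.g. the GetTs timeout): it escapes the
		// capture loop and the whole server process restarts.
		fmt.Fprintln(os.Stderr, "capture exited:", err)
		os.Exit(1)
	}
}
```

In the real code the retryable list would cover the capture session errors; the point is only that an error outside that list escapes the capture thread and is handled at the process level.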
@Lily2025 What is the CDC lag in this scenario?
@Lily2025 Thank you for your detailed feedback. I will review the code to determine if it's feasible.
What did you do?
1. run workload storage:"s3://benchmark/sysbench_64_7000w" subType:"oltp_read_write" db:"sysbench_64_7000w" tableNum:64 tableSize:70000000 threads:32 ignoreErrors:"2013,1213,1105,1205,8022,8028,9004,9007,1062"
2. inject a network partition between one of the ticdc pods and the other pods (a sketch of one possible way to do this is shown below)
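For step 2, a hypothetical helper (not the harness actually used in this report) showing one way such a partition could be injected from inside the TiCDC pod, assuming iptables is available and the peer pod IPs are known:

```go
// Sketch only: peer addresses and the injection method are illustrative.
package main

import (
	"fmt"
	"os/exec"
)

// partition drops all inbound and outbound traffic to each peer address.
func partition(peers []string) error {
	for _, ip := range peers {
		for _, args := range [][]string{
			{"-A", "INPUT", "-s", ip, "-j", "DROP"},
			{"-A", "OUTPUT", "-d", ip, "-j", "DROP"},
		} {
			if out, err := exec.Command("iptables", args...).CombinedOutput(); err != nil {
				return fmt.Errorf("iptables %v failed: %v (%s)", args, err, out)
			}
		}
	}
	return nil
}

func main() {
	// Replace with the real PD/TiKV/TiCDC pod IPs for the cluster under test.
	if err := partition([]string{"10.0.0.11", "10.0.0.12"}); err != nil {
		fmt.Println(err)
	}
}
```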
What did you expect to see?
What did you see instead?
1. one of the ticdc instances restarted
Versions of the cluster
git hash: 25ce29c2a1802bbb4cd26008f322728959a91f7a