Cli(ticdc): Add safepoint command support

Is your feature request related to a problem?

At some moment, the GC safe point is at 10 and a changefeed has synced to 20. The changefeed then fails and stops, leaving behind a service safe point with a value of 20 and a TTL of 24 hours. Meanwhile, the service safe point of the GC worker keeps advancing normally and the GC safe point reaches 30. The changefeed's service safe point blocks GC only until its 24-hour TTL expires. Once it has expired, manually extending the GC lifetime in an attempt to recover the changefeed does not help: GC is determined to be no longer blocked, so it is unsafe to start the changefeed after the 24-hour TTL.

In some cases, updating tidb_gc_life_time cannot stop GC, because the behavior also depends on gc-ttl. We therefore need a new command that prolongs the TTL of the changefeed's service safe point to avoid data loss.

Describe the feature you'd like

We can add a service safe point at the changefeed's safe point. This new service safe point carries a longer TTL so that it keeps blocking GC. We can delete it as soon as it is no longer needed, and it would also be useful to be able to query all service safe points.

To address this, we add a new CLI command, cdc cli safepoint, with three subcommands:

cdc cli safepoint set: create a user-defined service safe point to block GC.
cdc cli safepoint delete: delete a user-defined service safe point.
cdc cli safepoint query: query all service safe points.

When a changefeed fails, users can run cdc cli safepoint set to create a new service safe point with a long TTL to block GC. Once the problem is fixed, they can run cdc cli safepoint delete to remove that service safe point, and cdc cli safepoint query to check that the operations took effect.

Note: Users can only create and delete service safe points that they defined themselves.

Set

cdc cli safepoint set service-id-suffix=xxx start-ts=xxx ttl=xxx
# {
#   "service_gc_safe_points": [
#     {
#       "service_id": "gc_worker",
#       "expired_at": 9223372036854775807,
#       "safe_point": 451519635657850880
#     },
#     {
#       "service_id": "ticdc-default-15674009460217235928",
#       "expired_at": 1722651185,
#       "safe_point": 451560023174938623
#     }
#   ],
#   "min_service_gc_safe_point": 451519635657850880,
#   "gc_safe_point": 451519635657850880
# }

service-id-suffix: The suffix used to build the service ID of the user-created service safe point. TiCDC generates a service ID in the format ticdc-clusterID-etcdClusterID; the suffix is appended to it, giving ticdc-clusterID-etcdClusterID-service-id-suffix. The default suffix is "user-defined".

start-ts: The timestamp of the safe point to hold. It must be greater than or equal to minServiceSafePoint; otherwise, an error is reported.

ttl: The protection period in seconds; setting it on an existing service safe point updates the period. It must be greater than 0. The default value is 86400 seconds (24 hours).
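
The parameter rules above can be read as a small validation step. The following Python sketch is only an illustration under stated assumptions: the helper name build_set_request, its arguments, and the returned dictionary are hypothetical and are not part of TiCDC; only the ID format, the default suffix, the default TTL, and the two checks come from the description above.

```python
DEFAULT_SUFFIX = "user-defined"  # default service-id-suffix from the text
DEFAULT_TTL = 86400              # default ttl: 24 hours, in seconds


def build_set_request(cluster_id, etcd_cluster_id, start_ts,
                      min_service_safe_point,
                      suffix=DEFAULT_SUFFIX, ttl=DEFAULT_TTL):
    """Validate the `set` parameters and compose the service ID.

    Hypothetical helper: the name and the return shape are illustrative only.
    """
    if ttl <= 0:
        raise ValueError("ttl must be greater than 0")
    if start_ts < min_service_safe_point:
        raise ValueError("start-ts must be >= minServiceSafePoint")
    # ticdc-clusterID-etcdClusterID-service-id-suffix
    service_id = f"ticdc-{cluster_id}-{etcd_cluster_id}-{suffix}"
    return {"service_id": service_id, "safe_point": start_ts, "ttl": ttl}
```

With the values from the sample output, build_set_request("default", 15674009460217235928, 451560023174938623, 451519635657850880) would produce a service ID ending in "-user-defined", while a start-ts below the minimum service safe point or a non-positive ttl is rejected.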

Delete

cdc cli safepoint delete service-id-suffix=xxx start-ts=xxx
# equivalent to `cdc cli safepoint set service-id-suffix=xxx start-ts=xxx ttl=0`
# {
#   "service_gc_safe_points": [
#     {
#       "service_id": "gc_worker",
#       "expired_at": 9223372036854775807,
#       "safe_point": 451519635657850880
#     },
#     {
#       "service_id": "ticdc-default-15674009460217235928",
#       "expired_at": 1722651185,
#       "safe_point": 451560023174938623
#     }
#   ],
#   "min_service_gc_safe_point": 451519635657850880,
#   "gc_safe_point": 451519635657850880
# }

Query

cdc cli safepoint query --pd http://localhost:2379 [--cdc]
# {
#   "service_gc_safe_points": [
#     {
#       "service_id": "gc_worker",
#       "expired_at": 9223372036854775807,
#       "safe_point": 451519635657850880
#     },
#     {
#       "service_id": "ticdc-default-15674009460217235928",
#       "expired_at": 1722651185,
#       "safe_point": 451560023174938623
#     }
#   ],
#   "min_service_gc_safe_point": 451519635657850880,
#   "gc_safe_point": 451519635657850880
# }
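
The raw numbers in this output are hard to read by eye. As a rough aid, the Python sketch below converts them to wall-clock times. It assumes the usual TiDB/PD TSO layout (physical milliseconds in the high bits, an 18-bit logical counter in the low bits), that expired_at is a Unix timestamp in seconds, and that the maximum int64 value means "never expires". The function names are hypothetical.

```python
from datetime import datetime, timezone

TSO_LOGICAL_BITS = 18          # assumed width of the logical part of a TSO
NEVER_EXPIRES = 2**63 - 1      # max int64, as shown for gc_worker


def tso_physical_ms(tso):
    """Return the physical (millisecond) part of a TSO."""
    return tso >> TSO_LOGICAL_BITS


def describe(entry):
    """Turn one service_gc_safe_points entry into readable UTC times."""
    safe_time = datetime.fromtimestamp(
        tso_physical_ms(entry["safe_point"]) / 1000, tz=timezone.utc)
    if entry["expired_at"] == NEVER_EXPIRES:
        expires = None  # no expiry
    else:
        expires = datetime.fromtimestamp(entry["expired_at"], tz=timezone.utc)
    return {"service_id": entry["service_id"],
            "safe_point_time": safe_time,
            "expires": expires}


if __name__ == "__main__":
    sample = [
        {"service_id": "gc_worker",
         "expired_at": 9223372036854775807,
         "safe_point": 451519635657850880},
        {"service_id": "ticdc-default-15674009460217235928",
         "expired_at": 1722651185,
         "safe_point": 451560023174938623},
    ]
    for e in sample:
        print(describe(e))
```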

API

GET /api/v2/safepoint: This is equivalent to the query operation, retrieving all service safe points.
POST /api/v2/safepoint: This is equivalent to the set operation, setting the safe point and TTL for the given service ID.
DELETE /api/v2/safepoint: This is equivalent to the delete operation, removing the safe point for the specified service ID.
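
As a rough illustration of how a client might address these three endpoints, the Python sketch below only builds the requests and does not send them. The request body is not specified in this proposal, so the JSON field names (service_id_suffix, start_ts, ttl), the server address, and the function names are all assumptions made for the example.

```python
import json
from urllib.request import Request

BASE_URL = "http://127.0.0.1:8300"        # assumed TiCDC server address
ENDPOINT = BASE_URL + "/api/v2/safepoint"  # path from the proposal


def _json_request(method, payload):
    # Hypothetical payload shape: field names mirror the CLI parameters.
    body = json.dumps(payload).encode("utf-8")
    return Request(ENDPOINT, data=body, method=method,
                   headers={"Content-Type": "application/json"})


def query_request():
    """Equivalent of `cdc cli safepoint query`."""
    return Request(ENDPOINT, method="GET")


def set_request(service_id_suffix, start_ts, ttl=86400):
    """Equivalent of `cdc cli safepoint set`."""
    return _json_request("POST", {"service_id_suffix": service_id_suffix,
                                  "start_ts": start_ts, "ttl": ttl})


def delete_request(service_id_suffix, start_ts):
    """Equivalent of `cdc cli safepoint delete`."""
    return _json_request("DELETE", {"service_id_suffix": service_id_suffix,
                                    "start_ts": start_ts})
```

Each returned object could be passed to urllib.request.urlopen against a running server once the real request format is known.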

Describe alternatives you've considered

The changefeed needs data beyond 20 to be retained, but its 24-hour protection period has already expired. The GC worker only needs to retain data beyond 30, so it sets its own service safe point to 30. As a result, the data between 20 and 30 has been nominally abandoned by the GC worker, even though the actual deletion has not been executed yet.
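
The effect of the expired TTL can be reproduced with a toy model. The Python sketch below assumes, as a simplification, that the minimum service safe point is the smallest safe point among entries whose TTL has not yet expired; the function name and the small integer timestamps are illustrative only and follow the 20/30 example above.

```python
NEVER_EXPIRES = 2**63 - 1  # gc_worker's entry has no expiry


def min_service_safe_point(entries, now):
    """Smallest safe point among entries that have not expired at `now`."""
    live = [e["safe_point"] for e in entries if e["expired_at"] > now]
    return min(live)


entries = [
    {"service_id": "gc_worker", "safe_point": 30,
     "expired_at": NEVER_EXPIRES},
    # Failed changefeed: safe point 20, TTL of 24 hours starting at time 0.
    {"service_id": "ticdc-changefeed", "safe_point": 20,
     "expired_at": 24 * 3600},
]

within_ttl = min_service_safe_point(entries, now=3600)       # 1 hour in
after_ttl = min_service_safe_point(entries, now=25 * 3600)   # 25 hours in
print(within_ttl, after_ttl)
```

Within the TTL the changefeed's entry holds the minimum at 20; once it expires, only the GC worker's entry remains and the minimum moves to 30, which is the situation a user-defined service safe point with a longer TTL is meant to prevent.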

Teachability, Documentation, Adoption, Migration Strategy

Documentation for the safepoint command will be added after this feature is implemented.
