Bug Report: RENAME TABLE issue while executing Online DDL with --postpone-completion option using "Vitess" strategy

shrutishrm512 commented 1 year ago

Overview of the Issue

I am facing RENAME table issue while running Online DDl with "vitess" strategy with postpone completion option (Testing Postponed migrations : https://vitess.io/docs/16.0/user-guides/schema-changes/postponed-migrations/)

Alter is not getting completed on all the shards while doing cutover.

Reproduction Steps

Initiated online ddl- Alter on a table on a 2 shards cluster

mysql> set @@ddl_strategy='vitess --postpone-completion';
mysql> ALTER TABLE participants ADD COLUMN code bigint(20) DEFAULT NULL;
+--------------------------------------+
| uuid                                 |
+--------------------------------------+
| efb23f12_e822_11ed_bb83_0a629da756d6 |
+--------------------------------------+
1 row in set (0.00 sec)
mysql> show vitess_migrations like 'efb23f12_e822_11ed_bb83_0a629da756d6' \G
mysql> alter vitess_migration 'efb23f12_e822_11ed_bb83_0a629da756d6' complete;

Original Alter got completed successfully.

Reverted the above migration again with postpone completion option

mysql> set @@ddl_strategy='vitess --postpone-completion';
mysql> revert vitess_migration 'efb23f12_e822_11ed_bb83_0a629da756d6';
+--------------------------------------+
| uuid                                 |
+--------------------------------------+
| 559a2aea_e8c1_11ed_8a33_0a629da756d6 |
+--------------------------------------+
1 row in set (0.00 sec)

This revet migration got completed on both the shards, but still showing migration_status as running as complete command did not get fired
After firing mysql> alter vitess_migration '559a2aea_e8c1_11ed_8a33_0a629da756d6' complete; On 1 shard, migration got completed and on another shard, it got stucked with this error messsage Getting error while completing
```
ddl_action: alter
                    message: timeout for rename query: RENAME TABLE `participants` TO `_vt_HOLD_bef655f6e8d211eda1e90292d32412fc_20230503101856`, `_efb23f12_e822_11ed_bb83_0a629da756d6_20230501185027_vrepl` TO `participants`, `_vt_HOLD_bef655f6e8d211eda1e90292d32412fc_20230503101856` TO `_efb23f12_e822_11ed_bb83_0a629da756d6_20230501185027_vrepl`
```
As checked vtHold table was present in the shard, still it is not able to rename table and vttablet logs not showing proper reason for cut over to get failed

7.It is not even retrying again to rename the table, still showing same error message

I have executed same rename table query directly on the shard, it got executed. But still migration is in running state

*************************** 2. row ***************************
                             id: 18
                 migration_uuid: 559a2aea_e8c1_11ed_8a33_0a629da756d6
                       keyspace: keyspace_sharded
                          shard: -80
                   mysql_schema: vt_keyspace_sharded
                    mysql_table: participants
            migration_statement: revert vitess_migration 'efb23f12_e822_11ed_bb83_0a629da756d6'
                       strategy: online
                        options: --postpone-completion
                added_timestamp: 2023-05-02 13:44:18
            requested_timestamp: 2023-05-02 13:44:18
                ready_timestamp: NULL
              started_timestamp: 2023-05-02 13:44:19
             liveness_timestamp: 2023-05-02 15:48:58
            completed_timestamp: NULL
              cleanup_timestamp: NULL
               migration_status: running
                       log_path: 
                      artifacts: _vt_HOLD_bef655f6e8d211eda1e90292d32412fc_20230503101856,_efb23f12_e822_11ed_bb83_0a629da756d6_20230501185027_vrepl,
                        retries: 0
                         tablet: dc1-0000000200
                 tablet_failure: 0
                       progress: 0
              migration_context: vtgate:e4f72c22-e8c0-11ed-8a33-0a629da756d6
                     ddl_action: alter
                        message: timeout for rename query: RENAME TABLE `participants` TO `_vt_HOLD_bef655f6e8d211eda1e90292d32412fc_20230503101856`, `_efb23f12_e822_11ed_bb83_0a629da756d6_20230501185027_vrepl` TO `participants`, `_vt_HOLD_bef655f6e8d211eda1e90292d32412fc_20230503101856` TO `_efb23f12_e822_11ed_bb83_0a629da756d6_20230501185027_vrepl`
                    eta_seconds: -1
                    rows_copied: 0
                     table_rows: 25847091
              added_unique_keys: 0
            removed_unique_keys: 0
                       log_file: 
       retain_artifacts_seconds: 86400
            postpone_completion: 0
       removed_unique_key_names: 
dropped_no_default_column_names: 
          expanded_column_names: 
               revertible_notes: 
               allow_concurrent: 0
                  reverted_uuid: efb23f12_e822_11ed_bb83_0a629da756d6
                        is_view: 0
              ready_to_complete: 1
                 stowaway_table: 
      vitess_liveness_indicator: 1683022736
            user_throttle_ratio: 0
                   special_plan: 
       last_throttled_timestamp: NULL
            component_throttled: 
            cancelled_timestamp: NULL
                postpone_launch: 0
                          stage: validating rename is still in place
               cutover_attempts: 1
         is_immediate_operation: 0
             reviewed_timestamp: 2023-05-02 13:44:19
2 rows in set (0.00 sec)

### Binary Version

```sh
Vitess Version: 16.0.1 (Git revision d1ba6258ea2462d5d28d67661aace7b79bb7e27b branch 'HEAD')

Operating System and Environment details

CentOS

Log Fragments

E0502 15:49:05.469918    4614 dbconn.go:436] Could not kill query ID 21 : Unknown thread id: 21 (errno 1094) (sqlstate HY000) during query: kill 21
E0502 15:49:05.472017    4614 executor.go:3630] cutOverVReplMigration failed: err=Code: ABORTED
timeout for rename query: RENAME TABLE `txn_participants` TO `_vt_HOLD_bef655f6e8d211eda1e90292d32412fc_20230503101856`, `_efb23f12_e822_11ed_bb83_0a629da756d6_20230501185027_vrepl` TO `participants`, `_vt_HOLD_bef655f6e8d211eda1e90292d32412fc_20230503101856` TO `_efb23f12_e822_11ed_bb83_0a629da756d6_20230501185027_vrepl`
E0502 15:49:05.472031    4614 executor.go:3957] Code: ABORTED
timeout for rename query: RENAME TABLE `participants` TO `_vt_HOLD_bef655f6e8d211eda1e90292d32412fc_20230503101856`, `_efb23f12_e822_11ed_bb83_0a629da756d6_202client_loop: send disconnect: Broken pipe `_vt_HOLD_bef655f6e8d211eda1e90292d32412fc_20230503101856` TO `_efb23f12_e822_11ed_bb83_0a629da756d6_2023050118'

shlomi-noach commented 1 year ago

This revet migration got completed on both the shards, but still showing migration_status as running as complete command did not get fired

@shrutishrm512 can you please clarify, when you say the migration got completed -- was the table schema for table participants revert to before having the new code column?

7.It is not even retrying again to rename the table, still showing same error message

It looks to me like the revert migration was not completed. What is the table schema at this time?

shrutishrm512 commented 1 year ago

9. This revet migration got completed on both the shards, but still showing migration_status as running as complete command did not get fired

@shrutishrm512 can you please clarify, when you say the migration got completed -- was the table schema for table participants revert to before having the new code column?

7.It is not even retrying again to rename the table, still showing same error message

It looks to me like the revert migration was not completed. What is the table schema at this time?

@shlomi-noach It was showing progress as 100 and read_to_complete as 1, but migration_status is running. Table schema was not reverted to old one. Yes, it is not completed on that shard, got failed while renaming the table. Table Schema was at that time

CREATE TABLE `participants` (
  `id` bigint NOT NULL,
  `name` varchar(100) DEFAULT NULL,
  `user_id` bigint DEFAULT NULL,
  `address_type` varchar(50) DEFAULT NULL,
  `created_on` timestamp(3) NOT NULL DEFAULT CURRENT_TIMESTAMP(3),
  `updated_on` timestamp(3) NOT NULL DEFAULT CURRENT_TIMESTAMP(3) ON UPDATE CURRENT_TIMESTAMP(3),
  `code`  bigint(20) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_participants_created_on` (`created_on`),
  KEY `idx_participants_updated_on` (`updated_on`),
  KEY `idx_user_id_created_on` (`user_id`,`created_on`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

shlomi-noach commented 1 year ago

OK, thank you, this is less alarming to me now. So both migrations are incomplete, still running, but repeatedly unable to cut-over. Sorry I didn't ask before -- but can you look at the vttablet logs for the PRIMARY (either one of the shards where it fails) and look for the entries around that failure please?

shrutishrm512 commented 1 year ago

OK, thank you, this is less alarming to me now. So both migrations are incomplete, still running, but repeatedly unable to cut-over. Sorry I didn't ask before -- but can you look at the vttablet logs for the PRIMARY (either one of the shards where it fails) and look for the entries around that failure please?

Primary vttablet logs are present in log fragments.

shlomi-noach commented 1 year ago

Not sure how do you mean "present in log fragments"?

shlomi-noach commented 1 year ago

If logs are available from around the cut-over time where the RENAME fails, can you please paste them?

shlomi-noach commented 1 year ago

Primary vttablet logs are present in log fragments.

Oh you mean in the section ":Log Fragments" in the original comment. Thank you. I'm looking for a bit more around that time, if there's anything at all? Also, I presume these logs are recurring? I expect the error to reappear every minute or so in the logs?

shrutishrm512 commented 1 year ago

Primary vttablet logs are present in log fragments.

Oh you mean in the section ":Log Fragments" in the original comment. Thank you. I'm looking for a bit more around that time, if there's anything at all? Also, I presume these logs are recurring? I expect the error to reappear every minute or so in the logs?

Actually, from logs it looks like it had made only two attempts. There were no logs after this timestamp"15:49:05.472031" Below logs were only visible related to the migration.

E0502 15:49:05.469918    4614 dbconn.go:436] Could not kill query ID 21 : Unknown thread id: 21 (errno 1094) (sqlstate HY000) during query: kill 21
E0502 15:49:05.472017    4614 executor.go:3630] cutOverVReplMigration failed: err=Code: ABORTED
timeout for rename query: RENAME TABLE `txn_participants` TO `_vt_HOLD_bef655f6e8d211eda1e90292d32412fc_20230503101856`, `_efb23f12_e822_11ed_bb83_0a629da756d6_20230501185027_vrepl` TO `participants`, `_vt_HOLD_bef655f6e8d211eda1e90292d32412fc_20230503101856` TO `_efb23f12_e822_11ed_bb83_0a629da756d6_20230501185027_vrepl`
E0502 15:49:05.472031    4614 executor.go:3957] Code: ABORTED
timeout for rename query: RENAME TABLE `participants` TO `_vt_HOLD_bef655f6e8d211eda1e90292d32412fc_20230503101856`, `_efb23f12_e822_11ed_bb83_0a629da756d6_202client_loop: send disconnect: Broken pipe `_vt_HOLD_bef655f6e8d211eda1e90292d32412fc_20230503101856` TO `_efb23f12_e822_11ed_bb83_0a629da756d6_2023050118'

shlomi-noach commented 1 year ago

OK, thank you. I'm going to need more time to reproduce this. The main question is not why the RENAME timed out, but rather why wasn't there a retry.

shlomi-noach commented 1 year ago

Actually, the RENAME apparently never took place. The timeout is with waiting for the RENAME to appear in processlist.

shlomi-noach commented 1 year ago

To reproduce the issue I'll just run an early return before even waiting for processlist, and see whether there's no cut-over retries, and if so, why.

shlomi-noach commented 1 year ago

I'm unable to reproduce. In my experiments, the Online DDL executor correctly retries the migration even if the RENAME statement times out. I am not sure why in the original issue's scenario the migration was not retried.

shrutishrm512 commented 1 year ago

I'm unable to reproduce. In my experiments, the Online DDL executor correctly retries the migration even if the RENAME statement times out. I am not sure why in the original issue's scenario the migration was not retried.

Let me try reproduce this issue agin, will post it here again.

vitessio / vitess