github / gh-ost

GitHub's Online Schema-migration Tool for MySQL
MIT License
12.43k stars 1.26k forks source link

gh-ost exits after issuing cutover #574

Open raniejade opened 6 years ago

raniejade commented 6 years ago

Probably related to #403. Upon issuing a cut-over, gh-ost exits after several attempts to wait for AllEventsUpToLockProcessed event to appear. Have verified that AllEventsUpToLockProcessed is indeed written to the change log table.

This is how we run gh-ost.

real_migrate() {
  local dryrun=$1;
  echo "--- Real migration, dryrun = ${dryrun}"

  if [[ "${dryrun}" == false && ! -f "${CUTOVER_FILE}" ]]; then
    echo "--- Writing cutover flag file to ${CUTOVER_FILE}. "
    echo "--- Once migration is complete, rm this file to perform cutover"
    touch "${CUTOVER_FILE}"
  fi

  local ghost_opts=( --assume-rbr --verbose --skip-foreign-key-checks \
    --user="${DB_USER}" --password="${DB_PASSWORD}" \
    --host="${DB_MIGRATION_REPLICA}" --database="${DB_NAME}" --table="${DB_TABLE}" \
    --alter="${ALTER_STMT}" --assume-master-host="${DB_MASTER}"  --exact-rowcount \
    --concurrent-rowcount --initially-drop-ghost-table \
    --discard-foreign-keys --hooks-path="${HOOKS_PATH}" \
    --postpone-cut-over-flag-file="${CUTOVER_FILE}" \
    --throttle-flag-file="${THROTTLE_FILE}" \
    --heartbeat-interval-millis=200 --max-lag-millis=500 \
    --throttle-control-replicas="${DB_THROTTLE_CTL_REPLICA_0},${DB_THROTTLE_CTL_REPLICA_1},${DB_THROTTLE_CTL_REPLICA_2}" \
    --panic-flag-file="${PANIC_FILE}" )

  if [[ "${dryrun}" == false ]]; then
    ghost_opts+=( --execute )
  fi

  echo "--- Running ghost with params ${ghost_opts[@]}"
  ./gh-ost "${ghost_opts[@]}"
  echo "--- All done"
}
...

And the last few lines of the log file.

2018-04-18 06:22:26 INFO Tables unlocked
Copy: 874748309/874748309 100.0%; Applied: 4818139; Backlog: 0/1000; Time: 54h26m55s(total), 50h51m33s(copy); streamer: mysql-bin-changelog.074245:62075216; State: throttled, lag=1.214406s; ETA: due
2018-04-18 06:22:30 INFO Grabbing voluntary lock: gh-ost.8823949.lock
2018-04-18 06:22:30 INFO Setting LOCK timeout as 6 seconds
2018-04-18 06:22:30 INFO Looking for magic cut-over table
2018-04-18 06:22:30 INFO Creating magic cut-over table `media`.`_media_bundle_del`
2018-04-18 06:22:30 INFO Magic cut-over table created
2018-04-18 06:22:30 INFO Locking `media`.`media_bundle`, `media`.`_media_bundle_del`
2018-04-18 06:22:31 INFO Tables locked
2018-04-18 06:22:31 INFO Session locking original & magic tables is 8823949
2018-04-18 06:22:31 INFO Writing changelog state: AllEventsUpToLockProcessed:1524032551099143693
2018-04-18 06:22:31 INFO Waiting for events up to lock
Copy: 874748309/874748309 100.0%; Applied: 4818139; Backlog: 0/1000; Time: 54h27m0s(total), 50h51m33s(copy); streamer: mysql-bin-changelog.074245:62075216; State: migrating; ETA: due
2018-04-18 06:22:34 ERROR Timeout while waiting for events up to lock
2018-04-18 06:22:34 ERROR 2018-04-18 06:22:34 ERROR Timeout while waiting for events up to lock
2018-04-18 06:22:34 INFO Looking for magic cut-over table
2018-04-18 06:22:34 INFO Will now proceed to drop magic table and unlock tables
2018-04-18 06:22:34 INFO Dropping magic cut-over table
2018-04-18 06:22:34 INFO Releasing lock from `media`.`media_bundle`, `media`.`_media_bundle_del`
2018-04-18 06:22:34 INFO Tables unlocked
2018-04-18 06:22:35 INFO Grabbing voluntary lock: gh-ost.8823871.lock
2018-04-18 06:22:35 INFO Setting LOCK timeout as 6 seconds
2018-04-18 06:22:35 INFO Looking for magic cut-over table
2018-04-18 06:22:35 INFO Creating magic cut-over table `media`.`_media_bundle_del`
2018-04-18 06:22:35 INFO Magic cut-over table created
2018-04-18 06:22:35 INFO Locking `media`.`media_bundle`, `media`.`_media_bundle_del`
2018-04-18 06:22:35 INFO Tables locked
2018-04-18 06:22:35 INFO Session locking original & magic tables is 8823871
2018-04-18 06:22:35 INFO Writing changelog state: AllEventsUpToLockProcessed:1524032555651985046
2018-04-18 06:22:35 INFO Waiting for events up to lock
Copy: 874748309/874748309 100.0%; Applied: 4818139; Backlog: 0/1000; Time: 54h27m5s(total), 50h51m33s(copy); streamer: mysql-bin-changelog.074245:62075216; State: migrating; ETA: due
2018-04-18 06:22:38 ERROR Timeout while waiting for events up to lock
2018-04-18 06:22:38 ERROR 2018-04-18 06:22:38 ERROR Timeout while waiting for events up to lock
2018-04-18 06:22:38 INFO Looking for magic cut-over table
2018-04-18 06:22:38 INFO Will now proceed to drop magic table and unlock tables
2018-04-18 06:22:38 INFO Dropping magic cut-over table
2018-04-18 06:22:38 INFO Removing socket file: /tmp/gh-ost.media.media_bundle.sock
2018-04-18 06:22:38 INFO Tearing down inspector
2018-04-18 06:22:38 FATAL 2018-04-18 06:22:38 ERROR Timeout while waiting for events up to lock

We are using MySQL 5.6.34 on RDS, 1 master and 3 read replicas plus 1 more replica dedicated for the gh-ost migration.

We are using the default value for CutOverLockTimeoutSeconds (which is 3 seconds), but I don't think the timeout being too low is the issue. I did some digging and looks like gh-ost doesn't receive any bin_log events after issuing the cutover. Based on the snippet below I should see a line similar to 'Intercepted changelog state ' in the log file, but grepping only yields Intercepted changelog state GhostTableMigrated. Disclaimer: My go-fu is so-so, I may have misread some parts of the code.

// onChangelogStateEvent is called when a binlog event operation on the changelog table is intercepted.
func (this *Migrator) onChangelogStateEvent(dmlEvent *binlog.BinlogDMLEvent) (err error) {
    // Hey, I created the changelog table, I know the type of columns it has!
    if hint := dmlEvent.NewColumnValues.StringColumn(2); hint != "state" {
        return nil
    }
    changelogStateString := dmlEvent.NewColumnValues.StringColumn(3)
    changelogState := ReadChangelogState(changelogStateString)
    log.Infof("Intercepted changelog state %s", changelogState)
    switch changelogState {
    case GhostTableMigrated:
        {
            this.ghostTableMigrated <- true
        }
    case AllEventsUpToLockProcessed:
        {
            var applyEventFunc tableWriteFunc = func() error {
                this.allEventsUpToLockProcessed <- changelogStateString
                return nil
            }
            // at this point we know all events up to lock have been read from the streamer,
            // because the streamer works sequentially. So those events are either already handled,
            // or have event functions in applyEventsQueue.
            // So as not to create a potential deadlock, we write this func to applyEventsQueue
            // asynchronously, understanding it doesn't really matter.
            go func() {
                this.applyEventsQueue <- newApplyEventStructByFunc(&applyEventFunc)
            }()
        }
    default:
        {
            return fmt.Errorf("Unknown changelog state: %+v", changelogState)
        }
    }
    log.Infof("Handled changelog state %s", changelogState)
    return nil
}

Tried to run the same migration twice (currently retrying for the 3rd time) and the result is exactly the same. We already have run several successful migrations before on the same table and others.

shlomi-noach commented 6 years ago

Thank you for the detailed report. I’ll look into it.

raniejade commented 6 years ago

It finally worked, for some reason.

Copy: 883626763/883626763 100.0%; Applied: 33323200; Backlog: 12/1000; Time: 98h59m30s(total), 48h31m10s(copy); streamer: mysql-bin-changelog.077450:855325; State: postponing cut-over; ETA: due
# Migrating `media`.`media_bundle`; Ghost table is `media`.`_media_bundle_gho`
# Migrating ip-172-17-1-238:3306; inspecting ip-172-17-3-149:3306; executing on ip-10-0-9-219
# Migration started at Thu Apr 19 04:03:47 +0000 2018
# chunk-size: 1000; max-lag-millis: 500ms; dml-batch-size: 10; max-load: ; critical-load: ; nice-ratio: 0.000000
# throttle-flag-file: ./throttle.flag
# throttle-additional-flag-file: /tmp/gh-ost.throttle
# throttle-control-replicas count: 3
# postpone-cut-over-flag-file: ./cutover.flag [set]
# panic-flag-file: ./panic.flag
# Serving on unix socket: /tmp/gh-ost.media.media_bundle.sock
Copy: 883626763/883626763 100.0%; Applied: 33325625; Backlog: 33/1000; Time: 99h0m0s(total), 48h31m10s(copy); streamer: mysql-bin-changelog.077450:4508481; State: postponing cut-over; ETA: due
Copy: 883626763/883626763 100.0%; Applied: 33327979; Backlog: 72/1000; Time: 99h0m30s(total), 48h31m10s(copy); streamer: mysql-bin-changelog.077450:8145010; State: postponing cut-over; ETA: due
Copy: 883626763/883626763 100.0%; Applied: 33330435; Backlog: 13/1000; Time: 99h1m0s(total), 48h31m10s(copy); streamer: mysql-bin-changelog.077450:11853964; State: postponing cut-over; ETA: due
2018-04-23 07:04:51 INFO Grabbing voluntary lock: gh-ost.9124052.lock
2018-04-23 07:04:51 INFO Setting LOCK timeout as 6 seconds
2018-04-23 07:04:51 INFO Looking for magic cut-over table
2018-04-23 07:04:51 INFO Creating magic cut-over table `media`.`_media_bundle_del`
2018-04-23 07:04:52 INFO Magic cut-over table created
2018-04-23 07:04:52 INFO Locking `media`.`media_bundle`, `media`.`_media_bundle_del`
2018-04-23 07:04:52 INFO Tables locked
2018-04-23 07:04:52 INFO Session locking original & magic tables is 9124052
2018-04-23 07:04:52 INFO Writing changelog state: AllEventsUpToLockProcessed:1524467092042964910
2018-04-23 07:04:52 INFO Waiting for events up to lock
2018-04-23 07:04:52 INFO Intercepted changelog state AllEventsUpToLockProcessed
2018-04-23 07:04:52 INFO Handled changelog state AllEventsUpToLockProcessed
2018-04-23 07:04:52 INFO Waiting for events up to lock: got AllEventsUpToLockProcessed:1524467092042964910
2018-04-23 07:04:52 INFO Done waiting for events up to lock; duration=403.148675ms
# Migrating `media`.`media_bundle`; Ghost table is `media`.`_media_bundle_gho`
# Migrating ip-172-17-1-238:3306; inspecting ip-172-17-3-149:3306; executing on ip-10-0-9-219
# Migration started at Thu Apr 19 04:03:47 +0000 2018
# chunk-size: 1000; max-lag-millis: 500ms; dml-batch-size: 10; max-load: ; critical-load: ; nice-ratio: 0.000000
# throttle-flag-file: ./throttle.flag
# throttle-additional-flag-file: /tmp/gh-ost.throttle
# throttle-control-replicas count: 3
# postpone-cut-over-flag-file: ./cutover.flag
# panic-flag-file: ./panic.flag
# Serving on unix socket: /tmp/gh-ost.media.media_bundle.sock
Copy: 883626763/883626763 100.0%; Applied: 33330765; Backlog: 0/1000; Time: 99h1m4s(total), 48h31m10s(copy); streamer: mysql-bin-changelog.077450:12390373; State: migrating; ETA: due
2018-04-23 07:04:52 INFO Setting RENAME timeout as 3 seconds
2018-04-23 07:04:52 INFO Session renaming tables is 9131747
2018-04-23 07:04:52 INFO Issuing and expecting this to block: rename /* gh-ost */ table `media`.`media_bundle` to `media`.`_media_bundle_del`, `media`.`_media_bundle_gho` to `media`.`media_bundle`
2018-04-23 07:04:52 INFO Found atomic RENAME to be blocking, as expected. Double checking the lock is still in place (though I don't strictly have to)
2018-04-23 07:04:52 INFO Checking session lock: gh-ost.9124052.lock
2018-04-23 07:04:52 INFO Connection holding lock on original table still exists
2018-04-23 07:04:52 INFO Will now proceed to drop magic table and unlock tables
2018-04-23 07:04:52 INFO Dropping magic cut-over table
2018-04-23 07:04:52 INFO Releasing lock from `media`.`media_bundle`, `media`.`_media_bundle_del`
2018-04-23 07:04:52 INFO Tables unlocked
2018-04-23 07:04:52 INFO Tables renamed
2018-04-23 07:04:52 INFO Lock & rename duration: 494.933783ms. During this time, queries on `media_bundle` were blocked
2018-04-23 07:04:52 INFO Looking for magic cut-over table
2018/04/23 07:04:52 binlogsyncer.go:107: [info] syncer is closing...
2018/04/23 07:04:52 binlogstreamer.go:47: [error] close sync with err: sync is been closing...
2018/04/23 07:04:52 binlogsyncer.go:122: [info] syncer is closed
2018-04-23 07:04:52 INFO Closed streamer connection. err=<nil>
2018-04-23 07:04:52 INFO Dropping table `media`.`_media_bundle_ghc`
2018-04-23 07:04:52 INFO Table dropped
2018-04-23 07:04:52 INFO Am not dropping old table because I want this operation to be as live as possible. If you insist I should do it, please add `--ok-to-drop-table` next time. But I prefer you do not. To drop the old table, issue:
2018-04-23 07:04:52 INFO -- drop table `media`.`_media_bundle_del`
2018-04-23 07:04:52 INFO Done migrating `media`.`media_bundle`
2018-04-23 07:04:52 INFO Removing socket file: /tmp/gh-ost.media.media_bundle.sock
2018-04-23 07:04:52 INFO Tearing down inspector
[MySQL] 2018/04/23 07:04:52 packets.go:118: write tcp 10.0.9.219:55570->10.0.16.20:3306: write: connection timed out
2018-04-23 07:04:52 INFO Tearing down applier
[MySQL] 2018/04/23 07:04:52 packets.go:118: write tcp 10.0.9.219:48704->10.0.18.162:3306: write: connection timed out
2018-04-23 07:04:52 INFO Tearing down streamer
2018-04-23 07:04:52 INFO Tearing down throttler
# Done
--- All done

Interesting take from this, during the last minutes before the cutover I still see some backlog being processed.

Copy: 883626763/883626763 100.0%; Applied: 33330435; Backlog: 13/1000; Time: 99h1m0s(total), 48h31m10s(copy); streamer: mysql-bin-changelog.077450:11853964; State: postponing cut-over; ETA: due
2018-04-23 07:04:51 INFO Grabbing voluntary lock: gh-ost.9124052.lock

However on the previous runs, the backlog was always 0. So probably it might have stopped receiving bin_log events even before the rows were fully copied. I'm going to look at our previous logs to see if I can see something useful.

luoxiaocai91 commented 6 years ago

we had this problem too。why? Copy: 5158614/5158614 100.0%; Applied: 0; Backlog: 0/1000; Time: 3m5s(total), 2m56s(copy); streamer: mysql-bin.000206:485009551; State: migrating; ETA: due 2018-05-14 17:10:26 ERROR Timeout while waiting for events up to lock 2018-05-14 17:10:26 ERROR 2018-05-14 17:10:26 ERROR Timeout while waiting for events up to lock 2018-05-14 17:10:26 INFO Looking for magic cut-over table 2018-05-14 17:10:26 INFO Will now proceed to drop magic table and unlock tables 2018-05-14 17:10:26 INFO Dropping magic cut-over table 2018-05-14 17:10:26 INFO Dropping magic cut-over table

shlomi-noach commented 6 years ago

Which MySQL version is involved? Parallel replication? Thread cache? Any other interesting setup?