github / gh-ost

GitHub's Online Schema-migration Tool for MySQL
MIT License
12.34k stars 1.25k forks source link

StreamEvents encountered unexpected error #570

Closed FortuneOSarcasm closed 6 years ago

FortuneOSarcasm commented 6 years ago

I'm busy trying to test with gh-ost for the first time and I'm getting the following error

binlogstreamer.go:47: [error] close sync with err: unsupport type 0 in binlog and don't know how to handle INFO StreamEvents encountered unexpected error: unsupport type 0 in binlog and don't know how to handle

I'm running the following command the execute the test

gh-ost \ --user="ghost" \ --assume-master-host=dbserver-replica \ --password="xxxxxxxx" \ --host=dbserver-master \ --test-on-replica \ --database="db1" \ --table="table1" \ --verbose \ --alter="CHARSET=utf8" \ --initially-drop-ghost-table \ --initially-drop-old-table \ --max-load=Threads_running=30 \ --switch-to-rbr \ --chunk-size=500 \ --cut-over=default \ --exact-rowcount \ --concurrent-rowcount \ --serve-socket-file=/home/user/work/gh-ost.test.sock \ --panic-flag-file=/home/user/work/gh-ost.panic.flag \ --execute --stack --debug --verbose

I'm using the latest version of gh-ost

shlomi-noach commented 6 years ago

The error

binlogstreamer.go:47: [error] close sync with err: unsupport type 0 in binlog and don't know how to handle

... comes from an underlying library gh-ost uses, and is OK to ignore. I understand it may be daunting to see it. Next release will have a bit fewer of these; in particular there's a fix to the JSON data type.

Since you're evaluating gh-ost, may I urge you to use the built-in testing capability with which you can build your trust?

FortuneOSarcasm commented 6 years ago

Unfortunately even if it's allowed to run it fails with the following.

2018-04-16 18:41:13 FATAL 61 successive failures in streamer reconnect at coordinates mysql- bin.003453:4 goroutine 8 [running]: runtime/debug.Stack(0x68, 0x100, 0xc4200ca370) /usr/local/go/src/runtime/debug/stack.go:24 +0xa7 runtime/debug.PrintStack() /usr/local/go/src/runtime/debug/stack.go:16 +0x22 github.com/outbrain/golib/log.logErrorEntry(0x0, 0x93ad00, 0xc4202a04b0, 0x1, 0x101) /home/andrew.jack/work/src/github.com/outbrain/golib/log/log.go:178 +0xfd github.com/outbrain/golib/log.Fatale(0x93ad00, 0xc4202a04b0, 0x0, 0x0) /home/andrew.jack/work/src/github.com/outbrain/golib/log/log.go:255 +0x3e github.com/github/gh-ost/go/logic.(Migrator).listenOnPanicAbort(0xc420130090) /home/andrew.jack/work/src/github.com/github/gh-ost/go/logic/migrator.go:224 +0x6a created by github.com/github/gh-ost/go/logic.(Migrator).Migrate /home/andrew.jack/work/src/github.com/github/gh-ost/go/logic/migrator.go:291 +0x25e

shlomi-noach commented 6 years ago

Huh. Can you please share:

FortuneOSarcasm commented 6 years ago

Looks like it's not really ever copying anything

2018-04-17 07:18:01 INFO Dropping table hostel_rob._SuppInfo_gho 2018-04-17 07:18:01 INFO Table dropped 2018-04-17 07:18:01 INFO Dropping table hostel_rob._SuppInfo_del 2018-04-17 07:18:01 INFO Table dropped 2018-04-17 07:18:01 INFO Dropping table hostel_rob._SuppInfo_ghc 2018-04-17 07:18:01 INFO Table dropped 2018-04-17 07:18:01 INFO Creating changelog table hostel_rob._SuppInfo_ghc 2018-04-17 07:18:01 INFO Changelog table created 2018-04-17 07:18:01 INFO Creating ghost table hostel_rob._SuppInfo_gho 2018-04-17 07:18:01 INFO Ghost table created 2018-04-17 07:18:01 INFO Altering ghost table hostel_rob._SuppInfo_gho 2018-04-17 07:18:01 DEBUG ALTER statement: alter / gh-ost / table hostel_rob._SuppInfo_gho CHARSET=utf8 2018-04-17 07:18:01 INFO Ghost table altered 2018-04-17 07:18:01 INFO Waiting for ghost table to be migrated. Current lag is 0s 2018-04-17 07:18:06 INFO Reconnecting... Will resume at :0 2018/04/17 07:18:06 binlogsyncer.go:79: [info] create BinlogSyncer with config {99999 mysql cw-db266-98.cw.hostelworld.com 3306 ghost false false } 2018-04-17 07:18:06 INFO Connecting binlog streamer at mysql-bin.003458:4 2018/04/17 07:18:06 binlogsyncer.go:246: [info] begin to sync binlog from position (mysql-bin.003458, 4) 2018/04/17 07:18:06 binlogsyncer.go:139: [info] register slave for master server cw-db266-98.cw.hostelworld.com:3306 2018/04/17 07:18:06 binlogsyncer.go:573: [info] rotate to (mysql-bin.003458, 4) 2018/04/17 07:18:06 binlogstreamer.go:47: [error] close sync with err: unsupport type 0 in binlog and don't know how to handle

And the table structure

CREATE TABLE SuppInfo ( InfoNO int(11) NOT NULL AUTO_INCREMENT, HostelNumber int(11) NOT NULL DEFAULT '0', Type varchar(30) NOT NULL DEFAULT '', TourID varchar(30) NOT NULL DEFAULT '', Text text NOT NULL, TourUID int(11) NOT NULL DEFAULT '0', LastModified timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (InfoNO), KEY HostelNumber (HostelNumber), KEY TourUID (TourUID), KEY Type (Type), KEY TourUID_2 (TourUID), KEY hostelno_type (HostelNumber,Type), KEY idx_LastModified (LastModified) ) ENGINE=MyISAM DEFAULT CHARSET=latin1

shlomi-noach commented 6 years ago

Please add

Log up to first copying of rows

shlomi-noach commented 6 years ago

^Clarify: entire prefix of log, up to the first (Attempt) of row copy.

shlomi-noach commented 6 years ago

Can you please clarify:

gh-ost
--user="ghost"
--assume-master-host=dbserver-replica
--password="xxxxxxxx"
--host=dbserver-master

why is host=dbserver-master and then assume-master-host=dbserver-replica? This seems to be upside down.

shlomi-noach commented 6 years ago

Clarify: on test-on-replica you really want to connect to the replica, which is the whole point. So you want --host=dbserver-replica.

Unless you have a very good reason, you shouldn't be using assume-master-host, and you should let gh-ost figure this one out.

FortuneOSarcasm commented 6 years ago

Here's the full log. It does not appear to ever copy anything. ghost.log

FortuneOSarcasm commented 6 years ago

Sorry, that was a typo. I'm running

gh-ost --user="ghost" --assume-master-host=db-master --password="XXXXXXXXX" --host=db-replica --test-on-replica --database="db" --table="SuppInfo" --verbose --alter="CHARSET=utf8" --initially-drop-ghost-table --initially-drop-old-table --max-load=Threads_running=30 --switch-to-rbr --chunk-size=500 --cut-over=default --exact-rowcount --concurrent-rowcount --serve-socket-file=/home/user/work/gh-ost.test.sock --panic-flag-file=/home/user/work/gh-ost.panic.flag --execute --debug --stack --verbose

shlomi-noach commented 6 years ago

Only thing that catches my eye is version 10.1.32-MariaDB. Wondering whether:

What happens if you remove -assume-master-host?

FortuneOSarcasm commented 6 years ago

If I remove that it's trying to use a different host as master which I don't want for testing purposes. I'm not using GTID or Multisource replication. The only "special" thing we're doing is using a binlog server but I'm forcing it to use an actual server for the Master rather than the binlog service.

shlomi-noach commented 6 years ago

but I'm forcing it to use an actual server for the Master rather than the binlog service.

forcing "it" meaning forcing gh-ost, correct?

Using a binlog server is pretty special. I wonder whether you can test on a normal master-replica setup, that does not involve a binlog server, and see whether that works?

I'm not using binlog servers myself, nor do I expect gh-ost to support them. While ideally they should be transparent, maybe they are not. Out of curiosity, which binlog server (version) are you using?

FortuneOSarcasm commented 6 years ago

Yes, forcing gh-ost. I do have some servers using regular master/slave which I can test on but I don't see why it shouldn't work when it's using a real MariaDB server as it's master and the slave is also just a regular slave. As far as gh-ost is aware there are no binlog servers.

shlomi-noach commented 6 years ago

I don't see why it shouldn't work when it's using a real MariaDB server as it's master and the slave is also just a regular slave. As far as gh-ost is aware there are no binlog servers.

Computers are always full of surprises. I'm genuinely curious to try it out, if you please.

FortuneOSarcasm commented 6 years ago

It's currently running on a regular master/slave setup and it's working. The my.cnf settings are identical to the original 2 servers I tested against earlier.

shlomi-noach commented 6 years ago

@FortuneOSarcasm thank you for testing and for affirming my suspicion. Out of curiosity, which binlog server (version) are you using?

FortuneOSarcasm commented 6 years ago

We're running Maxscale 2.1

FortuneOSarcasm commented 6 years ago

I have noticed 2 other issues. gh-ost does not accept passwords longer than 32 characters and the build.sh fails if you're running go version 1.10 or higher due to how it checks version.

shlomi-noach commented 6 years ago

I have noticed 2 other issues.

Can you please open two new issues?

shlomi-noach commented 6 years ago

But, regarding password length, see https://dev.mysql.com/doc/relnotes/mysql/5.7/en/news-5-7-5.html and https://github.com/github/gh-ost/pull/489

FortuneOSarcasm commented 6 years ago

The test on traditional master/slave failed at the end.

2018-04-17 10:32:45 INFO Row copy complete
Copy: 3344433/3344433 100.0%; Applied: 0; Backlog: 0/1000; Time: 17m47s(total), 5m23s(copy); streamer: mysql-bin.000403:623191580; State: migrating; ETA: due
2018-04-17 10:32:45 INFO Stopping replication IO thread
2018-04-17 10:32:45 INFO Replication IO thread stopped
2018-04-17 10:32:45 INFO Verifying SQL thread is stopped
2018-04-17 10:32:45 INFO SQL thread stopped
2018-04-17 10:32:45 INFO Replication IO thread at mysql-bin.000399:789569656. SQL thread is at mysql-bin.000399:789569656
2018-04-17 10:32:45 INFO Grabbing voluntary lock: gh-ost.1442305.lock
2018-04-17 10:32:45 INFO Setting LOCK timeout as 6 seconds
2018-04-17 10:32:45 INFO Looking for magic cut-over table
2018-04-17 10:32:45 ERROR replication not running; Slave_IO_Running=No, Slave_SQL_Running=No
2018-04-17 10:32:45 INFO Creating magic cut-over table `hostel_rob`.`_SuppInfo_del`
2018-04-17 10:32:45 INFO Magic cut-over table created
2018-04-17 10:32:45 INFO Locking `hostel_rob`.`SuppInfo`, `hostel_rob`.`_SuppInfo_del`
2018-04-17 10:32:45 INFO Tables locked
2018-04-17 10:32:45 INFO Session locking original & magic tables is 1442305
2018-04-17 10:32:45 INFO Writing changelog state: AllEventsUpToLockProcessed:1523961165691429000
2018-04-17 10:32:45 INFO Intercepted changelog state AllEventsUpToLockProcessed
2018-04-17 10:32:45 INFO Handled changelog state AllEventsUpToLockProcessed
2018-04-17 10:32:45 INFO Waiting for events up to lock
2018-04-17 10:32:45 ERROR replication not running; Slave_IO_Running=No, Slave_SQL_Running=No
2018-04-17 10:32:45 ERROR replication not running; Slave_IO_Running=No, Slave_SQL_Running=No
2018-04-17 10:32:45 ERROR replication not running; Slave_IO_Running=No, Slave_SQL_Running=No
2018-04-17 10:32:46 ERROR replication not running; Slave_IO_Running=No, Slave_SQL_Running=No
2018-04-17 10:32:46 ERROR replication not running; Slave_IO_Running=No, Slave_SQL_Running=No
2018-04-17 10:32:46 ERROR replication not running; Slave_IO_Running=No, Slave_SQL_Running=No
2018-04-17 10:32:46 ERROR replication not running; Slave_IO_Running=No, Slave_SQL_Running=No
2018-04-17 10:32:46 ERROR replication not running; Slave_IO_Running=No, Slave_SQL_Running=No
2018-04-17 10:32:46 ERROR replication not running; Slave_IO_Running=No, Slave_SQL_Running=No
2018-04-17 10:32:46 INFO Waiting for events up to lock: got AllEventsUpToLockProcessed:1523961165691429000
2018-04-17 10:32:46 INFO Done waiting for events up to lock; duration=898.84461ms
# Migrating `hostel_rob`.`SuppInfo`; Ghost table is `hostel_rob`.`_SuppInfo_gho`
# Migrating dev-db04.dev.hostelworld.com:3306; inspecting dev-db04.dev.hostelworld.com:3306; executing on dev-db04.dev.hostelworld.com
# Migration started at Tue Apr 17 10:14:58 +0000 2018
# chunk-size: 500; max-lag-millis: 1500ms; dml-batch-size: 10; max-load: Threads_running=30; critical-load: ; nice-ratio: 0.000000
# throttle-additional-flag-file: /tmp/gh-ost.throttle
# throttle-control-replicas count: 1
# panic-flag-file: /home/andrew/work/gh-ost.panic.flag
# Serving on unix socket: /home/andrew.jack/work/gh-ost.test.sock
Copy: 3344433/3344433 100.0%; Applied: 0; Backlog: 0/1000; Time: 17m48s(total), 5m23s(copy); streamer: mysql-bin.000403:623195421; State: migrating; ETA: due
2018-04-17 10:32:46 INFO Setting RENAME timeout as 3 seconds
2018-04-17 10:32:46 INFO Session renaming tables is 1442311
2018-04-17 10:32:46 INFO Issuing and expecting this to block: rename /* gh-ost */ table `hostel_rob`.`SuppInfo` to `hostel_rob`.`_SuppInfo_del`, `hostel_rob`.`_SuppInfo_gho` to `hostel_rob`.`SuppInfo`
2018-04-17 10:32:46 ERROR replication not running; Slave_IO_Running=No, Slave_SQL_Running=No
2018-04-17 10:32:46 ERROR replication not running; Slave_IO_Running=No, Slave_SQL_Running=No
2018-04-17 10:32:46 ERROR replication not running; Slave_IO_Running=No, Slave_SQL_Running=No
2018-04-17 10:32:46 ERROR replication not running; Slave_IO_Running=No, Slave_SQL_Running=No
2018-04-17 10:32:47 ERROR replication not running; Slave_IO_Running=No, Slave_SQL_Running=No
2018-04-17 10:32:47 ERROR replication not running; Slave_IO_Running=No, Slave_SQL_Running=No
2018-04-17 10:32:47 ERROR replication not running; Slave_IO_Running=No, Slave_SQL_Running=No
2018-04-17 10:32:47 ERROR replication not running; Slave_IO_Running=No, Slave_SQL_Running=No
2018-04-17 10:32:47 ERROR replication not running; Slave_IO_Running=No, Slave_SQL_Running=No
2018-04-17 10:32:47 ERROR replication not running; Slave_IO_Running=No, Slave_SQL_Running=No
2018-04-17 10:32:47 INFO Found atomic RENAME to be blocking, as expected. Double checking the lock is still in place (though I don't strictly have to)
2018-04-17 10:32:47 INFO Checking session lock: gh-ost.1442305.lock
2018-04-17 10:32:47 INFO Connection holding lock on original table still exists
2018-04-17 10:32:47 INFO Will now proceed to drop magic table and unlock tables
2018-04-17 10:32:47 INFO Dropping magic cut-over table
2018-04-17 10:32:47 INFO Releasing lock from `hostel_rob`.`SuppInfo`, `hostel_rob`.`_SuppInfo_del`
2018-04-17 10:32:47 INFO Tables unlocked
2018-04-17 10:32:47 ERROR replication not running; Slave_IO_Running=No, Slave_SQL_Running=No
2018-04-17 10:32:47 INFO Tables renamed
2018-04-17 10:32:47 INFO Lock & rename duration: 2.017519272s. During this time, queries on `SuppInfo` were blocked
2018-04-17 10:32:47 INFO Looking for magic cut-over table
2018-04-17 10:32:47 INFO Renaming back both tables
2018/04/17 10:32:47 binlogsyncer.go:107: [info] syncer is closing...
2018/04/17 10:32:47 binlogstreamer.go:47: [error] close sync with err: sync is been closing...
2018/04/17 10:32:47 binlogsyncer.go:122: [info] syncer is closed
2018-04-17 10:32:47 INFO Closed streamer connection. err=<nil>
2018-04-17 10:32:47 INFO Dropping table `hostel_rob`.`_SuppInfo_ghc`
2018-04-17 10:32:47 INFO Table dropped
2018-04-17 10:32:47 INFO Am not dropping old table because I want this operation to be as live as possible. If you insist I should do it, please add `--ok-to-drop-table` next time. But I prefer you do not. To drop the old table, issue:
2018-04-17 10:32:47 INFO -- drop table `hostel_rob`.`_SuppInfo_del`
2018-04-17 10:32:47 INFO Done migrating `hostel_rob`.`SuppInfo`
2018-04-17 10:32:47 INFO Removing socket file: /home/andrew.jack/work/gh-ost.test.sock
2018-04-17 10:32:47 INFO Tearing down inspector
2018-04-17 10:32:47 INFO Tearing down applier
2018-04-17 10:32:47 INFO Tearing down streamer
2018-04-17 10:32:47 INFO Tearing down throttler
# Done
shlomi-noach commented 6 years ago

Pro tips: wrap your log between three backticks, and avoid quoting. That is, don't use >, and use ``` before and after the block.

shlomi-noach commented 6 years ago

The test on traditional master/slave failed at the end.

It did not fail at the end. It had errors + retries along the way, overcame those errors, and completed with a nice:

2018-04-17 10:32:47 INFO Am not dropping old table because I want this operation to be as live as possible. If you insist I should do it, please add --ok-to-drop-table next time. But I prefer you do not. To drop the old table, issue:
2018-04-17 10:32:47 INFO -- drop table hostel_rob._SuppInfo_del
2018-04-17 10:32:47 INFO Done migrating hostel_rob.SuppInfo
shlomi-noach commented 6 years ago

@FortuneOSarcasm did you notice you were testing a MyISAM table? Surely you did not intend to?

FortuneOSarcasm commented 6 years ago

Yes I did and yes that was intentional

shlomi-noach commented 6 years ago

Very good. I'm going to close this Issue if you don't mind. I do not know what MaxScale Binlog Server injected into the binary logs, and as MaxScale Binlog Server is not supported, I'm not going to further investigate. (though I should clarify in docs)

FortuneOSarcasm commented 6 years ago

Except, it never connected to a Binlog server and was only reading the local MariaDB logs.

shlomi-noach commented 6 years ago

Except, it never connected to a Binlog server and was only reading the local MariaDB logs.

That's incorrect, because gh-ost was looking at binary logs with content generated by binlog servers. The fact the same migration worked on same cluster but where binlog server was not involved clarifies that.

I would take that question to the developers of the Binlog Server, to ask what changed in the binlog contents and why. Your assumption that binlog servers are compeltely transparent is shown to be incorrect. I have no idea what the nature of the difference is, and would suggest pursuing this with them. gh-ost, in this case, merely reflects the consequence of using binlog servers, which is shown to be non-transparent. I cannot fix this, and moreover I can tell you there will be no support for MaxScale binlog servers.

FortuneOSarcasm commented 6 years ago

It wasn't the same cluster and it never used a binlog server. It was using a regular MariaDB server as the host.

shlomi-noach commented 6 years ago

@FortuneOSarcasm you previously wrote:

It's currently running on a regular master/slave setup and it's working. The my.cnf settings are identical to the original 2 servers I tested against earlier.

Do you now mean this was on a different cluster, different type of table? In which case I'd appreciate if you could please clarify what exactly you were doing.

My request was that you would test the same migration on the same cluster, on a replica that does not replicate through Binlog Server.

Now, to clarify once and for all, my understanding is: you have the following setup:

Master (MariaDB) -> MaxScale Binlog Server -> Replica (MariaDB)

is that correct?

You say that gh-ost doesn't know anything about the Binlog Server, and I say the fact the binlog server is between the master and the replica can have an effect. You say gh-ost is not aware of it, and I say I've worked with binlog servers for a year, and have seen many problems, and that the binlog server may not be as transparent as we can hope.

The experiment I want to run, and you are the only one able to run it, is to test the same migration, on the following setup:

Same Master (MariaDB) -> Replica (MariaDB)

on the same table.

Are you able to run this?

FortuneOSarcasm commented 6 years ago

I ran my second test on a different cluster using the same table. The first test was done on a server using a Binlog master so instead of allowing it to try connecting to the binlog server to use INFORMATION_SCHEMA I pointed it to a different node which was not using the Binlog. The cluster is structured so only half the nodes use binlogs as it's still in a trial period.

Gh-ost would not have connected to any Binlog server, or Master using a Binlog in any capacity. It only was reading locally generated Binlogs on a Slave that was replicating from a Binlog so those logs would not contain any "binlog changes" from a Binlog server.

shlomi-noach commented 6 years ago

Sorry, I'm afraid I do not understand. Any way you can illustrate this?

On one hand:

Gh-ost would not have connected to any Binlog server, or Master using a Binlog in any capacity.

but on the other hand:

It only was reading locally generated Binlogs on a Slave that was replicating from a Binlog

I find the two contradicting. If the replica was replicating from a Binlog Server then a binlog server is involved.

I ran my second test on a different cluster using the same table.

I actually mean the very same table, not a similar table on another cluster.

Also, I do not know what "a Binlog master" is?

shlomi-noach commented 6 years ago

I really want to help here, but I really need my questions answered, I have a few of these unanswered. Please make sure to respond to my questions.

Now, to clarify once and for all, my understanding is: you have the following setup: Master (MariaDB) -> MaxScale Binlog Server -> Replica (MariaDB) is that correct?

and

Are you able to run this?

FortuneOSarcasm commented 6 years ago

So first test was image

MariaDB doesn't change whats in the binlogs based on what the Master is so gh-ost would only have seen the local binlogs generated by MariaDB.

The second test was on a different cluster just running on a replica connecting to the real master. So the second test was done as you requested on a straight Master to Slave replication set on the exact same table just with slightly older data.

shlomi-noach commented 6 years ago

Wow I've never seen this kind of setup and honestly I don't understand how it works.

  1. Can you please run gh-ost --test-on-replica on the direct master's replica? (let's call it R1)

  2. There must be an error. You illustrated assume-master on the replica R1? assume-master should point to the master.

shlomi-noach commented 6 years ago

So the second test was done as you requested on a straight Master to Slave replication set on the exact same table just with slightly older data.

To be very clear, that wasn't what I requested. I want you to test on the same cluster.

FortuneOSarcasm commented 6 years ago

I cannot unfortunately. I was testing using a replica specifically set aside for testing things using our real environment that would not impact any other systems as it's outside of all loadbalancers.

TBH, you may as well close this issue. gh-ost doesn't behave like I expected and it's not flexible enough for what I wanted to do. I'm investigating binlog readers instead.

shlomi-noach commented 6 years ago

cool -- good luck!