github / gh-ost

GitHub's Online Schema-migration Tool for MySQL
MIT License
12.44k stars 1.26k forks source link

Data Truncated for column 'X' on Binlog Updates #469

Open Xopherus opened 7 years ago

Xopherus commented 7 years ago

Hello there! First off, thank you for this fantastic tool. My team chose gh-ost over percona specifically due to the rich featureset that gh-ost provides, including built-in throttling, live commands, and delayed cutover. It's worked like a charm so far.

I've been using gh-ost to update some column types in a very large table (~2 billion rows). I was able to run full tests on our Dev and Staging environments which worked without a hitch. However, when I ran this on our production database I found some behavior that I did not expect (nor did I see this on the previous tests).

ERROR Error 1265: Data truncated for column 'type_post' at row 1; query=
            replace /* gh-ost `alerts`.`_post_gho` */ into
                `alerts`.`_post_gho`
                    (`id`, `post_number`, `created`, `type_post`)
                values
                    (?, ?, ?, ?)
        ; args=[2062578694 1389934815931205648_2691385137 <nil> ]

This error happened about 30 times and then the command panicked and exited. Upon further inspection I realized that the data that was being inserted was invalid - if you look at the number of args (3) vs the number of expected values (4) they don't seem to match up. HOWEVER, this is because the row's type_post column has an empty string value, so it doesn't show up in the log.

This type_post column in particular is interesting because it's an enum type which accepts two values (post or comment). I was able to inspect the data for other rows which also had invalid values for type_post - and it turns out there were several of them. In fact, the dev and staging databases that I tested on ALSO had these invalid rows - but I didn't know why I didn't see this error before. After a bit of digging, I think I understand what's going on.

When gh-ost copies data from the original table to the ghost table, it uses insert ignore syntax (see buildRangeInsertQuery). This specifically suppresses errors from mysql, including errors which would be raised when inserting invalid data into an enum field. HOWEVER when inspecting binlog events, gh-ost will use replace into syntax (see BuildDMLInsertQuery).

My hunch is that when testing on dev/staging, gh-ost did not read any update events from the binlog which would have affected an invalid row, which means those rows were copied using insert ignore but never updated once they were copied. When running the same command on production, it failed because those rows were updated AFTER being copied.

My question is: should this behavior be handled within gh-ost? I think this particular scenario points out a serious flaw on my own database architecture (which I'm addressing). However, I believe that gh-ost should be consistent in how it handles errors - if buildRangeInsertQuery is allowed to ignore errors, then BuildDMLInsertQuery should as well. Or maybe both of them should fail on errors?

shlomi-noach commented 7 years ago

@Xopherus thank you for this excellent investigation.

This error happened about 30 times and then the command panicked and exited.

That is because gh-ost retries on error and eventually panics.

When gh-ost copies data from the original table to the ghost table, it uses insert ignore syntax (see buildRangeInsertQuery). This specifically suppresses errors from mysql, including errors which would be raised when inserting invalid data into an enum field. HOWEVER when inspecting binlog events, gh-ost will use replace into syntax (see BuildDMLInsertQuery).

correct.

My hunch is that when testing on dev/staging, gh-ost did not read any update events from the binlog which would have affected an invalid row, which means those rows were copied using insert ignore but never updated once they were copied. When running the same command on production, it failed because those rows were updated AFTER being copied.

Makes sense. Indeed, what gh-ost cannot emulate not test is the particular workload at the time of execution. For large tables, this is usually not an issue, because on long migrations you will eventually meet all the insert, delete, update statements. However in your case there were some specific updates with invalid enum values, that were so sparse that your test run never caught them.

My question is: should this behavior be handled within gh-ost? I think this particular scenario points out a serious flaw on my own database architecture (which I'm addressing).

To begin with, gh-ost's refusal to apply the "bad" values is intentional.

Apparently your servers run with non-strict sql_mode. This is the "flaw on my own database architecture" you refer to. That is, the app should not have placed those values, but then the database should have refused them.

gh-ost is explicitly setting sql_mode (+=) STRICT_ALL_TABLES (see #164 , #191). This seems to have saved the day for a few users.

However, I believe that gh-ost should be consistent in how it handles errors - if buildRangeInsertQuery is allowed to ignore errors, then BuildDMLInsertQuery should as well. Or maybe both of them should fail on errors?

This is a very good observation.

With current row-copy method it will unfortunately be not feasible to detect the particular Error 1265. This is because we're copying N rows at a time (e.g. 100 at a time) and we only get one error code back. We IGNORE those because duplicate key errors are expected, and those are the erros we're hoping for. But as you suggest, there could be other errors, and they're not being reported.

This may be solved with "managed row copy" or gh-ost-based row copy, which is something we were working on, then set aside, and then we may come back to this soon. In this method we read values from table via gh-ost and write them back to the ghost table. Not via insert ignore ... select but via select and insert ignore.

Summary: your observations are all correct; handling this scenario may happen in the future as per https://github.com/github/gh-ost/pull/91 or derived work.