github / gh-ost

GitHub's Online Schema-migration Tool for MySQL
MIT License
12.27k stars 1.25k forks source link

when add unique index may lose data #167

Closed rj03hou closed 8 years ago

rj03hou commented 8 years ago

table and data

create table test (id int not null primary key, name char(255));
insert into test values(1,'a'),(2,'b'),(3,'a');

alter table test add unique index uniq_name(name);

when execute the alter ddl, the value of (1,'a') should loss, and pt-online-schema-change have this problem tool. I think the good solution is check if have duplicate value then fail, or have a config not allow add unique index.

shlomi-noach commented 8 years ago

This is beyond the scope of gh-ost to handle, and is within the scope of the DBA avoiding shooting herself in the foot.

Checking for duplicates could take hours, and by the time the check is complete, new duplicates may be introduced.

rj03hou commented 8 years ago

when we use the pt-osc, and some RD complain that use our tool have lose data. so if they add unique index we suggest they must be careful.

shlomi-noach commented 8 years ago

Good suggestion.

isegal commented 7 years ago

Is there a way to make gh-ost fail with an exit code if there is an attempt to insert a row with a duplicate key during the copy operation? In other words, we are ok with taking hours on large tables so long as there is no silent failure with a data loss. The preferred way is to notify the user and exit, allowing the invoking tools/scripts to take appropriate action such as cleaning up the temporary gh-ost tables, etc.

shlomi-noach commented 7 years ago

Is there a way to make gh-ost fail with an exit code if there is an attempt to insert a row with a duplicate key during the copy operation?

I believe this to be impractical, adding a SELECT for every single UPDATE or INSERT gh-ost executes, potentially on a non-indexed column.

isegal commented 7 years ago

Unless you are doing upsert, the INSERT operation itself will throw a 'duplicate key' error if an attempt is made to insert a unique value that already exists. Not sure about UPDATE, but I can do a quick test if you like.

isegal commented 7 years ago

Just did a quick glance, apologize ahead of time if I'm looking in the wrong code. https://github.com/github/gh-ost/blob/master/go/sql/builder.go#L218 it looks like it's doing an 'insert ignore', which would (by design?) ignore any issues such as a duplicate key. What we'd like to propose is some kind of command line switch that will do a non-ignore insert, and cause gh-ost to exit with an error if there is an issue writing to the destination table. I.e. a strict mode so to speak.

shlomi-noach commented 7 years ago

@isegal there's both insert ignore and replace into involved.

Both are needed because gh-ost populates the ghost table in two ways:

The two may (and often do) interfere with each other. We may process an INSERT statement in the binary log after having copied the row into the table.

And further considering the question, I'm not sure this is at all possible. On the ghost table, the data is most of the time incomplete (when the data is complete we proceed to cut-over the migration). When the data is incomplete I'm unable to tell there would be a UNIQUE violation.

An alternative is to read from the original table.

Perhaps I'm missing something here but right now I don't think this can work.

isegal commented 7 years ago

Understood. Thank you for the clarification.

grypyrg commented 7 years ago

pt-online-schema-change has the same problem at the moment. I don't see an easy way to prevent this for all use cases, besides warning the user of this prior to accepting the schema change.