mozilla / addons

☂ Umbrella repository for Mozilla Addons ✨
Other
127 stars 41 forks source link

Remaining legacy migration problems before django migrations #6974

Closed eviljeff closed 5 years ago

eviljeff commented 5 years ago

Prod has the following problems that need addressing

In addition, we found this issue on stage and will affect prod too

diox commented 5 years ago

My plan is to divide the rest of the migrations into two categories:

We should review the migrations to see if there are ALTER TABLE that that add a constraint that was not there before, things like that, and prepare the relevant statement to fix the data, adding that to the first list. Then:

diox commented 5 years ago

Note that we don't necessarily have to do all migrations in order. We do want to track which ones we've ran where though. I'm thinking in particular of running the one that touches update_counts separately from the rest: it's an ALTER, but it doesn't need the whole site to be put in read-only - it just needs the stats commands that are run via cron to not run while it's taking place.

willdurand commented 5 years ago

Sounds like a good plan

eviljeff commented 5 years ago

mozilla/addons#6976 is all the migrations as a checklist

diox commented 5 years ago

Addition to the plan: if ops have time, it'd be good to have a copy of the production database on which to run the update_counts ALTER TABLE to see how long it'd take in prod.

diox commented 5 years ago

In fact we should do this for all migrations. So, the plan for today is:

DELETE FROM update_counts USING update_counts LEFT JOIN addons ON addon_id = addons.id WHERE addons.id IS NULL AND addon_id IS NOT NULL;

UPDATE versions SET license_id=5 WHERE license_id IN (6316,6317); DELETE FROM licenses WHERE id IN (6316, 6317);

- Make a clone of production db that includes those fixes, and try to run every migration in https://github.com/mozilla/addons/issues/6976 one by one (maybe doing them in reverse order they are listed, to avoid the big ones first)
- Note how much time they take, to prepare for running them on Tuesday. If any of them fails, build a fix to clean up the corresponding table
- Bonus: Run the data fixes for username (run this multiple times until all data is fixed, altering the `LIMIT` value if needed. Use a new sql shell each time to automatically drop the temporary table):
```sql
CREATE TEMPORARY TABLE `duplicate_usernames`
    SELECT LOWER(`username`) AS `username` FROM `users`
        GROUP BY LOWER(`username`) HAVING COUNT(*) > 1 LIMIT 50000;
UPDATE `users`
    SET `username`=CONCAT('anonymous-', MD5(RAND()))
    WHERE LOWER(`username`) != `username`
    AND LOWER(`username`) IN (SELECT `username` FROM `duplicate_usernames`);
bqbn commented 5 years ago

Finished the 4 queries

MySQL [addons_mozilla_org]> DELETE FROM `file_validation` USING `file_validation`
    ->     LEFT JOIN `files` ON `file_id` = `files`.`id`
    ->     WHERE `files`.`id` IS NULL AND `file_id` IS NOT NULL;
Query OK, 1 row affected (35.24 sec)

MySQL [addons_mozilla_org]> DELETE FROM `update_counts` USING `update_counts`
    ->     LEFT JOIN `addons` ON `addon_id` = `addons`.`id`
    ->     WHERE `addons`.`id` IS NULL AND `addon_id` IS NOT NULL;
Query OK, 49448 rows affected (1 hour 12 min 27.54 sec)

MySQL [addons_mozilla_org]> UPDATE `versions` SET `license_id`=5 WHERE `license_id` IN (6316,6317);
Query OK, 24 rows affected (0.02 sec)
Rows matched: 24  Changed: 24  Warnings: 0

MySQL [addons_mozilla_org]> DELETE FROM `licenses` WHERE `id` IN (6316, 6317);
Query OK, 2 rows affected (0.01 sec)
bqbn commented 5 years ago

Note how much time they take, to prepare for running them on Tuesday. If any of them fails, build a fix to clean up the corresponding table

All migrations except 1200 have finished testing. Below are the results. And note that results from migrations that took less than 1 minute are omitted.

ALTER TABLE in migration 1200 has been running 4.5 hours and not finished yet.

diox commented 5 years ago

Ok cool. So we should run all remaining migrations except 1200 today in read-only mode. Based on those results they should all apply cleanly now, and the whole thing should take around one hour.

Then we'll need to figure out the usernames one and the update counts. For the update counts, it depends on how long it takes, but since nothing should write to this table except for the update_counts_from_file cron that runs every day, maybe it should be ok to run it while the site is operating normally and just let it take a long time?

bqbn commented 5 years ago

ALTER TABLE in migration 1200 has been running 4.5 hours and not finished yet.

Now it's finished with this amount of time taken.

Query OK, 379915506 rows affected (9 hours 55 min 8.37 sec)
Records: 379915506  Duplicates: 0  Warnings: 0
bqbn commented 5 years ago

So regarding updating the users table, I have tried the CREATE TEMPORARY TABLE + UPDATE user method on a test db. Here are a couple of comments.

  1. The MD5(RAND()) is not random enough. It can only generate 10,000 to 15,000 unique random strings. During my test, first I had to change the LIMIT 50000 clause to LIMIT 5000 in order to get a successful run for UPDATE users. And then with LIMIT 5000, I can only run the 2 queries (CREATE TEMPORARY TABLE + UPDATE users) twice. I always get an error for the third run, i.e., ERROR 1062 (23000): Duplicate entry.

  2. I had to create an index when creating the temporary table, i.e.

    CREATE TEMPORARY TABLE `duplicate_usernames` (UNIQUE (username))
      SELECT LOWER(`username`) AS `username` FROM `users`
          GROUP BY LOWER(`username`) HAVING COUNT(*) > 1 LIMIT 5000;

    Without the index, it will take 4+ hours for the UPDATE users to finish even with just 5000 duplicated usernames, and there are about 130K of them.

In summary,

  1. we need a different random algorithm that can generate at least 130K unique random strings without repeating itself.

  2. I think previously, the most time consuming part was probably the IN (SELECT username FROM duplicate_usernames) clause, because it would have to do sequential scan to find the match, with the index, we may not need the LIMIT clause when creating the temporary table;

diox commented 5 years ago

That's useful info, thanks. Can you try the same approach with the index and LOWER(HEX(RANDOM_BYTES(16))) as the random function instead of MD5(RAND()) ?

bqbn commented 5 years ago

@diox LOWER(HEX(RANDOM_BYTES(16))) worked. With that and the UNIQUE index, it took 46 seconds to finish on the test db.

Query OK, 149187 rows affected (46.36 sec)
Rows matched: 149187  Changed: 149187  Warnings: 0

https://github.com/mozilla/addons/issues/1812 says that we need to apply ALTER TABLE `users` CHANGE COLUMN `username` `username` VARCHAR (255) NOT NULL; to -prod after fixing the inconsistency, is that still true?

diox commented 5 years ago

Yes, the ALTER is what gets rid of the collation we no longer want on the column.

bqbn commented 5 years ago

Hmm, the ALTER query fails with the following error,

mysql> ALTER TABLE `users` CHANGE COLUMN `username` `username` VARCHAR (255) NOT NULL;
ERROR 1062 (23000): Duplicate entry '***-häusler' for key 'username'

There seems to be a UTF8 character in that username, which may have caused the failure.

If I ran select username from users where username like '***-h%';, 6 rows were returned, and in addition to the one in the error message, another one of them was ***-häußler.

Note: *** represents a few ASCII characters, used to protect user privacy.

diox commented 5 years ago

Ah, yes, this is one consequence of the current collation we're getting rid of. It's possible there are several cases like this one, I was hoping that the check used to create the temporary duplicate_usernames usernames tables was enough, but looks like it isn't.

We can re-use the same logic and reset those usernames to an anonymous version. If there are too many of them we'll need to change the CREATE TEMPORARY TABLE + UPDATE query to include them, if not maybe we can just do it manually for those affected.

bqbn commented 5 years ago

If there are too many of them we'll need to change the CREATE TEMPORARY TABLE + UPDATE query to include them, if not maybe we can just do it manually for those affected.

I don't know how many there are. How do we find it out?

We can re-use the same logic and reset those usernames to an anonymous version.

I'm not sure about what you mean by "re-use the same logic" part, do you mean run the same CREATE TEMPORARY TABLE query twice?

That won't work because with the current CREATE TEMPORARY TABLE query, it doesn't think ***-häusler and ***-häußler are duplicated usernames, thus the ALTER TABLE failed.

diox commented 5 years ago

I don't know how many there are. How do we find it out?

I'm not sure besides just running the query multiple times and fixing the offending usernames every time, tbh. The query should already have picked it up :(

I'm not sure about what you mean by "re-use the same logic" part, do you mean run the same CREATE TEMPORARY TABLE query twice? That won't work because with the current CREATE TEMPORARY TABLE query, it doesn't think -häusler and -häußler are duplicated usernames, thus the ALTER TABLE failed.

We can manually add problematic usernames to the temporary tables (adding an INSERT INTO before the ALTER).

bqbn commented 5 years ago

Here is what i just ran on the test db, and it failed.

mysql> INSERT INTO `duplicate_usernames` (`username`) VALUES ('***-häusler');
Query OK, 1 row affected (0.00 sec)

mysql> UPDATE `users`
    ->      SET `username`=CONCAT('anonymous-', HEX(RANDOM_BYTES(16)))
    ->      WHERE LOWER(`username`) != `username`
    ->      AND LOWER(`username`) IN (SELECT `username` FROM `duplicate_usernames`);
Query OK, 0 rows affected (51.67 sec)
Rows matched: 0  Changed: 0  Warnings: 0

mysql> SELECT `username` FROM `duplicate_usernames`;
+------------------+
| username         |
+------------------+
| ***-häusler  |
+------------------+
1 row in set (0.00 sec)

mysql> ALTER TABLE `users` CHANGE COLUMN `username` `username` VARCHAR (255) NOT NULL;
ERROR 1062 (23000): Duplicate entry '***-häusler' for key 'username'

I don't know what the purpose is for LOWER(`username`) != `username` clause, but I suspect it doesn't work well with usernames that have UTF8 characters in them.

bqbn commented 5 years ago

Latest update:

So far I've found 2 usernames that have UTF8 characters in them and have caused ALTER TABLE to fail. Manually adding the 2 usernames into the temp table, and then run UPDATE without the LOWER(`username`) != `username` worked for these 2 usernames.

Current issue at hand,

mysql> ALTER TABLE `users` CHANGE COLUMN `username` `username` VARCHAR (255) NOT NULL;
ERROR 1062 (23000): Duplicate entry 'fas' for key 'username'

Even though this username doesn't have any UTF8 characters, @diox theory is that there may be other usernames that do have UTF8 characters that conflict with this one. However, there are thousands of usernames that contains fas, so it's nearly impossible to manually identify which one causes the issue.

mysql> select count(*)  from users where username like '%fas%';
+----------+
| count(*) |
+----------+
|     2633 |
+----------+

I guess one way to get around the fas username is to manually change it to something random, however, the bigger questions is how many of those cases exist in the database? Manually trying them out could easily take hours.

Can we have a programmatical way to first figure out how many such conflicts there are, then we can come up with a plan to fix it, either manually or programmatically?

bqbn commented 5 years ago

Can we have a programmatical way to first figure out how many such conflicts there are, then we can come up with a plan to fix it, either manually or programmatically?

I sort of answered the first part of my own question by running this,

mysql> select count(*) from users where username <> CONVERT(username  USING ASCII);
+----------+
| count(*) |
+----------+
|   115338 |
+----------+

With that many usernames that contain non ASCII characters, I don't think we can solve the issue manually (e.g. I've spent over an hour repeatedly running the queries manually, and only found and fixed 5 such duplicated names, and there are apparently many many more).

diox commented 5 years ago

Closing this issue, almost everything was done, we'll deal with usernames in mozilla/addons#1812