kalenjordan / magerun-addons

MageRun Addons
116 stars 30 forks source link

Massive customer:anon speedup by leveraging MySQL to do the heavy liftin... #7

Closed jonathanvaughn closed 10 years ago

jonathanvaughn commented 10 years ago

We've been using your wonderful customer:anon function for awhile, but recently we've had some rather large (in the most recent case, over 500,000 rows just on customer_entity) customer DBs to anonymize for development work, and it can take a long time. This most recent case, we decided to find a way to do this faster, because from eyeballing the progress we were guessing at anywhere from 60 to 100 hours to complete just customer_entity. D:

From checking with XDebug, it appears nearly all the time was spent waiting on MySQL. We did some experimenting and found some various different methods, and then tested them, etc. It seems that MySQL spends a lot of time searching for the row and/or compiling the query vs the actual time to change the row, because of how much a difference handing one query per table to the server (and letting it do all the work) makes vs the original implementation (handling each email address one at a time as a separate query).

This speedup change still keeps the emails consistent across tables, while still making them non-recoverable (well, at least for any trivial MD5 attack, since the salt is random and not stored) and (very probably - the nature of hashes being what they are, there's a small chance of collision) unique.

We had already used this technique manually on the database in question and it took just a few minutes, though we've yet to do a full scale test again and I didn't actually time it, so I can't say yet what the full DB speedup is, but we did some testing with only 10,000 rows per table:

I'm currently re-importing the original DB into my test DB with the full ~half a million rows+, but it's going to take awhile so I'll do a full scale test tomorrow. Meanwhile, here's a pull request :D

kalenjordan commented 10 years ago

@jonathanvaughn brilliant! It's a little embarrassing to see my implementation alongside yours now! I rolled this out quickly and didn't take the time to properly optimize it since I hadn't much need for it yet. I think one store that had around ~200k customers I literally just waited a day and let it run in the background, since it didn't really block any of my work to just let it run.

Very nicely done! And thanks for the contribution!

jonathanvaughn commented 10 years ago

Just for completeness, here's the time it took on the full copy of the DB:

real 5m51.681s user 0m1.445s sys 0m0.081s

Most of that is obviously on the mysql server side, but even then, ~6 minutes beats a week or so :D

kalenjordan commented 10 years ago

:+1: