ramsey / uuid

:snowflake: A PHP library for generating universally unique identifiers (UUIDs).
https://uuid.ramsey.dev
MIT License
12.47k stars · 501 forks

What are good strategies to migrate a database from default UUIDs to InnoDB-optimised binary UUIDs? #251

Closed metalmini closed 2 years ago

metalmini commented 6 years ago

Hi all,

I know this is not really an issue, but I could not find a Slack channel or another platform for a little support. Please close this if it's not appropriate, but do refer me to another platform.

Anyway, we made a mistake when implementing this bundle: we used the default settings for primary keys, and I think performance is getting worse. So I want to migrate to the InnoDB-optimised binary UUIDs. This seems easy to implement, but I need to migrate all existing data to the new format, so I was wondering whether you have any tips or tricks for doing this. Any pointers would be nice.

The application is Symfony 3.4 (yes, we will migrate to 4.1), ramsey/uuid 3.8.0, and MariaDB.

ramsey commented 6 years ago

> i think that performance is getting worse

I'm not familiar with how MariaDB handles keys internally, but in MySQL, my understanding is that the primary keys are used as part of all the other indexes/keys on the table, so if you have large primary keys (i.e. 36-character strings), then all your indexes/keys will be quite large. I don't know the full performance impact of this, though.

My recommendation, if you choose to use UUIDs as your primary keys, is to use the binary string form (16-byte string). The trade-off is that it makes it more difficult to do look-ups in your database.

I typically advise the use of a standard unsigned integer primary key, with a unique key on your UUID column. This avoids any performance hit that could occur because of the large UUID primary keys.
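To make the size difference concrete, here's a minimal sketch (Python for illustration; the byte layout is language-agnostic, and in PHP `$uuid->getBytes()` from ramsey/uuid yields the same 16-byte string that `toString()` renders as 36 characters):

```python
import uuid

u = uuid.uuid4()
text_form = str(u)     # canonical 36-character form, what a CHAR(36) column stores
binary_form = u.bytes  # raw 16-byte form, what a BINARY(16) column stores

print(len(text_form))    # 36
print(len(binary_form))  # 16
```

The binary form is less than half the size, which matters because InnoDB copies the primary key into every secondary index entry.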

> So i want to migrate to InnoDB-optimised binary UUID.

This will definitely help from a data distribution perspective. Since the PKs will be sequential with the InnoDB-optimized UUIDs, records won't be scattered, and lookups will be faster.

To switch to the newer UUIDs, you'll need to write a migration script that reads the database records in the order you want them saved (maybe sorted by creation date, or something), generates the InnoDB-optimized UUID, and stores it to the database. I don't think you'll be able to convert your existing UUIDs to the optimized format, since they won't be sequential (unless they're already time-based v1 UUIDs—then you might be able to do it). I would probably create a new BINARY(16) column with a unique index on it for this. Do a pass that saves the new format into that column. Then, you could promote that column to be the PK and drop the older column at a later time.
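The "optimized format" conversion mentioned above is just a byte reordering: the most significant timestamp bits of a v1 UUID are moved to the front so that newer UUIDs sort after older ones. A sketch in Python (illustrative; in ramsey/uuid this reordering is what the `OrderedTimeCodec` produces, so prefer the library's codec over hand-rolling it):

```python
import uuid

def to_ordered_bytes(u: uuid.UUID) -> bytes:
    """Reorder a version-1 UUID's 16 bytes as time_hi, time_mid,
    time_low, then the clock sequence and node unchanged, so that
    byte order follows timestamp order."""
    b = u.bytes
    return b[6:8] + b[4:6] + b[0:4] + b[8:16]

# A made-up v1 UUID just to show the shuffle:
u = uuid.UUID('015a2b3c-4d5e-11e6-8888-000000000000')
print(to_ordered_bytes(u).hex())  # 11e64d5e015a2b3c8888000000000000
```

Note this only produces sequential keys when the input UUIDs are time-based (v1); applying it to random v4 UUIDs reorders bytes but gains nothing.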

This would reduce downtime, but you have to be very careful that any foreign key relationships you have to the original PK are also changed to the new one. This also changes all of your identifiers, so anything that relies on the identifiers not changing will break.

So, I think it's possible to do what you want to do, but there's a lot of risk involved. Plan accordingly. 😄

metalmini commented 6 years ago

Thanks for the reply, much appreciated!

I sense that you would prefer to use int PKs instead of UUIDs? This is something I contemplated as well: just run the integers in a new column alongside the current setup. A data transformer to look up the UUIDs is fairly simple, so to the front end it would seem we still use UUIDs.

Whichever route we decide to take, it will be risky and time-consuming. But in a table with only 20,000 records we already see a huge drop in performance, so I think this will kill the application, slowly but surely.

Thanks again for the reply!

mesilov commented 5 years ago

I think this article has more ideas and solutions

https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-careful-7b2aa3dcb439

pounard commented 5 years ago

Sadly the article now leads to a 404, but I remember reading it a few months ago when we had to decide whether we could use UUIDs in our database. I found a handful of other links on the topic, and the overall answer was: with MySQL and MSSQL, don't; with PostgreSQL, do :)

We do use UUIDs as primary keys on most of our entity tables, and it works like a charm. PostgreSQL is definitely much more efficient with them than MySQL (which, if I remember correctly, doesn't have a UUID type at all).

mesilov commented 5 years ago

I just read a post on ways to scale your database that hit home with me — the author suggests the use of UUIDs (similar to GUIDs) as the primary key (PK) of database tables.

**Reasons UUIDs are Good**

There are several reasons using a UUID as a PK would be great compared to auto-incrementing integers:

  1. At scale, when you have multiple databases containing a segment (shard) of your data, for example a set of customers, using a UUID means that one ID is unique across all databases, not just the one you’re in now. This makes moving data across databases safe. Or in my case where all of our database shards are merged onto our Hadoop cluster as one, no key conflicts.
  2. You can know your PK before insertion, which avoids a round trip DB hit, and simplifies transactional logic in which you need to know the PK before inserting child records using that key as its foreign key (FK)
  3. UUIDs do not reveal information about your data, so they are safer to use in a URL, for example. If I am customer 12345678, it's easy to guess that there are customers 12345677 and 12345679, and this makes for an attack vector. (But see below for a better alternative.)
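Point 2 above can be sketched quickly. With a client-generated UUID, the parent key is known before any round trip, so parent and child rows can be written in one transaction (illustrative Python/SQLite; the table and column names are made up):

```python
import sqlite3
import uuid

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE orders (id BLOB PRIMARY KEY)')
conn.execute('CREATE TABLE order_lines (id BLOB PRIMARY KEY, '
             'order_id BLOB REFERENCES orders(id))')

# The PK is known before insertion -- no SELECT LAST_INSERT_ID() round trip
# is needed before writing the child row that references it.
order_id = uuid.uuid4().bytes
line_id = uuid.uuid4().bytes
with conn:  # one transaction for parent and child
    conn.execute('INSERT INTO orders VALUES (?)', (order_id,))
    conn.execute('INSERT INTO order_lines VALUES (?, ?)', (line_id, order_id))

print(conn.execute('SELECT COUNT(*) FROM order_lines').fetchone()[0])  # 1
```

With auto-increment keys, the child insert must wait for the database to hand back the parent's generated id.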

**Reasons UUIDs May Not be Good**

**Don't be naive**

A naive use of a UUID, which might look like 70E2E8DE-500E-4630-B3CB-166131D35C21, would be to treat it as a string, e.g. varchar(36) — don't do that!!

“Oh, pshaw”, you say, “no one would ever do such a thing.”

Think twice — in two cases of very large databases I have inherited at relatively large companies, this was exactly the implementation. Aside from the 9x cost in size (36 vs. 4 bytes for an int), strings don’t sort as fast as numbers because they rely on collation rules. Things got really bad in one company where they had originally decided to use Latin-1 character set. When we converted to UTF-8 several of the compound-key indexes were not big enough to contain the larger strings. Doh!

**UUIDs are a pain**

Don't underestimate how annoying it is to have to deal with values that are too big to remember or verbalize.

**Planning for real scaling**

If our goal is to scale, and I mean really scale, let's first acknowledge that an int is not big enough in many cases, maxing out at around 2 billion, which needs 4 bytes. We have way more than 2 billion transactions in each of several databases. So bigint is needed in some cases, and that uses 8 bytes. Meanwhile, using one of several strategies, databases like PostgreSQL and SQL Server have a native type that is stored in 16 bytes. So who cares if it's twice as large as bigint or four times bigger than int? It's just a few bytes, right?

**Primary keys get around in normalized databases**

If you have a well-normalized database, as we do at my current company, each use of the key as an FK starts adding up. Not just on disk: during joins and sorts these keys need to live in memory. Memory is getting cheaper, but whether disk or RAM, it's limited, and neither is free. Our database has plenty of intermediate tables that are mainly containers for the foreign keys of others, especially in 1-to-many relations. Accounts have multiple card numbers, addresses, phone numbers, usernames, and all that. For each of these columns in a set of tables with billions of accounts, the extra size of foreign keys adds up fast.
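The "adds up fast" claim is simple arithmetic. Using illustrative numbers (a billion rows, five FK columns), the per-row difference between a CHAR(36) UUID and a 4-byte int compounds like this:

```python
# Back-of-envelope cost of text UUID FKs versus int FKs
# (illustrative figures, before any index overhead):
rows = 1_000_000_000
fk_columns = 5
char36, binary16, int4 = 36, 16, 4  # bytes per key

extra_vs_int = rows * fk_columns * (char36 - int4)
print(extra_vs_int / 10**9)  # 160.0 -> ~160 GB of extra key data
```

Even switching from CHAR(36) to BINARY(16) recovers more than half of that, which is the point of the optimized binary format.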

**It's really hard to sort random numbers**

Another problem is fragmentation — because UUIDs are random, they have no natural ordering, so they cannot be used for clustering. This is why SQL Server has implemented a newsequentialid() function that is suitable for use in clustered indexes, and is probably the right implementation for all UUID PKs. It is probable that there are similar solutions for other databases, certainly PostgreSQL and MySQL, and likely the rest.
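The ordering problem can be demonstrated with time-based (v1) UUIDs: their raw byte layout puts the fast-changing low timestamp bits first, so byte order does not follow creation order, while the reordered "sequential" layout does. A deterministic sketch (Python; the timestamps and node values are made up):

```python
import uuid

def v1_from_timestamp(ts: int) -> uuid.UUID:
    # Build a version-1 UUID from a 60-bit timestamp, with fixed
    # clock sequence and node so only the timestamp varies.
    time_low = ts & 0xFFFFFFFF
    time_mid = (ts >> 32) & 0xFFFF
    time_hi = ((ts >> 48) & 0x0FFF) | 0x1000  # set version 1
    return uuid.UUID(fields=(time_low, time_mid, time_hi, 0x88, 0x88, 0))

def ordered_key(u: uuid.UUID) -> bytes:
    # Sequential layout: high timestamp bits first.
    b = u.bytes
    return b[6:8] + b[4:6] + b[0:4] + b[8:16]

ids = [v1_from_timestamp(ts) for ts in (0x1FFFFFFFF, 0x200000000, 0x200000001)]
# Raw v1 bytes: byte order disagrees with creation order...
print([u.bytes for u in ids] == sorted(u.bytes for u in ids))                 # False
# ...reordered bytes sort exactly by timestamp.
print([ordered_key(u) for u in ids] == sorted(ordered_key(u) for u in ids))   # True
```

This is the same idea behind newsequentialid(): keep the uniqueness of a UUID but make inserts append to the right-hand side of the clustered index instead of landing on random pages.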

**Primary keys should never be exposed, even UUIDs**

A primary key is, by definition, unique within its scope. It is, therefore, an obvious thing to use as a customer number, or in a URL to identify a unique page or row. Don't! I would argue that using a PK in any public context is a bad idea. The original issue with simple auto-incrementing values is that they are easily guessable, as I noted above. Botnets will just keep guessing until they find one. (And they may keep guessing if you use UUIDs, but the chance of a correct guess is astronomically lower.) Arguably it would be a fool's errand to try to guess a UUID; however, Microsoft warns against using newsequentialid() because, by mitigating the clustering issue, it makes the key more guessable.

**My keys will never change (until they do)**

But there's a far more compelling reason not to use any kind of PK in a public context: if you ever need to change keys, all your external references are broken. Think "404 Page Not Found". When would you need to change keys? As it happens, we're doing a data migration this week, because who knew in 2003, when the company started, that we would now have 13 massive SQL Server databases and be growing fast? Never say "never". I have been there and done that, and it has happened several times just for me. It's easy to manage up front; it's way harder to fix when you're counting things in the trillions. Indeed, my current company's context is a perfect example of why UUIDs are needed, why they are costly, and why exposing primary keys is an issue.

**My internal system is external**

I manage the Hadoop infrastructure that receives data nightly from all of our databases. The Hadoop system is linked (bound) to our SQL Server databases, which is fine — we're in the same company. Still, in order to disambiguate colliding sequence keys from our multiple databases, we generate a pseudo-primary-key by concatenating two values: the id (PK) of the customer, which is unique across databases (because we planned that), plus the sequence id of the table rows themselves, which may collide. In so doing we have created a tight, and effectively permanent, binding to years of historical customer data. If those primary keys in the RDBMS change, ours will need to change too, or we'll have some horrifying before-and-after scenario.

**Best of Both? Integers Internal, UUIDs External**

One solution used in several different contexts that has worked for me is, in short, to use both. (Please note: not a good solution — see the note about the response to the original post below.) Internally, let the database manage data relationships with small, efficient, numeric sequential keys, whether int or bigint. Then add a column populated with a UUID (perhaps as a trigger on insert). Within the scope of the database itself, relationships can be managed using the usual PKs and FKs. But when a reference to the data needs to be exposed to the outside world, even when "outside" means another internal system, it must rely only on the UUID. This way, if you ever do have to change your internal primary keys, you can be sure the change is scoped to one database. (Note: this is just plain wrong, as Chris observed.)

We used this strategy at a different company for customer data, just to avoid the "guessable" problem. (Note: avoiding is different from preventing; see below.) In another case, we would generate a "slug" of text (e.g. in blog posts like this one) that would make the URL a little more human-friendly. If we had a duplicate, we would just append a hashed value. Even as a "secondary primary key", a naive use of UUIDs in string form is wrong: use the built-in database mechanisms, where values are stored as compact 16-byte values. Use integers because they are efficient. Use the database implementation of UUIDs in addition, for any external reference, to obfuscate.

Chris Russell responded to the original post on this section, correctly noting two important caveats or errors in logic. First, even exposing a UUID that is effectively an alternate for the actual PK reveals information, and this is especially true when using newsequentialid — don't use UUIDs for security. Second, when the relations of a given schema are internally managed by integer keys, you still have the key-collision problem of merging two databases, unless all keys are doubled … in which case, just use the UUID. So, in reality, the right solution is probably: use UUIDs for keys, and don't ever expose them. The external/internal thing is probably best left to things like friendly-URL treatments, and then (as Medium does) with a hashed value tacked on the end. Thanks Chris!
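The "integers internal, UUIDs external" layout described above can be sketched as a schema (illustrative Python/SQLite; table and column names are made up):

```python
import sqlite3
import uuid

conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,   -- internal key: used for FKs and joins only
        uuid BLOB NOT NULL UNIQUE,  -- external identifier: the only id ever exposed
        name TEXT NOT NULL
    )
""")

public_id = uuid.uuid4().bytes
conn.execute('INSERT INTO customers (uuid, name) VALUES (?, ?)',
             (public_id, 'Alice'))

# External callers look rows up by UUID; the integer id never leaves the DB.
row = conn.execute('SELECT name FROM customers WHERE uuid = ?',
                   (public_id,)).fetchone()
print(row[0])  # Alice
```

As the article's correction notes, this keeps joins cheap but does not solve shard-merge key collisions, and the UUID must not be treated as a security token.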

**References and many thanks**

Thanks to Ruby Weekly (which I still read, wistfully, although Scala is growing on me), Starr Horne's great blog from Honeybadger.io on this topic, the always funny and smart post on Coding Horror by Jeff Atwood, co-founder of Stack Overflow, and naturally a fine question on one of Stack Exchange's sites at dba.stackexchange.com. Also a nice post from MySqlserverTeam, another from theBuild.com, and of course MSDN, which I linked earlier.

**Meta: Why I Blog**

I learned a lot writing about this. I started out reading email on a Sunday afternoon, then came across an interesting post by Starr, which got me thinking his advice might have unintended outcomes. So I googled and learned way more about UUIDs than I knew before, and changed my fundamental understanding of, and disposition toward, how and when to use them. Halfway through writing this, I sent email to the team leads at my company wondering if we had considered one of the topics I discussed. Hopefully we're OK, but I think we may have avoided at least one unexpected surprise in code scheduled for release this week. Note that all of these are entirely selfish reasons :-) Hope you like it, too!