interagent / http-api-design

HTTP API design guide extracted from work on the Heroku Platform API
https://geemus.gitbooks.io/http-api-design/content/en/

Why use UUID? #79

Closed rafaelrabeloit closed 9 years ago

rafaelrabeloit commented 9 years ago

Give each resource an id attribute by default. Use UUIDs unless you have a very good reason not to. Don’t use IDs that won’t be globally unique across instances of the service or other resources in the service, especially auto-incrementing IDs.

Why is that? Why not use an auto-incrementing ID if the resource has to be identified by a resource name in the URL anyway? The resource is identified by its path, I believe, so /orgs/1 is completely unique and completely different from /users/1. With a RESTful, microservice architecture, requiring a globally unique ID for every resource seems too harsh: I can have several distinct databases, each fine-tuned to its kind of service, and guaranteeing uniqueness across the entire system in that situation just doesn't seem to pay off. Is there a specific reason for this?

gkirill commented 9 years ago
  1. Incremental IDs are usually only unique within one database instance. If your DB is sharded, incremental IDs become difficult to use because you may have user/1 in instance 1 and user/1 in instance 2.
  2. Another useful property of UUIDs is that clients can generate them themselves when they create new records, e.g. I could create a new user in JavaScript and set its id right there without needing the DB to assign one.
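
To make the sharding point concrete, here is a minimal sketch, assuming two independent database instances that each keep their own counter. The `Shard` class and its methods are hypothetical stand-ins for illustration, not part of any real database API.

```python
import itertools
import uuid


class Shard:
    """Toy stand-in for one database instance with its own auto-increment counter."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self.rows = {}

    def insert_autoincrement(self, name):
        # The shard assigns the next integer from its own local sequence.
        row_id = next(self._next_id)
        self.rows[row_id] = name
        return row_id

    def insert_with_id(self, row_id, name):
        # The caller (server or client) supplies the ID up front.
        self.rows[row_id] = name
        return row_id


shard_a, shard_b = Shard(), Shard()

# Each shard starts counting at 1, so the first user on each gets the same ID.
id_a = shard_a.insert_autoincrement("alice")
id_b = shard_b.insert_autoincrement("bob")
ids_collide = id_a == id_b

# Random (version 4) UUIDs are generated without any coordination between shards.
uuid_a = shard_a.insert_with_id(uuid.uuid4(), "carol")
uuid_b = shard_b.insert_with_id(uuid.uuid4(), "dave")
uuids_collide = uuid_a == uuid_b
```

Merging the two shards later would need a renumbering step for the integer IDs, while the UUID-keyed rows could be combined as they are.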
rafaelrabeloit commented 9 years ago

Ok, I think I'm starting to get the idea... Thanks!

pedro commented 9 years ago

+1!

Also worth noting: UUIDs provide another layer of defense if you forget to scope a query, which is a pretty common mistake that even big companies make:

http://mashable.com/2015/04/28/twitter-earnings-selerity/

bjeanes commented 9 years ago

An additional non-technical reason is that as a company grows and gets the attention of competitors, numeric IDs can allow people to discover the relative size of your data based on the IDs of newly created records. Analysts often use this method to estimate how much revenue a company earns too. UUIDs aren't the only solution here, but in the context of an API you'd need to use something other than the numeric ID either way, so UUID is a suitable alternative, especially in the context of the other reasons to use them.

crazytonyi commented 9 years ago

+1 for code design that doesn't betray its inner functionality. It's also worth mentioning that UUIDs and GUIDs have a defined standard/algorithm and are not simply a random series of 32 hex digits:

https://en.m.wikipedia.org/wiki/Globally_unique_identifier

alecmev commented 9 years ago
  1. The probability of a collision with UUIDs is not 0, and you can't trust client-side generated IDs (their random generator could be returning the same number on every invocation, for all we know), so you still need to do the appropriate checks (and have a mechanism for denying resource creation if the ID collides, just like with regular integers...)
  2. Performance varies from one DB type to another, but, for example, you still need a good old autoincrement integer in your Postgres (while the situation is even worse in MySQL and MSSQL, from what I've read)
  3. You completely ruin the aesthetics of your URLs: /memberships?user=123&team=456 vs. /memberships?user=1b2d9fb0-d232-49d5-9e60-334bc16d79bc&team=6f6f3d93-df18-495f-8de4-fa29cb2e5835
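
The "appropriate checks" from point 1 could look roughly like the sketch below, which assumes an in-memory SQLite table and a hypothetical `create_user` handler that returns HTTP-style status codes. The table layout and the choice of 201/409/422 are illustrative assumptions, not something prescribed in this thread.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
# The primary key constraint is what actually rejects a duplicate ID.
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT NOT NULL)")


def create_user(client_id, name):
    """Accept a client-supplied ID, but never trust it blindly."""
    try:
        # Parse and re-serialize so differently cased spellings map to one key.
        canonical = str(uuid.UUID(client_id))
    except (ValueError, AttributeError, TypeError):
        return 422  # not a well-formed UUID
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO users (id, name) VALUES (?, ?)", (canonical, name)
            )
    except sqlite3.IntegrityError:
        return 409  # ID already taken: deny creation
    return 201


new_id = str(uuid.uuid4())
first = create_user(new_id, "alice")
duplicate = create_user(new_id.upper(), "mallory")
malformed = create_user("not-a-uuid", "bob")
```

The same uniqueness constraint would be needed with integer keys, so this is extra validation rather than extra machinery.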

You make it sound like it's a no-brainer, when it's not. The advantages are accidental (for example, I don't care about 3rd parties analyzing our well-being using the resource IDs, because, firstly, I don't mind, and secondly, they'll find a way), and there's nothing you can do about the aesthetics, if you have no other unique identifier for a resource.

rafaelrabeloit commented 9 years ago

I was still thinking about it... If you open your API to the public, obviously you can't create the UUIDs in the client, because you can't assume that the UUIDs will be generated in the way you'd expect.

Idk about the database scope, if you consider the distributed case, though.

For all the other arguments, simply append a fixed-length random number to the resource id (and persist it alongside the id in the database, maybe as a composite key). This will mask your id and the size of your database, and will prevent an attacker from iterating over all your entries, e.g.:

id + 6-digit random number: 1 + 005174 = 1005174, as in /user/1005174. Even if the attacker knows the length of the random number, he won't know the number itself. So he wouldn't know the value for id 2 + rand (to iterate), or for id 545684 + rand (to try to guess the database size).
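
A minimal sketch of this scheme, assuming the suffix is stored next to the internal id (a plain dict stands in for the database column here). The function names `make_public_id` and `resolve_public_id` are hypothetical.

```python
import secrets

SUFFIX_DIGITS = 6
SUFFIX_SPACE = 10 ** SUFFIX_DIGITS

# Stand-in for a persisted column: internal id -> random suffix.
suffixes = {}


def make_public_id(internal_id, suffix=None):
    """Build the public ID by appending a fixed-length random suffix."""
    if suffix is None:
        suffix = secrets.randbelow(SUFFIX_SPACE)  # 0..999999
    suffixes[internal_id] = suffix
    return internal_id * SUFFIX_SPACE + suffix


def resolve_public_id(public_id):
    """Return the internal id, or None if the suffix does not match."""
    internal_id, suffix = divmod(public_id, SUFFIX_SPACE)
    if suffixes.get(internal_id) != suffix:
        return None
    return internal_id


# Reproduces the example above: id 1 with suffix 005174.
public_id = make_public_id(1, suffix=5174)
```

With only six digits, a guess for a known internal id succeeds once in a million attempts, so this mostly slows enumeration down and works best alongside rate limiting.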

I don't care about aesthetics, because I believe APIs are for client software and not users, but a 36-char string seems like overkill to me. And the thought that the collision chance increases as more and more entries go into your database makes me uneasy. So, if you think at Google scale, the number of database entries must cause collisions, even with something as improbable as a UUID collision...
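
For a sense of scale, the collision chance can be estimated with the standard birthday-bound approximation. This sketch assumes random (version 4) UUIDs from a sound random source, which carry 122 random bits; it says nothing about a broken generator.

```python
import math

RANDOM_BITS = 122  # a version 4 UUID has 122 random bits; 6 bits are fixed


def collision_probability(n, bits=RANDOM_BITS):
    """Birthday-bound approximation: p ~ 1 - exp(-n(n-1) / (2 * 2**bits))."""
    space = 2.0 ** bits
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-n * (n - 1) / (2.0 * space))


p_billion = collision_probability(10**9)  # one billion IDs
p_half = collision_probability(2.71e18)   # roughly where p reaches one half
```

Under these assumptions, a billion IDs give a probability on the order of 1e-19, and it takes roughly 2.7e18 IDs to reach a 50% chance of a single collision.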

alecmev commented 9 years ago

Regarding aesthetics - yes, they don't matter on the API level, but then you still need routing in your client-side application, right? Let's take a user resource: the service I'm making allows duplicate usernames, while user's email is a private piece of info, so all I'm left with is some unique identifier, and I'd prefer it to be a short number / hash (think Trello), and not 36 char long gibberish.

IMO, this is bad (ignore the product identifier, you get the point): [image: bad]

And this is good: [image: good]

geemus commented 9 years ago

In our case at least we expect all the uuids to be generated by us, server side, so the client concerns did not matter. I also agree that not leaking information about how many things you might have is pretty incidental, not really important for most use cases (but matters to some people). Similarly, preventing an attacker from iterating is nice-to-have, but ideally you have enough other protections in place that you would be ok even if they knew keys, so again incidental benefit.

The biggest reason for us, I think, is that UUIDs make it more feasible than integers to shard later as one grows. And if you don't do it sooner rather than later, the pain/difficulty of having to switch is pretty bad. So the hope was to head that issue off at the pass and just start with something that should work into the future. Even though each service might be able to have its own IDs, any individual service might still grow to the size where sharding becomes necessary, so simply dividing things up might delay this issue, but I don't think it would be able to prevent it for sure.

The aesthetics issue is one that bothers me as well. I don't particularly like the way they look, and they are quite long, which in some cases becomes concretely problematic rather than just ugly, for instance due to a somewhat small limit on the total size of a query string (this can be worked around by doing a POST with the info in the body, but that still seems not-great). I still felt that using something that should scale more easily (as well as having some of these other nice properties) outweighed the unaesthetic parts in a context where IDs will mostly be written/created by computers rather than humans. If this were being exposed more in web pages, it might well be another story.

I suppose if you feel that it is likely that your dataset would never need to grow beyond the bounds of a single database it would lessen some of these pressures, but I was unwilling to make that bet.

bjeanes commented 9 years ago

Instagram have an interesting blog post about their ID generation. Instagram IDs are shorter, (subjectively) more aesthetically pleasing, and shard ready.

http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram

That might be an appropriate alternative for those seeking to avoid UUIDs.
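
As a rough illustration of that style of ID, the sketch below packs a millisecond timestamp, a logical shard id, and a per-shard sequence into one 64-bit integer. The 41/13/10 bit split is how the layout in the linked post is recalled here and should be checked against it; `EPOCH_MS` and the function names are hypothetical.

```python
TIMESTAMP_BITS = 41  # milliseconds since a custom epoch
SHARD_BITS = 13      # logical shard id
SEQUENCE_BITS = 10   # per-shard counter, wrapped to 10 bits

EPOCH_MS = 1_300_000_000_000  # hypothetical custom epoch, not Instagram's


def make_id(now_ms, shard_id, sequence):
    """Pack timestamp, shard and sequence into one integer, timestamp first."""
    elapsed_ms = now_ms - EPOCH_MS
    if not 0 <= elapsed_ms < (1 << TIMESTAMP_BITS):
        raise ValueError("timestamp out of range")
    if not 0 <= shard_id < (1 << SHARD_BITS):
        raise ValueError("shard id out of range")
    seq = sequence % (1 << SEQUENCE_BITS)
    return (
        (elapsed_ms << (SHARD_BITS + SEQUENCE_BITS))
        | (shard_id << SEQUENCE_BITS)
        | seq
    )


def split_id(value):
    """Recover (elapsed_ms, shard_id, sequence) from a packed ID."""
    seq = value & ((1 << SEQUENCE_BITS) - 1)
    shard_id = (value >> SEQUENCE_BITS) & ((1 << SHARD_BITS) - 1)
    elapsed_ms = value >> (SHARD_BITS + SEQUENCE_BITS)
    return elapsed_ms, shard_id, seq


example_id = make_id(EPOCH_MS + 1_387_263_000, shard_id=1341, sequence=5001)
later_id = make_id(EPOCH_MS + 1_387_263_001, shard_id=0, sequence=0)
```

Because the timestamp occupies the high bits, IDs sort by creation time, and the shard that owns a row can be read straight from the ID.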

frankieroberto commented 9 years ago

Just to chip in here, I also find UUIDs to be pretty ugly, and their primary use case (allowing distributed clients to generate IDs with a very low chance of collisions) isn't one that I've really come across.

UUIDs imply (in the JSON at least) that they're strings, but they're actually 128-bit values. Whilst many databases / storage engines support UUIDs natively (e.g. Postgres does, but SQLite doesn't), that's a bit less common than support for integers, and many users of your API might just store them as strings, which is probably ok, but might not scale as well?
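
The string-versus-value distinction can be seen with Python's standard `uuid` module. This is only a sketch of the representations involved, using one of the example IDs quoted earlier in the thread.

```python
import uuid

u = uuid.UUID("1b2d9fb0-d232-49d5-9e60-334bc16d79bc")

as_text = str(u)    # canonical form: 32 hex digits plus 4 hyphens
as_bytes = u.bytes  # the underlying 128-bit value as raw bytes
as_int = u.int      # the same value as a single integer

# The compact form round-trips back to the same UUID.
round_trip = uuid.UUID(bytes=as_bytes)
```

Storing the 16-byte form (or a native UUID column where the database has one) takes less than half the space of the 36-character text form.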

On the other hand, 64-bit integers can't always be parsed as numbers in JavaScript environments, which only represent integers exactly up to 53 bits, so Twitter always includes a string version with a _str suffix (see https://dev.twitter.com/overview/api/twitter-ids-json-and-snowflake ).
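
The 53-bit limit can be reproduced outside JavaScript, since Python floats are the same IEEE 754 doubles that JavaScript uses for its numbers. The sketch below simulates a JavaScript-style parse by decoding JSON integers as floats; the `id` / `id_str` field names follow the Twitter convention mentioned above, and the payload itself is made up.

```python
import json

big_id = 2**53 + 1  # first integer a double cannot represent exactly

payload = json.dumps({"id": big_id, "id_str": str(big_id)})

# Decode every JSON integer as a double, as a JavaScript parser would.
js_like = json.loads(payload, parse_int=float)

from_number = int(js_like["id"])      # silently rounded to a neighbour
from_string = int(js_like["id_str"])  # string field keeps every digit
```

Here `from_number` comes back off by one while `from_string` matches the original, which is the reason for shipping the string twin.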

geemus commented 8 years ago

Yeah, I was about to mention snowflake/twitter as another case.

Distributed ID generation is definitely not part of why we wanted unique stuff. It was mostly future-proofing and a means of having consistency; the other stuff is more peripheral. We chose it over snowflake/etc at least in part because we use Postgres and so we already had easy native support.

They are ugly though, for sure. I guess I'm just on the fence about whether that is a strong enough reason to do something more complicated, since they will mostly only be "seen" by computers. I suppose it depends on if the API is then revealed in user facing APIs, where uuids would be more unfortunate.