Closed rafaelrabeloit closed 9 years ago
Ok, I think I'm starting to get the idea... Thanks!
+1!
Also worth noting: UUIDs provide another layer of defense if you forget to scope a query, which is a pretty common mistake that even big companies make:
An additional non-technical reason is that as a company grows and gets the attention of competitors, numeric IDs can allow people to discover the relative size of your data based on the IDs of newly created records. Analysts often use this method to estimate how much revenue a company earns, too. UUIDs aren't the only solution here, but in the context of an API you'd need to use something other than the numeric ID either way, so a UUID is a suitable alternative, especially given the other reasons to use them.
+1 for code design that doesn't betray its inner functionality. It's also worth mentioning that UUIDs and GUIDs have a defined standard/algorithm and are not simply a random series of 32 hex digits:
/memberships?user=123&team=456
vs. /memberships?user=1b2d9fb0-d232-49d5-9e60-334bc16d79bc&team=6f6f3d93-df18-495f-8de4-fa29cb2e5835
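On the point that UUIDs follow a defined standard rather than being 32 arbitrary hex digits: the two IDs in the URL above happen to show it. A minimal sketch using Python's standard `uuid` module (the variable names are just for illustration) reads the version and variant markers back out of them:

```python
import uuid

# The two IDs from the example URL above.
user_id = uuid.UUID("1b2d9fb0-d232-49d5-9e60-334bc16d79bc")
team_id = uuid.UUID("6f6f3d93-df18-495f-8de4-fa29cb2e5835")

for u in (user_id, team_id):
    # The version sits in the first hex digit of the third group,
    # the variant in the top bits of the fourth group.
    print(u.version, u.variant)

# A freshly generated random UUID carries the same markers,
# which leaves 122 of the 128 bits for randomness.
fresh = uuid.uuid4()
print(fresh.version)
```

Both are version 4 (random) UUIDs, so six of the 128 bits are fixed by the format.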
You make it sound like it's a no-brainer, when it's not. The advantages are incidental (for example, I don't care about 3rd parties analyzing our well-being using the resource IDs, because, firstly, I don't mind, and secondly, they'll find a way), and there's nothing you can do about the aesthetics if you have no other unique identifier for a resource.
I was still thinking about it... If you open your API to the public, obviously you can't create the UUIDs in the client, because you can't assume that the UUIDs will be generated in the way you'd expect.
Idk about the database scope, if you consider the distributed case, though.
For all the other arguments, you could simply append a fixed-length random number to the resource id (and persist it alongside the id in the database, maybe as a composite key). This masks your id and the size of your database, and prevents an attacker from iterating over all your entries, e.g.:
id + 6-digit random number: 1 + 005174 = 1005174, as in /user/1005174. Even if the attacker knows the length of the random number, he won't know the number itself. So he can't derive id 2 + rand (to iterate), or id 545684 + rand (to guess the database size).
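A rough sketch of that suffix scheme, assuming the suffix is generated server side and stored with the row. The function names `mask_id` and `unmask_id` are made up for illustration:

```python
import secrets

SUFFIX_DIGITS = 6  # length of the random suffix, as in the example above


def mask_id(db_id: int) -> tuple[str, str]:
    """Return (public_id, suffix); the suffix must be persisted with the row."""
    suffix = f"{secrets.randbelow(10 ** SUFFIX_DIGITS):0{SUFFIX_DIGITS}d}"
    return f"{db_id}{suffix}", suffix


def unmask_id(public_id: str) -> tuple[int, str]:
    """Split a public id into (db_id, suffix) so the stored suffix can be checked."""
    return int(public_id[:-SUFFIX_DIGITS]), public_id[-SUFFIX_DIGITS:]


public_id, stored_suffix = mask_id(1)
print(public_id, unmask_id(public_id))
```

A lookup would fetch the row by `db_id` and reject the request unless the suffix matches the stored one. With 6 digits a blind guess succeeds one time in a million per attempt, so this slows enumeration rather than ruling it out.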
I don't care about aesthetics, because I believe APIs are for client software and not users, but a 36-char string seems like overkill to me. And the thought that the chance of a collision increases as more and more entries are added to the database makes me uneasy. If you think at Google scale, the number of database entries must eventually cause collisions, even with something as improbable as a UUID collision...
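To put a rough number on the collision worry, here is a back-of-the-envelope sketch using the standard birthday approximation. It assumes version 4 UUIDs (122 random bits) drawn from a good random source; the function name is illustrative:

```python
import math

RANDOM_BITS = 122  # a version 4 UUID has 122 random bits


def collision_probability(n: int, bits: int = RANDOM_BITS) -> float:
    """Birthday approximation: p ~ 1 - exp(-n(n-1) / (2 * 2**bits))."""
    return -math.expm1(-n * (n - 1) / (2 * 2 ** bits))


for n in (10 ** 9, 10 ** 12, 2 ** 61):
    print(n, collision_probability(n))
```

Under those assumptions, a trillion IDs gives a probability of any collision at all on the order of 1e-13. It only approaches even odds somewhere around 2**61 IDs, which is beyond what any single table is likely to hold.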
Regarding aesthetics - yes, they don't matter on the API level, but then you still need routing in your client-side application, right? Let's take a user resource: the service I'm making allows duplicate usernames, while user's email is a private piece of info, so all I'm left with is some unique identifier, and I'd prefer it to be a short number / hash (think Trello), and not 36 char long gibberish.
IMO, this is bad (ignore the product identifier, you get the point):
And this is good:
In our case at least we expect all the uuids to be generated by us, server side, so the client concerns did not matter. I also agree that not leaking information about how many things you might have is pretty incidental, not really important for most use cases (but matters to some people). Similarly, preventing an attacker from iterating is nice-to-have, but ideally you have enough other protections in place that you would be ok even if they knew keys, so again incidental benefit.
The biggest reason for us, I think, is that UUIDs make it more feasible to shard later, as one grows, than integers do. And if you don't do it sooner rather than later, the pain of having to switch later is pretty bad. So the hope was to head off that issue at the pass and just start with something that should work into the future. Even though each service might be able to have its own IDs, any of the individual services might still grow to the size where sharding would become necessary, so simply dividing things up might delay the issue, but I don't think it would prevent it for sure.
The aesthetics issue is one that bothers me as well. I don't particularly like the way they look and they are quite long. Which in some cases becomes concretely problematic, rather than just ugly, for instance due to a somewhat small limit on total size of query string (though this can be worked around by doing POST with this info in the body, it still seems not-great). I still felt using something that should be able to scale more easily (as well as having some of these nice other properties) out-weighed un-aesthetic things in a context that will mostly be written/created by computers rather than humans. I think if this were being exposed more in web pages it might well be another story.
I suppose if you feel that it is likely that your dataset would never need to grow beyond the bounds of a single database it would lessen some of these pressures, but I was unwilling to make that bet.
Instagram have an interesting blog post about their ID generation. Instagram IDs are shorter, (subjectively) more aesthetically pleasing, and shard ready.
http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram
That might be an appropriate alternative for those seeking to avoid UUIDs.
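For a feel of what such an ID looks like: as I understand the post, each 64-bit ID packs 41 bits of milliseconds since a custom epoch, 13 bits of logical shard ID, and 10 bits of a per-shard sequence. The post does this inside Postgres. The sketch below only illustrates the bit layout in Python, and the epoch value and function names are assumptions of mine:

```python
import time
from typing import Optional

# Hypothetical custom epoch in milliseconds (2011-01-01 UTC); pick your own.
EPOCH_MS = 1_293_840_000_000


def make_id(shard_id: int, sequence: int, now_ms: Optional[int] = None) -> int:
    """Pack 41 bits of time, 13 bits of shard and 10 bits of sequence."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    timestamp = (now_ms - EPOCH_MS) & ((1 << 41) - 1)
    return (timestamp << 23) | ((shard_id % 8192) << 10) | (sequence % 1024)


def split_id(value: int) -> tuple[int, int, int]:
    """Recover (ms since epoch, shard id, sequence) from an ID."""
    return value >> 23, (value >> 10) & 0x1FFF, value & 0x3FF


example = make_id(shard_id=1341, sequence=5001, now_ms=EPOCH_MS + 1_387_263_000)
print(example, split_id(example))
```

Because the timestamp occupies the high bits, IDs sort roughly by creation time, and the shard can be read straight out of the ID without a lookup.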
Just to chip in here, I also find UUID to be pretty ugly, and their primary use case (allowing distributed clients to generate IDs with a very low chance of collisions) isn't one that I've really come across.
UUIDs imply (in the JSON at least) that they're strings, but they're actually 128 bit values, and whilst many databases / storage engines support UUIDs natively (e.g. Postgres does, but SQLite doesn't), it's a bit less common than storing integers, and many users of your API might just store them as strings, which is probably ok, but might not scale as well?
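To make the string-versus-128-bit distinction concrete, a small sketch with Python's standard `uuid` module (the sample value is borrowed from earlier in the thread):

```python
import uuid

u = uuid.UUID("1b2d9fb0-d232-49d5-9e60-334bc16d79bc")

as_text = str(u)    # canonical 36-character form, as it would appear in JSON
as_bytes = u.bytes  # the same value as 16 raw bytes
as_int = u.int      # or as a single integer of at most 128 bits

print(len(as_text), len(as_bytes))

# Round-tripping through the compact form loses nothing.
assert uuid.UUID(bytes=as_bytes) == u
assert uuid.UUID(int=as_int) == u
```

Storing the text form costs 36 characters per key against 16 bytes for the binary form, which is where the concern about scaling comes from.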
On the other hand, 64 bit integers can't always be parsed in javascript environments as an integer if they're above 53 bits, so Twitter always includes a string version with a _str suffix (see https://dev.twitter.com/overview/api/twitter-ids-json-and-snowflake).
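The 53-bit limit is easy to reproduce outside a browser, since Python's `float` is the same IEEE 754 double that JavaScript uses for numbers. A minimal sketch, with a made-up ID value and the `id` / `id_str` pairing mirroring the `_str` convention mentioned above:

```python
import json

big_id = 2 ** 53 + 1  # smallest positive integer a double cannot represent

# Converting to a double silently rounds to a neighbouring value.
as_double = float(big_id)
print(int(as_double) == big_id)  # False

# Sending the ID twice, once as a number and once as a string,
# lets double-only clients read the exact value from the string.
payload = json.dumps({"id": big_id, "id_str": str(big_id)})
decoded = json.loads(payload)
print(decoded["id_str"])
```

Python's own `json` module keeps the integer exact. The string field is there for clients whose JSON parser turns every number into a double.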
Yeah, I was about to mention snowflake/twitter as another case.
Distributed id generation is definitely not part of why we wanted unique stuff. It was mostly future-proofing and a means of having consistency; the other stuff is more peripheral. We chose it over snowflake/etc at least in part because we use postgres and so we already had easy native support.
They are ugly though, for sure. I guess I'm just on the fence about whether that is a strong enough reason to do something more complicated, since they will mostly only be "seen" by computers. I suppose it depends on if the API is then revealed in user facing APIs, where uuids would be more unfortunate.
Why is that? Why not use an auto-incrementing ID if the resource has to be identified by a resource name in the URL anyway? The resource is identified by its path, I believe. So /orgs/1 is completely unique and completely different from /users/1. With a RESTful & microservice architecture, requiring a globally unique ID for a resource seems too harsh, because I can have several distinct databases, each fine-tuned to its kind of service, and guaranteeing uniqueness across the entire system in that situation... It just doesn't seem to pay off. Is there a specific reason for this?