voxpupuli / json-schema

Ruby JSON Schema Validator
MIT License
1.52k stars 242 forks source link

Removed uuidtools #344

Open iainbeeston opened 8 years ago

iainbeeston commented 8 years ago

Right now, we cache schemas, and the key we use to cache them is either the URL we got them from, or if there's no URL, we generate a UUID from the json text of the schema. We're basically using a v5 UUID as a hashing function. What's more, json-schema includes it's own UUID library for doing this.

I've swapped that out, for generating a SHA1 hash instead. It should be fast, although in my own benchmarks, the speed is comparable to the current implementation (real world speed may depend on the size of the schemas being cached). At the very least, I believe it's clearer to use SHA1 (rather than UUID) and it also means we don't need an external ruby file copied into the project.

iainbeeston commented 7 years ago

@RST-J Do you have any thoughts about this? It will allow us to remove another dependency (although this one is included inside json-schema itself)

RST-J commented 7 years ago

Removing the dependency is a good idea I think, we didn't use anything else from that and your replacement fits the purpose. In terms of speed, MD5 is faster, so maybe we could take just this? I mean, there is no security concern here (or am I missing something?). But there is a comment in the removed library which suggests not to do that:

UUID generation using SHA1. Recommended over create_md5.

As I dont't know why and there is no reason given, I'm not sure whether this performance improvement is worthwhile or not. What do you know and think about that?

iainbeeston commented 7 years ago

I'm actually starting to have doubts about this now.

Perhaps we should be using String.hash, which would be an order of magnitude faster. So far as I'm aware String.hash should realistically not have collisions (although I'm struggling to find good evidence of that)

Also, it might be better to generate not just a hex string but a uri with a fake scheme that is immediately identifiable as an internally calculated value (something like "ref://af562bbc" or "internal://539daf54dd"?)

iainbeeston commented 7 years ago

Incidentally, @mjc did suggest using String.hash in a previous pull request (any credit for the idea should go to him!)

iainbeeston commented 7 years ago

I've now changed my mind entirely - I think we should simply use String.hash.to_s(16) (with an appropriate scheme to make it look like a URI).

String.hash in MRI ruby uses siphash-2-4, which should be good enough for our needs and much faster than SHA1

RST-J commented 7 years ago

I think, using a custom scheme is a good idea. It prevents accidental collisions, no matter how improbable they are. And, if String.hash turns out to be not collision resistant enough, again no matter how improbable, one can always circumvent the immediate problem by assigning an URI and we can mitigate the problem if it ever gets reported. As this is internal, I assume no one should rely on the details.