jeroen / openssl

OpenSSL bindings for R

Port libcrypto to openssl #2

Closed Ironholds closed 9 years ago

Ironholds commented 9 years ago

What it says on the tin. Starting point is an example (say, MD5) in pure C; I'll port the rest after that.

jeroen commented 9 years ago

It seems like in your current implementation you assume the input is a string. Perhaps it would be more general if the input were a raw() vector? We could then write a simple wrapper that uses charToRaw at the R level to make it work for strings, but it would also work for other types. For example, digest uses serialize() to convert arbitrary objects to raw.
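
Roughly like this, say (the .Call routine name here is only a placeholder, not an actual binding):

```r
# Sketch only: "R_md5_raw" is a placeholder for a C entry point that hashes
# a raw vector with libcrypto.
md5_raw <- function(x) {
  stopifnot(is.raw(x))
  .Call("R_md5_raw", x)
}

# Convenience wrapper for strings: convert to raw bytes first.
md5_string <- function(x) {
  md5_raw(charToRaw(x))
}

# For arbitrary R objects, serialize() to a raw vector (this is what digest does).
md5_object <- function(x) {
  md5_raw(serialize(x, connection = NULL))
}
```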

Ironholds commented 9 years ago

I mean, we can - would it make a difference, though? That is, would we expect a different hash for the raw bytes 66 6f 6f versus the hash for the string "foo"? If so, I'd rather we find some way of making it /optional/, default off (or just stick with hard-converting everything to character): digest's default serialization means the hashes it outputs for a given string are not actually comparable to the results from any other implementation.
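
For example (assuming I have digest's serialize argument right):

```r
library(digest)

# Default behaviour: the R object is serialized first, so the result is not
# comparable to the md5 of "foo" produced by any other tool.
digest("foo", algo = "md5")

# With serialize = FALSE the bytes of the string itself are hashed, which
# matches e.g. `echo -n foo | md5sum` -> acbd18db4cc2f85cedef654fccc4a4d8
digest("foo", algo = "md5", serialize = FALSE)
```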

jeroen commented 9 years ago

Ok, I'm working on an initial version and then we can go from there :)

jeroen commented 9 years ago

Ok, I got it to work. I took the advice from the openssl documentation:

> The EVP digest routines are a high level interface to message digests, and should be used instead of the cipher-specific functions.

So there is now a single digest function rather than algorithm-specific bindings.
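
At the R level this boils down to one function with an algo argument, roughly (the .Call routine name is just illustrative):

```r
# One generic digest function; the C side looks up the algorithm by name via
# EVP_get_digestbyname() instead of calling MD5(), SHA1(), etc. directly.
digest <- function(x, algo = "md5") {
  if (is.character(x)) {
    stopifnot(length(x) == 1)
    x <- charToRaw(x)
  }
  stopifnot(is.raw(x))
  .Call("R_digest_raw", x, algo)  # placeholder routine name
}
```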

Ironholds commented 9 years ago

I've looked at the new version; I'd suggest changing the name to something like crypto to match the underlying module (and to avoid running into name collisions: "Just use digest." "It doesn't do that." "No, the function digest that isn't in the package digest, not the function digest that is in the package digest.").

It's certainly more cryptographically secure than the digest implementation, but the entire reason for the C++ looping in my approach is the cost of function calls and loops in R; again, it's possible to get an order-of-magnitude speedup between equivalent vectorised operations by putting them in C(++) rather than relying on vectorisation through the apply family. Without that C-based vectorisation, all we've done is create something more cryptographically secure, but just as slow.

jeroen commented 9 years ago

I don't think cryptographic security matters for digest hashes because there is no randomness; it's just a deterministic algorithm.

We can easily port more of it to C, but I highly doubt it will make a difference, and definitely not an order of magnitude. Most of the computation goes into the actual hashing algorithm, not the R function calls. Note that there is no for loop in the R code either. We should definitely benchmark before making such claims.

jeroen commented 9 years ago

When you use vapply, you basically already let R do the looping in C.
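
I.e. the R side is essentially just something like this (hash_one standing in for the scalar C binding):

```r
# hash_one is a stand-in for the scalar binding that hashes a single string.
hash_one <- function(x, algo) .Call("R_digest_string", x, algo)

# vapply does the iteration in C code inside R itself; there is no explicit
# R-level for loop.
hash <- function(x, algo = "md5") {
  vapply(x, hash_one, character(1), algo = algo, USE.NAMES = FALSE)
}
```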

Ironholds commented 9 years ago

So it doesn't have to call the function from R each time and cope with that overhead? Let's definitely benchmark it: I compared cryptohash and digest here and found a substantial delta. It'd be interesting to compare cryptohash and openssl to see how much of that is down to digest having an inefficient implementation, and how much is down to the overhead of looping within R.
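
Something along these lines, say (the openssl call is a placeholder for whatever we end up exporting):

```r
library(microbenchmark)

# 1000 random 20-character strings to hash.
x <- replicate(1000, paste(sample(letters, 20, TRUE), collapse = ""))

microbenchmark(
  digest  = vapply(x, digest::digest, character(1), algo = "md5",
                   serialize = FALSE, USE.NAMES = FALSE),
  openssl = openssl::md5(x),   # placeholder for whatever the export ends up being
  times = 100
)
```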

jeroen commented 9 years ago

Maybe so. There is some overhead in calling R functions, but I'm not sure how it compares to the hashing time. Let's test and see.

jeroen commented 9 years ago

Also, if it really is just porting the loop to C++, that does not quite justify a new package or JSS paper, of course. If performance were an issue, Dirk could easily add support for vectorization in the digest package, and then our argument is gone.

jeroen commented 9 years ago

Okay, I ported the looping to C and renamed the functions to hash and rawhash, respectively. According to my benchmark this version is now faster than both digest and cryptohash.
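
Usage is then roughly:

```r
# hash() returns the usual hex notation; rawhash() returns the raw bytes.
hash("foo", algo = "md5")
#> "acbd18db4cc2f85cedef654fccc4a4d8"

rawhash(charToRaw("foo"), algo = "md5")
#> ac bd 18 db 4c c2 f8 5c ed ef 65 4f cc c4 a4 d8
```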

Ironholds commented 9 years ago

Cool! Definitely faster than digest; it seems slightly faster than, but roughly equivalent to, cryptohash with more extensive benchmarking (using microbenchmark and 100 iterations rather than 10). I'll expand the documentation and add some internal tests to make sure it treats lists/data frames correctly and warns on NA/NULL entries.

Ironholds commented 9 years ago

(What's the use case for rawhash?)

jeroen commented 9 years ago

I realize you already made unit tests, but I was thinking about this, and maybe it is more elegant if we do not export the hash function, and instead expose specific wrappers to the user: md5(), sha1(), etc.?

The benefits of this would be that the names are less ambiguous than hash or digest, and the user cannot specify a wrong algo argument. Providing a simple md5(mystring) might also make it easier to find and use our package. I remember it took me a long time to find the digest package because I was looking for an md5 function.
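
The wrappers themselves would be trivial - roughly:

```r
# Thin, exported wrappers around the generic hash() function (which would no
# longer be exported itself), one per algorithm.
md5    <- function(x, ...) hash(x, algo = "md5", ...)
sha1   <- function(x, ...) hash(x, algo = "sha1", ...)
sha256 <- function(x, ...) hash(x, algo = "sha256", ...)
sha512 <- function(x, ...) hash(x, algo = "sha512", ...)
```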

jeroen commented 9 years ago

Also, can you explain the purpose of rand_val? I really don't understand how you would use that.

Ironholds commented 9 years ago

I think algorithm-specific functions make total sense :). Tweaking the tests won't take me very long. Do you think we should do it for the whole range of algorithms or just the obvious ones (MD5, SHA-1, SHA-256 and SHA-512)? I can see an argument for not building in support for some of the weaker ones, to avoid encouraging their use.

rand_val is for the following use case: you have an identifier in a dataset, and it's sensitive - either because it contains PII, or because it's used multiple times in the system, or both. So you want to be able to hash it (retaining the uniqueness of the identifier) while also including some hard-to-replicate offset (to avoid anyone being able to join multiple released datasets with the same identifier together). I've run into this a few times in the last couple of months with session-reconstruction datasets: they consist entirely of [uuid], [timestamp], which is fine until you remember that some of the events represented also appear in other datasets, and realise someone could combine the sets to de-anonymise parts of it, even if the UUIDs are hashed.
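
Concretely, the scenario is something like this (object and column names are made up, and I've written the offset out by hand just to illustrate what rand_val is for):

```r
# A hard-to-replicate, per-release offset. Without knowing it, the same UUID
# hashes to a different value in each released dataset, so the releases cannot
# be joined on the hashed identifier.
offset <- "release-specific secret value"

sessions$uuid <- hash(paste0(sessions$uuid, offset), algo = "sha256")
```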

jeroen commented 9 years ago

I think the term for that is a salt?

Ironholds commented 9 years ago

Yep; better way of putting it :). I'll tweak it this afternoon (heading out for the next... 4 hours, looks like).

Ironholds commented 9 years ago

Okay, renamed rand_val to salt, and also split out the error checking/handling/salt generation into its own function (prepping for us turning hash into md5 et al). Two open questions: (1) what's the rawhash use case, and (2) do you think we should support the full gamut of openssl hashing algorithms, or just the obvious ones (md5, sha1, sha256, sha512)?

jeroen commented 9 years ago

rawhash is the natural interface, because a hash is a set of bytes. It is actually quite inefficient to put it in a string of ASCII hex characters. In most applications people translate to a string for convenience, but in some cases you need the actual bytes.
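
For example, the hex string is just a printable rendering of the bytes, so nothing is lost by returning raw:

```r
bytes <- rawhash(charToRaw("foo"), algo = "md5")   # 16 raw bytes

# The usual hex notation is easy to derive from the raw vector...
hex <- paste(as.character(bytes), collapse = "")   # "acbd18db..."

# ...but for e.g. writing the hash into a binary format, or comparing against
# bytes read from a file, you want the raw vector itself.
writeBin(bytes, "checksum.bin")
```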

I think we should just expose all of the algorithms that are supported by openssl (unless they are likely to be removed in upcoming versions of libcrypto). Everyone knows that some of them are no longer safe, but it might still be useful to be able to check them, for example if a user is working with some old dataset which includes md4 hashes. Let's just use a single manual page for all of them to avoid bloating the manual too much.

Ironholds commented 9 years ago

Makes sense. I'm kind of unsure about "everyone"; the audience here is any researcher, not cryptographic researchers.

jeroen commented 9 years ago

Done.

jeroen commented 9 years ago

It now looks like this: http://jeroenooms.ocpu.io/openssl/man/hash/text
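
So the user-facing calls now look roughly like this:

```r
library(openssl)

md5("foo")               # hex string, e.g. "acbd18db4cc2f85cedef654fccc4a4d8"
sha256(c("foo", "bar"))  # vectorised over character input
md5(charToRaw("foo"))    # raw bytes in, raw hash bytes out
```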

Ironholds commented 9 years ago

Hah! Just pulled to implement this. Cool :)