Closed Ironholds closed 9 years ago
It seems like in your current implementation you assume the input is a string. Perhaps it would be more general if the input were a `raw()` vector? We can then write a simple wrapper that uses `charToRaw` at the R level to make it work for strings, but it would also work for other types. For example, digest uses `serialize()` to convert an arbitrary object to raw.
I mean, we can - would it make a difference, though? That is, would we expect a different hash for 66 6f 6f versus the hash for "foo"? If so, I'd rather we find some way of making it /optional/, default off (or just stick with hard-converting everything to character): digest's default serialization ensures that hashes output for a given string run through digest are not actually comparable to the results from any other implementation.
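For concreteness, here is the question in Python's hashlib (which wraps the same OpenSSL digests): in any standard implementation, hashing the raw bytes 66 6f 6f and hashing the encoded string "foo" produce the same digest, because a hash function ultimately only ever consumes bytes.

```python
import hashlib

# The raw bytes 66 6f 6f and the UTF-8 encoding of "foo" are the
# same bytes, so they hash to the same digest.
raw_bytes = bytes([0x66, 0x6F, 0x6F])
from_string = "foo".encode("utf-8")

assert raw_bytes == from_string
print(hashlib.md5(raw_bytes).hexdigest())   # acbd18db4cc2f85cedef654fccc4a4d8
print(hashlib.md5(from_string).hexdigest())  # same digest
```

Divergence only appears when an implementation serializes the object before hashing (as digest does by default), which is exactly what makes its output incomparable to other tools.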
Ok, I'm working on an initial version and then we can go from there :)
Ok, I got it to work. I took the advice from the openssl documentation:

The EVP digest routines are a high level interface to message digests, and should be used instead of the cipher-specific functions.

So there is now a single `digest` function rather than algorithm-specific bindings.
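The shape of that single-function design can be sketched in Python, since hashlib's `new()` is itself a generic entry point over OpenSSL's EVP digests - the digest is looked up by name at runtime rather than through per-algorithm bindings (the `digest` name and default here are illustrative, not the package's actual signature):

```python
import hashlib

def digest(data: bytes, algo: str = "sha256") -> str:
    """One generic digest function; the algorithm is selected by name,
    EVP-style, instead of having md5()/sha1()/... bindings."""
    h = hashlib.new(algo)  # raises ValueError for unknown algorithms
    h.update(data)
    return h.hexdigest()

print(digest(b"foo", "md5"))
print(digest(b"foo", "sha1"))
```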
I've looked at the new version; I'd suggest changing the name to something like `crypto` to match the underlying module (and to avoid running into name collisions: "Just use digest." "It doesn't do that." "No, the function digest that isn't in the package digest, not the function digest that is in the package digest").
It's certainly more cryptographically secure than the digest implementation, but the entire reason for the C++ looping in my approach is the cost of function calling and looping in R; again, it's possible to get an order-of-magnitude increase in speed between equivalent vectorised operations by putting them in C(++), rather than relying on vectorisation through the apply family. Without that C-based vectorisation, all we've done is create something more cryptographically secure, but just as slow.
I don't think cryptographic security matters for digest hashes, because there is no randomness; it's just a deterministic algorithm.
We can easily port some stuff further to C, but I highly doubt that it will make a difference, and definitely not an order of magnitude. Most of the computing goes into the actual hashing algorithm, not so much the R function call. Note that there is no `for` loop in the R code either. We should definitely benchmark before making such claims.
When you use `vapply`, you basically already let R do the looping in C.
So it's not having to call the function from R each time, and cope with the overhead of that? Let's definitely benchmark it: I compared cryptohash and digest here and found a substantial delta. It'd be interesting to compare cryptohash and openssl and see how much of that is down to digest having an inefficient implementation, and how much is down to the overhead of looping within R.
Maybe so. There is some overhead in calling R functions, but I'm not sure how it compares to the hashing time. Let's test and see.
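One way to separate the two costs being debated - per-call overhead versus actual hashing time - is to hash the same total number of bytes either one element at a time or in a single call. A rough sketch in Python (timings will vary by machine; this only illustrates the shape of the benchmark, not the R packages' performance):

```python
import hashlib
import timeit

data = [b"x" * 100 for _ in range(10_000)]

# One hash call per element: pays the per-call overhead 10,000 times.
per_element = timeit.timeit(
    lambda: [hashlib.md5(d).hexdigest() for d in data], number=10)

# One hash call over the concatenated input: same bytes, one call.
blob = b"".join(data)
single_call = timeit.timeit(
    lambda: hashlib.md5(blob).hexdigest(), number=10)

print(f"per-element: {per_element:.3f}s, single call: {single_call:.3f}s")
```

If the two numbers are close, the hashing dominates and porting the loop buys little; a large gap would support the looping-overhead argument.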
Also, if it is really just porting the loop to C++, that does not quite justify a new package or JSS paper, of course. If performance were an issue, Dirk could easily add support for vectorization in the `digest` package, and then our argument is gone.
Okay, I ported the looping to C and renamed the functions to `hash` and `rawhash`, respectively. According to my benchmark this version is now faster than both digest and cryptohash.
Cool! Definitely faster than digest; seems slightly-faster-but-roughly-equivalent to cryptohash with more extensive benchmarking (using microbenchmark and 100 iterations, rather than 10). I'll expand the documentation and add some internal tests to make sure it treats lists/dfs correctly and warns on NA/NULL entries.
(What's the use case for rawhash?)
I realize you already made unit tests, but I was thinking about this, and maybe it is more elegant if we do not export the `hash` function, and instead expose specific wrappers to the user: `md5()`, `sha1()`, etc.? The benefits would be that the names are less ambiguous than `hash` or `digest`, and the user cannot specify a wrong `algo` argument. But providing a simple `md5(mystring)` might also make our package easier to find and use. I remember it took me a long time to find the digest package because I was looking for an `md5` function.
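The proposed wrappers would be thin shims over the one internal routine - something like this Python sketch (the `_hash` helper name is hypothetical; in the package the internals would of course be the R/C `hash` function):

```python
import hashlib

def _hash(data: bytes, algo: str) -> str:
    # Single internal routine; never exported.
    return hashlib.new(algo, data).hexdigest()

# Exported, unambiguous, nothing for the user to misspell:
def md5(data: bytes) -> str:
    return _hash(data, "md5")

def sha1(data: bytes) -> str:
    return _hash(data, "sha1")

def sha256(data: bytes) -> str:
    return _hash(data, "sha256")

print(md5(b"foo"))
```

Because each wrapper hard-codes its algorithm, an invalid `algo` argument simply cannot occur at the user-facing layer.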
Also, can you explain the purpose of `rand_val`? I really don't understand how you would use that.
I think algorithm-specific functions make total sense :). Tweaking the tests won't take me very long. Do you think we should do it for the whole range of algorithms or just the obvious ones (SHA-1, SHA-256, SHA-512, MD5)? I can see an argument for not building in support for some of the weaker ones, to avoid encouraging their use.
`rand_val` is for the following use case: you have an identifier in a dataset, and it's sensitive - either because it contains PII, or because it's used multiple times in the system, or both. So you want to be able to hash it (retaining the uniqueness of the identifier) while also including some hard-to-replicate offset (to avoid being able to join multiple released datasets with the same identifier together). I've run into it a few times in the last couple of months relating to session reconstruction datasets: they consist entirely of [uuid], [timestamp], which is fine until you remember that some of the events represented are in other datasets, and realise someone could combine the sets to de-anonymise parts of it, even if the UUIDs are hashed.
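The join-prevention idea can be sketched in Python (names like `pseudonymize` and the SHA-256 choice are illustrative, not the package's API): a random salt, generated once per released dataset and then discarded, keeps identifiers unique within a release while making the hashed columns of two releases impossible to join.

```python
import hashlib
import secrets

# Generated once per release, then thrown away. A different release
# gets a different salt, so the same UUID hashes differently there.
salt = secrets.token_bytes(16)

def pseudonymize(uuid: str) -> str:
    return hashlib.sha256(salt + uuid.encode("utf-8")).hexdigest()

a = pseudonymize("a1b2c3")
b = pseudonymize("a1b2c3")
assert a == b  # stable within this release: joins inside it still work
```

Without the salt, an attacker can hash known UUIDs themselves (or just match identical hash values across datasets) and re-link the records.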
Yep; better way of putting it :). I'll tweak it this afternoon (heading out for the next... 4 hours, looks like).
Okay, renamed `rand_val` to `salt`, and also split out the error checking/handling/salt generation into its own function (prepping for us turning `hash` into `md5` et al.). Two open questions: (1) what's the `rawhash` use case, and (2) do you think we should support the full gamut of openssl hashing algorithms, or just the obvious ones (md5, sha1, sha256, sha512)?
`rawhash` is the natural interface, because a hash is a set of bytes. It is actually quite inefficient to put it in a string of ASCII hex characters. In most applications people translate it to a string for convenience, but in some cases you need the actual bytes.
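The bytes-versus-hex-string distinction is easy to see in Python's hashlib, where both views of the same digest are exposed side by side: the raw form is half the size, and the hex string is just an encoding of it.

```python
import hashlib

h = hashlib.sha256(b"foo")

raw = h.digest()       # the actual hash: 32 bytes
hexed = h.hexdigest()  # ASCII rendering: 64 hex characters

assert len(raw) == 32
assert len(hexed) == 64
assert bytes.fromhex(hexed) == raw  # hex is merely an encoding of the bytes
```

So a `rawhash`-style interface is what you want when storing digests compactly or feeding them into further binary operations, and the string form is a convenience layer on top.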
I think we should just expose all of the algorithms that are supported by openssl (unless they are likely to be removed in upcoming versions of libcrypto). Everyone knows that some of them are no longer safe, but it might still be useful to be able to check them, for example if a user is working with some old dataset that includes `md4` hashes. Let's just use a single manual page for all of them so we don't bloat the manual too much.
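As a point of comparison, Python's hashlib takes the same expose-what-openssl-provides approach: the available set is whatever the linked libcrypto offers, and legacy digests like md4 work only if that build still ships them.

```python
import hashlib

# The guaranteed set is small; algorithms_available reflects
# whatever digests the linked OpenSSL actually provides.
print(sorted(hashlib.algorithms_guaranteed))

# Legacy digests such as md4 depend on the OpenSSL build
# (in OpenSSL 3.x they live in the legacy provider):
try:
    print(hashlib.new("md4", b"old data").hexdigest())
except ValueError:
    print("md4 not available in this build")
```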
Makes sense. I'm kind of unsure about "everyone"; the audience here is any researcher, not cryptographic researchers.
Done.
It now looks like this: http://jeroenooms.ocpu.io/openssl/man/hash/text
Hah! Just pulled to implement this. Cool :)
What it says on the tin. Starting point is an example (say, MD5) in pure C; I'll port the rest after that.