jeroen / openssl

OpenSSL bindings for R

Vectorize bignum() #53

Closed: kevinykuo closed this issue 6 years ago

kevinykuo commented 6 years ago

Got a somewhat off-label use case here. I'm trying to encode high-cardinality categorical features (which come as character columns) into hash buckets. This involves calling bignum() on each hash and then applying %% to the result. Would it be possible to vectorize bignum() to make this more efficient, or is there a better way that I overlooked?

jeroen commented 6 years ago

Each bignum is already a raw vector under the hood, so it cannot easily be vectorized over multiple bignums. We would have to introduce a new data structure for lists of bignums, which would not give you much performance gain.
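A minimal sketch of that point, assuming the openssl package is installed: a single bignum is a classed raw vector, so several of them have to be held in a plain list. The list `bs` is only an illustrative stand-in, not a structure the package provides.

```r
library(openssl)

# One bignum: a raw vector carrying the S3 class "bignum"
b <- bignum("123456789012345678901234567890")
class(b)            # the S3 class attached to the raw bytes
is.raw(unclass(b))  # the underlying storage once the class is dropped

# Hypothetical stand-in for a "vector" of bignums: a plain list,
# which still needs one bignum() call per element
bs <- lapply(c("1", "2", "3"), bignum)
```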

Can you illustrate with some code what it is you would like to do exactly?

kevinykuo commented 6 years ago

@jeroen Thanks for the reply, that makes sense. I'm basically doing something like this:

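library(openssl)

# Hash each string with MD5, then reduce the digest modulo num_buckets.
# Assumes each remainder fits in one raw byte, so sapply() returns a raw vector.
string_to_hash_bucket <- function(x, num_buckets) {
  r <- sapply(x, function(s) bignum(md5(s), hex = TRUE) %% bignum(num_buckets))
  as.integer(r)
}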

v <- sample(letters, 100, replace = TRUE)
string_to_hash_bucket(v, 10)
# [1] 9 3 9 1 8 1 3 7 3 7 7 3 5 6 7 8 9 3 9 6 3 3 5 3 1 7 3 3 6 3 8 8 3 3 3 9
# [37] 7 1 6 6 9 1 6 9 1 8 8 8 6 3 5 5 5 3 3 9 5 7 3 1 5 6 6 3 8 3 8 6 5 5 8 8
# [73] 9 6 8 8 3 3 8 3 5 3 7 3 8 8 8 3 7 9 7 3 8 3 1 8 1 9 9 5
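
One possible way to avoid the per-string bignum() calls, sketched under assumptions rather than offered as the package's recommended approach: md5() in openssl is vectorized over character vectors, and the remainder of a hex digest can be accumulated digit by digit (Horner's rule) in ordinary doubles. The name hash_bucket_vectorized is hypothetical and not part of openssl, and the sketch assumes num_buckets is small enough that num_buckets * 16 stays well within double precision.

```r
library(openssl)

# Hypothetical helper: bucket strings by MD5 digest modulo num_buckets
# without creating a bignum per string.
hash_bucket_vectorized <- function(x, num_buckets) {
  # md5() returns one 32-character hex digest per input string
  hex <- as.character(md5(x))

  # 32 x length(x) matrix of hex digits converted to integers 0..15
  digits <- vapply(
    strsplit(hex, "", fixed = TRUE),
    function(ch) strtoi(ch, 16L),
    integer(32)
  )

  # Horner's rule: fold in one hex digit at a time, reducing as we go,
  # so the running value never exceeds num_buckets * 16
  r <- numeric(length(x))
  for (i in seq_len(32)) {
    r <- (r * 16 + digits[i, ]) %% num_buckets
  }
  as.integer(r)
}

v <- sample(letters, 100, replace = TRUE)
res <- hash_bucket_vectorized(v, 10)
```

For the same input and num_buckets this should yield the same buckets as the bignum-based version, since both compute the digest modulo num_buckets. That equivalence is worth checking on real data before relying on it.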