ocaml-batteries-team / batteries-included

Batteries Included project
http://ocaml-batteries-team.github.com/batteries-included/hdoc2/
Other
518 stars 106 forks source link

hash functions in modules #492

Open c-cube opened 10 years ago

c-cube commented 10 years ago

having a sensible hash function in batInt, batString, etc. would make many functor applications easier (hashtbl and weak tables, for instance). Also, it can help make hashing specific to types:

let hash (i : int) = i land max_int
let hash (s : string) = (* function that hashes a string, or just Hashtbl.hash *)

(* ... *)

let hash (b : Big_int.big_int) = String.hash (Big_int.string_of_big_int b)

A very useful thing to have, also, is hash combinators (example) for lists, arrays, enums, similarly to printing functions.

gasche commented 10 years ago

Re. different hashing functions: Hashing is a subtle and difficult topic. People want speed, low number of conflicts, and (sometimes) conflict-attack security, and it's not possible to have all of them at once. Unless we have hashing experts that are really interested in the matter and can do something significantly better than just using Hashtbl.hash, I think we're better off following the standard library.

Re. hashing combinators for polymorphic data structures; that seems to be a reasonable idea, as it helps people that use specific hashing functions on their datatype of choice lift those to Batteries-provided datatypes. I am not quite sure which interface we would need for this and how it should work -- probably by exposing the "mixing" logic of the hash function that allows to add further data to be taken into account, but how resilient is that to changes in the hash function's design? The compiler libraries don't expose those primitives so we would need to add new external runtime dependencies for these.

I'm thinking of an interface such as, approximatively:

module type HashData = sig
  type t
  type initial
  type final
  val make : initial -> t
  val mix_int32 : t -> int32 -> t
  val mix_int64 : t -> int64 -> t
  val mix_intnat : t -> nativeint -> t
  val mix_float : t -> float -> t
  val mix_string : t -> string -> t
  val finalize : t -> final
end

The standard library would expose its hashing function with initial = int option (for using a seed or not) and final = unit (AFAIK the stdlib hash currently doesn't require any additional data/seeding to perform its final mix).

c-cube commented 10 years ago

I like the idea of having a "partial hash type", it's very useful for many hash functions anyway. I think the "Hashable" interface should be something like

module type Hashable = sig
  type t
  val mix : HashData.t -> t -> HashData.t
  val hash : t -> int

and

module HashData : sig
  type t
  val mix_int : t -> int -> t
  val mix_int32 : t -> int32 -> t
  (* ..... *)

  val make : int -> t
  val finalize : t -> int
end

or something more general like what you said. The function Hashable.hash is a shortcut for creating a HashData.t, mixing t in, and finalizing it.

gasche commented 10 years ago

First, let me cancel "HashData" as a name for this interface. It's terrible and I think HashImpl would be more appropriate: it exposes the particular of a hashing function. Or PartialHash.

Then I'll note that I feel like we're opening a whole can of worms with this HashImpl design. If the goal is simply to let users say "I have a BatSplayTree of my own datatype HighlyOptimizedClauses.t wish has this domain-specific hash function", we should be able to compute a hash using this domain-specific hash function without telling the world about how we also know how to mix doubles and strings. We could just require the domain-specific hash function to return an integer (or a union type of the types we know how to mix, but isn't that premature complexity?), and mix that with the compiler-provided hash function.

The alternate view I was having above and you expose explicitly in your answer is to see the user's domain-specific hashable type not as "can be hashed to an integer" but "can be mixed into a partial hash value". Then we have to let the user know how to mix, and ask the user to do the mixing in our stead. This can give better hashing properties, but at the cost of an important increase in complexity.

Note that none of the above requires Batteries code to be parametrized over the hash functions it itself uses its own datastructure: the elements are hashed as the user says, but the final result is still combined with the standard hashing function. We can instead try to functorize everything over a module with the HashImpl signature, allowing people to also change the mixing function used to combine list element values, etc. But that's yet more complex -- and possibly of doubtful utility, as there has been little interest in changing the hash functions so far, only pressure to use a balanced tree of buckets to avoid collision attacks.

Finally, we didn't consider how to limit the depth on which the structure is traversed to collect a hash -- hash_param in the standard library.

c-cube commented 10 years ago

I agree that using modules and abstract types make things very complicated. Again, a distinction has to be made between cryptographic/secure hashes (which require careful design), and a "simple" default hash function such as sdbm or dbj2 which could be used to compute hashes for Hashtbl, not for computing checksums or hashing passwords (cryptokit would be better).

Then, as you said, hash combinators would allow both the simple, default element hash function, or a user-provided one to be used. In this case, containers (e.g. List) could expose:

val hash_mix : (int -> 'a -> int) -> int -> 'a list -> int
val hash : (int -> 'a -> int) -> 'a list -> int

for respectively combine into an unfinalized hash, or compute a whole hash of the list.

c-cube commented 10 years ago

I forgot to add that for some structures, a user-defined hash function is mandatory because the regular hash function isn't compatible with the equality. For instance, two Set/Map instances may have distinct internal structures but be structurally equal, and the correct way of hashing them is using a prefix traversal.

gasche commented 10 years ago

That's right. When replying to your first post, I considered this problem and wrongly thought that the two equivalent structures would fall into the same bucket, and thus be handled as collisions by running the user-provided (and equivalence-aware) equality function on them, but that doesn't work if they don't actually have the same hash.

UnixJunkie commented 10 years ago

For the name, why not HashItf or HashIntf since it is an interface?

UnixJunkie commented 10 years ago

Or Hashable even

thelema commented 10 years ago

I have some domain expertise on hashing, and Simon is right that non-crypto hashing is much easier than crypto hashes. That said, we should be able to parameterize containers on both the mixing function and per-element hashing. Something like:

type mix_int : int -> int -> int

... val hash_gen : mix_int -> ('a -> int) -> 'a t -> int

Being able to seed the hash is nice, although the usefulness depends on the mixing function used. I'd be fine leaving this off until someone needs it. Finalizing the hashing into a smaller state isn't particularly important for non-crypto hashes. The general structure of hashing values into integers and mixing these integers together in some order until a single result is obtained is easy to get good results with.

There's some good integer mixing functions out there: http://burtleburtle.net/bob/hash/integer.html It wouldn't hurt to have these in ocaml; hopefully the missing bit doesn't cause too much trouble.

E.

On Sun, Jan 5, 2014 at 8:19 PM, Francois Berenger notifications@github.comwrote:

Or Hashable even

— Reply to this email directly or view it on GitHubhttps://github.com/ocaml-batteries-team/batteries-included/issues/492#issuecomment-31621791 .

c-cube commented 10 years ago

So, a BatHash module with

type mix_int = int -> int -> int

type 'a t = mix_int -> 'a -> int

val mix1 : mix_int
val mix2 : mix_int
...                      

and other modules implementing the t BatHash.t, for instance

module List : sig
  ...                   
    val hash : 'a BatHash.t -> 'a list BatHash.t
 end

would satisfy everyone here? The same module could also include later a distinct, more complicated type for cryptographic hashes.

thelema commented 10 years ago

The one downside in this design is that it doesn't make much sense to have a int BatHash.t take a mix_int as argument. Maybe separate that as another parameter to List.hash: val hash : BatHash.mix_int -> 'a BatHash.t -> 'a list BatHash.t and leave type BatHash.t = 'a -> int?

On Sun, Jan 12, 2014 at 12:41 PM, Simon Cruanes notifications@github.comwrote:

So, a BatHash module with

type mix_int = int -> int -> int type 'a t = mix_int -> 'a -> int val mix1 : mix_intval mix2 : mix_int...

and other modules implementing the t BatHash.t, for instance

module List : sig ... val hash : 'a BatHash.t -> 'a list BatHash.t end

would satisfy everyone here? The same module could also include later a distinct, more complicated type for cryptographic hashes.

— Reply to this email directly or view it on GitHubhttps://github.com/ocaml-batteries-team/batteries-included/issues/492#issuecomment-32128108 .

gasche commented 10 years ago

I think that I would like it better if we could keep the internal type of the hash function abstract. Not only would that help when programming, it would also make us able to support hash functions that really need 32 or 64 bits to be defined.

thelema commented 10 years ago

@gasche keeping internal type abstract is great for the more complex hash case, with a seed, post-processing, etc. For hash table use, hashing speed is important, and I expect we'll have to pay in speed for the abstraction.

E.

On Mon, Jan 13, 2014 at 3:42 AM, gasche notifications@github.com wrote:

I think that I would like it better if we could keep the internal type of the hash function abstract. Not only would that help when programming, it would also make us able to support hash functions that really need 32 or 64 bits to be defined.

— Reply to this email directly or view it on GitHubhttps://github.com/ocaml-batteries-team/batteries-included/issues/492#issuecomment-32152173 .

c-cube commented 10 years ago

@thelema I agree. If required, a module could include several hash functions, that use distinct mixing functions, but overall simplicity beats genericity in this case. Hashing is mostly useful for hashtables, so it needs to be simple and easy to use (t -> int then).