Variable hashes have too many collisions

arkocal commented 3 days ago

Summary: With about 60k variables encountered, we have about 1.04 entries per hash. With an ideal hash function the chances for even a single collision would have been very low.

The experiment can be repeated using the topdown solver on arkocal/hash_collision_demo branch, which includes a minimal module for counting how often each hash value appears.

In particular, variables that differ by a single string are mapped to the same hash.

The following changes already significantly reduce collisions (1.0008 keys per hash, no obvious similarities between colliding keys): In cdomains/stringDomain.ml (https://github.com/goblint/analyzer/blob/master/src/cdomain/value/cdomains/stringDomain.ml) :

let hash x =
  if get_string_domain () = Disjoint then
    hash x
  else
    (* 13859 <- old value*)
    hash x (* new value *)

In cdomains/intDomain.ml, add to line 3680, after let to_yojson ...

let hash = Hashtbl.hash

michael-schwarz commented 3 days ago

Nice catch!

@sim642: I think for stringDomain.ml this is simply an oversight from when we only had Disjoint and Unit? Or was there some reasoning behind it?

For the hashing in cdomains/intDomain.ml, I guess this comes from https://github.com/sim642/ppx_deriving_hash? Maybe something can also be tweaked there? Iirc the example that @arkocal had, the hash for T and Excluded{0} were the same, which is maybe not ideal as both are quite common.

sim642 commented 19 hours ago

I think for stringDomain.ml this is simply an oversight from when we only had Disjoint and Unit? Or was there some reasoning behind it?

That seems possible. All uses of get_string_domain () there are in ifs, not matches, so it's error-prone indeed when something is added.

In cdomains/intDomain.ml, add to line 3680, after let to_yojson ...
let hash = Hashtbl.hash

This is wrong because some of the int domains (e.g. def_exc, enums) contain OCaml Sets where structural equality and hashing lead to problems because sets don't have a canonical representation (equivalent sets may have different Hashtbl.hash values). So it may be fast, but wrong things can be arbitrarily fast.

Iirc the example that @arkocal had, the hash for T and Excluded{0} were the same, which is maybe not ideal as both are quite common.

I just tried (e.g. in base's init)

    let i1 = ID.top_of IInt in
    let i2 = ID.of_excl_list IInt [Z.zero] in
    Logs.error "%d %d" (ID.hash i1) (ID.hash i2)

and this prints out different hash values [Error] -75963065849016984 39336082039394930 at 0deedb9ca2aeb11d9e6562161f57eee1908af498.

michael-schwarz commented 17 hours ago

Maybe I also misremembered some of the details. @arkocal do you still have the example at hand?

arkocal commented 15 hours ago

I am not sure if the suggested hashes above are correct or performant. They are not meant to be solution suggestions, just illustrative examples for the source of collisions, sorry for the confusion.

Testing with dd_comb.c from coreutils, without overwriting the hash for intDomain, I get the following:

%%% coll: Colliding variables:
 L:node 34387 "prev = (uintmax_t )prev_nread;" on dd.c:918:11-918:40
and
 L:node 34387 "prev = (uintmax_t )prev_nread;" on dd.c:918:11-918:40
%%% coll: Hash of a: 4529258937874703681,
 Hash of b: 4529258937874703681
%%% coll: Values are equal: false

Coming from:

      let interesting = 4529258937874703681 in
      let colliding = HM.keys rho |> List.of_enum |> List.filter (fun k -> S.Var.hash k = interesting) in
      match colliding with
      | a::b::[] -> 
          if tracing then trace "coll" "Colliding variables:\n %a \nand\n %a" S.Var.pretty_trace a S.Var.pretty_trace b;
          if tracing then trace "coll" "Hash of a: %d,\n Hash of b: %d" (S.Var.hash a) (S.Var.hash b);
          if tracing then trace "coll" "Values are equal: %b" (S.Var.equal a b);
      | _ -> ();
      ;

Since the string representations are the same, this might also be an issue with S.Var.equal.

sim642 commented 15 hours ago

Enabling dbg.trace.context would print out the entire unknown with the context. The difference hopefully is visible there.

arkocal commented 14 hours ago

This is very strange, trying:

    let i1 = ID.top_of IInt in
    let i2 = ID.of_excl_list IInt [Z.zero] in
    Logs.error "%d %d" (ID.hash i1) (ID.hash i2)

Outputs

[Error] -75963065849016984 -75963065849016984

on my machine on 0deedb9ca2aeb11d9e6562161f57eee1908af498

arkocal commented 13 hours ago

With dbg.trace.context the only diffence is in base.size with T vs. (Not {0}((0,64)).

For completeness:

 L:(node 34387 "prev = (uintmax_t )prev_nread;", [expRelation:(),
                                                  mallocWrapper:(wrapper call:Unknown node, unique calls:{}),
                                                  base:({
                                                          Parameter {
                                                            fd ->   0
                                                            buf___1 ->   {?, &(alloc@sid:35301@tid:[main])[⊤], &(alloc@sid:35315@tid:[main])[⊤]}
                                                            size ->   ⊤
                                                          }
                                                        }, {}, {}, {}),
                                                  threadid:(wrapper call:unknown node, Thread:[main], created:(current function:created (once) * created (multiple times):(created (once):{maybe_close_stdout@dd.c:2015:3-2015:31, interrupt_handler@dd.c:736:5-736:109, siginfo_handler@dd.c:729:5-729:110}, created (multiple times):{maybe_close_stdout@dd.c:2015:3-2015:31, interrupt_handler@dd.c:736:5-736:109, siginfo_handler@dd.c:729:5-729:110}), callees:bot)),
                                                  threadflag:Multithreaded (main),
                                                  threadreturn:false,
                                                  escape:{},
                                                  mutexEvents:(),
                                                  access:(),
                                                  mutex:(lockset:{}, multiplicity:{}),
                                                  race:(),
                                                  mhp:(),
                                                  assert:(),
                                                  pthreadMutexType:()]) on dd.c:918:11-918:40
and

 L:(node 34387 "prev = (uintmax_t )prev_nread;", [expRelation:(),
                                                  mallocWrapper:(wrapper call:Unknown node, unique calls:{}),
                                                  base:({
                                                          Parameter {
                                                            fd ->   0
                                                            buf___1 ->   {?, &(alloc@sid:35301@tid:[main])[⊤], &(alloc@sid:35315@tid:[main])[⊤]}
                                                            size ->   (Not {0}([0,64]))
                                                          }
                                                        }, {}, {}, {}),
                                                  threadid:(wrapper call:unknown node, Thread:[main], created:(current function:created (once) * created (multiple times):(created (once):{maybe_close_stdout@dd.c:2015:3-2015:31, interrupt_handler@dd.c:736:5-736:109, siginfo_handler@dd.c:729:5-729:110}, created (multiple times):{maybe_close_stdout@dd.c:2015:3-2015:31, interrupt_handler@dd.c:736:5-736:109, siginfo_handler@dd.c:729:5-729:110}), callees:bot)),
                                                  threadflag:Multithreaded (main),
                                                  threadreturn:false,
                                                  escape:{},
                                                  mutexEvents:(),
                                                  access:(),
                                                  mutex:(lockset:{}, multiplicity:{}),
                                                  race:(),
                                                  mhp:(),
                                                  assert:(),
                                                  pthreadMutexType:()]) on dd.c:918:11-918:40

michael-schwarz commented 13 hours ago

Do you maybe have different versions of ppx_deriving_hash?

arkocal commented 12 hours ago

This would explain it.

I am getting the same values from a docker build as well. Looking at the Dockerfile it looks like my local state should not somehow creep into the container.

sim642 commented 12 hours ago

ppx_deriving_hash is not at fault because Goblint has a lower bound on the latest version. You couldn't even compile with a different version.

The problem is our fork of Zarith where we have version 1.12 with custom patches: https://github.com/goblint/Zarith/tree/goblint-1.12. Meanwhile, Zarith 1.14 improved the hash functions (https://github.com/ocaml/Zarith/pull/150) but our fork hasn't been adapted. Thus, depending on Gobview opam pins, the hash function used for the zero in the exclusion set could be different. The unlucky case happens when Z.hash Z.zero = 0.

That patch probably needs to be adapted if we want to benefit from Zarith improvements and keep Gobview.

arkocal commented 12 hours ago

With Zarith 1.14 I get the same values, and the few remaining conflicts have no obvious similarities.

michael-schwarz commented 12 hours ago

Yeah, potentially GobView can be retired at some point --- currently no one seems interested in it (correct me if I'm wrong). Though it feels like a waste to have invested so much effort to then not get a paper out of it.

goblint / analyzer

Variable hashes have too many collisions #1594