git - Githubissues

git analysis was previously discussed in #14, but since we now tried harder, I'm extracting that to this separate issue for better organization.

Below are comments from there with relevant analysis attempts.

@karoliineh: https://github.com/goblint/bench/issues/14#issuecomment-1747276667

I tried to analyze the git repository, once again. I tried it with generating a compilation database with bear. As there are a lot of different main functions in the project, @sim642 suggested that one might try to compile the commands separately. Therefore I tried analyzing the following 3 commands:

Just for fun, the first thing that came to mind:

NO_OPENSSL=1 NO_CURL=1 bear -- make git-cherry-pick
goblint compile_commands.json

Actually has pthread_create:

NO_OPENSSL=1 NO_CURL=1 bear -- make -j 11 git-grep
goblint compile_commands.json

Maybe analyzing some of the "simplest" commands with the simplest conf won't run out of memory?:

NO_OPENSSL=1 NO_CURL=1 bear -- make -j 11 git-add
goblint compile_commands.json --conf /analyzer/conf/examples/large-program.json

However, in all of my attempted runs, Goblint got killed (most possibly due to out-of-memory). @sim642 suggested that maybe one could try and run it on a server with more resources.

@sim642: https://github.com/goblint/bench/issues/14#issuecomment-1748944995

I tried this overnight on our server with 32GB RAM:

NO_OPENSSL=1 NO_CURL=1 bear -- make -j 15 git-grep
goblint --conf examples/large-program.json compile_commands.json

It mysteriously died after 3h20min: no "Killed", exit code 0, just stopped with no explanation.

Today I tried it again with some profiling and it crashed after the same time, this time explicitly "Killed". So feel free to try it on your own but I don't have too high expectations. At the moment of death Goblint was at this point:

runtime: 03:20:37.661
vars: 7080440, evals: 29235491

|rho|=7080440
|stable|=6572163
|infl|=7078192
|wpoint|=74592
|sides|=424929
|side_dep|=0
|side_infl|=0
|var_messages|=0
|rho_write|=0
|dep|=6226162
|called|=228
Found 415057 contexts for 1822 functions. Top 5 functions:
29556   contexts for L:entry state of vsnprintf (5042244) on /usr/include/x86_64-linux-gnu/bits/stdio2.h:81:1-87:1
24175   contexts for L:entry state of strbuf_vaddf (11685) on git/strbuf.c:393:1-412:1
18211   contexts for L:entry state of memcpy (5042287) on /usr/include/x86_64-linux-gnu/bits/string_fortified.h:25:1-31:1
16294   contexts for L:entry state of append_quoted_string (5089741) on git/json-writer.c:19:1-45:1
12448   contexts for L:entry state of jw_object_string (2135970) on git/json-writer.c:158:1-162:1

Memory statistics: total=21971776.09MB, max=35359.31MB, minor=21933689.68MB, major=401256.95MB, promoted=363170.54MB
    minor collections=10459247  major collections=121 compactions=0

Even with the large-program conf which has almost no analyses and nothing but pointers in contexts, there are still tens of thousands of contexts for some functions. This is with earlyglobs where the globals aren't in the local states/contexts either. According to the profiling, over 1h of the runtime was just spent on some silly die function (which can recurse!). In fact, there seems to be a lot of recursion in git. I also tried with --enable ana.context.widen and it seemed to make things even worse by never even getting out of analyzing some very first initialization function. There must be some stupidity happening in Goblint if the majority of time is spent analyzing error cases where git dies instead of the actual meat of git.

Notes on a few more attempts I did and profiled:

One possible reason for so many contexts with ana.base.context.non-ptr disabled could be string literals (StrPtr) in address sets, which would be preserved in contexts. ana.base.limit-string-addresses (true by default) doesn't do enough limitation. When true, string literals are handled like a flat lattice (a la constant propagation), which still yields separate contexts for all different calls with string literal arguments. When false, string literals are handled as a set.

It would be possible to do even more limiting: not track any string literals, i.e. use a unit domain for them. Currently string domains aren't properly modularized but I hacked it by making of_string always return StrPtr None. This at least got rid of many contexts for some functions, but not all.
I also tried something even more radical: making Goblint fully context- and path-insensitive:
```
--conf examples/large-program.json --set ana.path_sens '[]' --set ana.ctx_insens '["base","mallocWrapper","escape","mutex","mutexEvents","access","race","assert"]'
```
But that also seemed worse and got stuck analyzing some trace2 init function. The curse of no contexts at all is that any change due to one call destabilizes all call sites and thus can actually lead to worse performance.
For fun I also tried a degenerate analysis with --set ana.activated '[]', which is essentially a unit analysis. After 80s of parsing, etc, this finished solving in 2s. But of course this is wildly unsound and most code was dead because it cannot handle function pointer calls, which git seems to use. @vesalvojdani suggested trying out https://github.com/goblint/analyzer/pull/1063.

goblint / bench

git #62

@karoliineh: https://github.com/goblint/bench/issues/14#issuecomment-1747276667

@sim642: https://github.com/goblint/bench/issues/14#issuecomment-1748944995