Add diff command using OpenAI JSON function call

This adds a diff function, which returns proposed changes in JSON format. In the example below, GPT suggested a single change; a second request would likely push it to also provide a pair_hash implementation.

% g++ -x c++ `./tests/anonymizer.py tests/c++/missing-hash.cpp` |& cwhy --llm gpt-4 --max-context=10 diff
{
  "diff": {
    "modifications": [
      {
        "filename": "/tmp/tmpxibfgyeh",
        "start-line-number": 13,
        "number-lines-remove": 1,
        "replacement": "std::unordered_set<std::pair<int, int>, pair_hash> visited;"
      }
    ]
  }
}
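
As an illustration of how a response like the one above could be consumed, here is a minimal sketch that applies the modifications to a list of source lines. The helper name `apply_modifications` is hypothetical, and the sketch assumes that line numbers are 1-indexed and refer to the file before any change is applied; neither assumption is confirmed by the runner's actual code.

```python
import json


def apply_modifications(lines, modifications):
    """Return a new list of lines with every modification applied.

    Assumption: "start-line-number" is 1-indexed and refers to the
    original, unmodified file.
    """
    result = list(lines)
    # Apply from the bottom of the file upwards so that earlier line
    # numbers are not shifted by later replacements.
    ordered = sorted(modifications,
                     key=lambda m: m["start-line-number"],
                     reverse=True)
    for mod in ordered:
        start = mod["start-line-number"] - 1
        end = start + mod["number-lines-remove"]
        result[start:end] = mod["replacement"].split("\n")
    return result


response = json.loads("""
{
  "diff": {
    "modifications": [
      {
        "filename": "/tmp/tmpxibfgyeh",
        "start-line-number": 13,
        "number-lines-remove": 1,
        "replacement": "std::unordered_set<std::pair<int, int>, pair_hash> visited;"
      }
    ]
  }
}
""")

# Placeholder file contents, made up for the example.
source = [f"// line {i}" for i in range(1, 16)]
patched = apply_modifications(source, response["diff"]["modifications"])
print(patched[12])
```

Applying modifications bottom-up is a design choice for the sketch: it keeps the line numbers of pending modifications valid when a replacement has more or fewer lines than it removes.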

It also adds a test runner for this new command, which essentially tries to apply modifications recursively until the code compiles or a maximum tree depth is reached. Successive errors are not kept in context: each GPT request is sent as if it were the first one.
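
The retry strategy described above can be sketched roughly as follows. This is not the code in tests/runner-diff.py: `repair`, `compile_fn` and `propose_fn` are hypothetical names, with `compile_fn` standing in for the compiler invocation and `propose_fn` for the GPT request plus patch application.

```python
def repair(source, compile_fn, propose_fn, depth=1, max_depth=5):
    """Try to repair `source`; return (retries_used, final_source).

    `retries_used` is None when no clean compile was reached within
    `max_depth` requests.
    """
    errors = compile_fn(source)
    if not errors:
        return depth - 1, source
    if depth > max_depth:
        return None, source
    # Only the current source and its current errors are passed on, so
    # each request looks like the first one (no history is kept).
    patched = propose_fn(source, errors)
    return repair(patched, compile_fn, propose_fn, depth + 1, max_depth)


# Toy stand-ins so the sketch runs without a compiler or an API key.
def fake_compile(src):
    return [] if src.count("fix") >= 2 else ["error: still broken"]


def fake_propose(src, errors):
    return src + " fix"


retries, fixed = repair("code", fake_compile, fake_propose)
print(retries, repr(fixed))
```

With these stand-ins the second proposal produces a clean compile, so `retries` is 2.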

% python3 tests/runner-diff.py C++ missing-hash.cpp
============================ CWhy Test Runner ============================
LLM              : gpt-4
Timeout          : 180
Iterations       : 10
Max retries      : 5
Benchmark        : C++/missing-hash.cpp
==========================================================================
Trying to repair missing-hash.cpp (/tmp/tmpkbi0wfgg)...
Failed after 5 retries. Total tokens used per retry: [3493, 3492, 3223, 632, 864]
Trying to repair missing-hash.cpp (/tmp/tmpyg5mz2sh)...
Failed after 5 retries. Total tokens used per retry: [3483, 3042, 3578, 1001, 558]
Trying to repair missing-hash.cpp (/tmp/tmpdhmh9i1h)...
Failed after 5 retries. Total tokens used per retry: [3493, 689, 1060, 828, 1373]
Trying to repair missing-hash.cpp (/tmp/tmpn7x1prt2)...
Success in 1 retries! Total tokens used per retry: [3620]
Trying to repair missing-hash.cpp (/tmp/tmpfx17tava)...
Failed after 5 retries. Total tokens used per retry: [3467, 3622, 1181, 4928, 883]
Trying to repair missing-hash.cpp (/tmp/tmpgu5godhv)...
Failed after 5 retries. Total tokens used per retry: [3596, 1109, 1578, 1920, 553]
Trying to repair missing-hash.cpp (/tmp/tmplcl4a4f3)...
Success in 3 retries! Total tokens used per retry: [3645, 304, 259]
Trying to repair missing-hash.cpp (/tmp/tmp5q2kymdv)...
Success in 1 retries! Total tokens used per retry: [3592]
Trying to repair missing-hash.cpp (/tmp/tmproxu_zg7)...
Success in 1 retries! Total tokens used per retry: [3621]
Trying to repair missing-hash.cpp (/tmp/tmpea3cehav)...
Success in 1 retries! Total tokens used per retry: [3590]

C++/missing-hash.cpp success rate: 50.00%
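
The reported rate matches the share of iterations that end in success (5 of 10 above). A small sketch that recomputes it from a runner log, assuming the log lines keep the "Success in" / "Failed after" wording shown above; the function name `success_rate` is hypothetical:

```python
import re


def success_rate(log_text):
    """Return the percentage of iterations that ended in success."""
    outcomes = re.findall(r"^(Success in|Failed after) \d+ retries",
                          log_text, flags=re.MULTILINE)
    if not outcomes:
        return 0.0
    successes = sum(1 for outcome in outcomes if outcome == "Success in")
    return 100.0 * successes / len(outcomes)


# Four lines copied from the output above.
sample = """\
Failed after 5 retries. Total tokens used per retry: [3493, 3492, 3223, 632, 864]
Success in 1 retries! Total tokens used per retry: [3620]
Success in 3 retries! Total tokens used per retry: [3645, 304, 259]
Failed after 5 retries. Total tokens used per retry: [3483, 3042, 3578, 1001, 558]
"""
print(f"{success_rate(sample):.2f}%")  # 50.00%
```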

The current default for max retries is 5; I've seen successful runs take up to 4 retries (3 in the sample output above). It is typically pretty clear to the human eye when GPT is lost for a given tree, but I'm not sure whether there is a way to detect these cases and exit early.

For the missing-hash example that I've been experimenting with, failures typically happen because:

GPT tries to convert the set to a map.
GPT tries to use Boost.
GPT alters the code too much and cannot recover (brace issues, off-by-one line numbers).
Line numbers are wrong.

It seems to do well with the Python example, with failures coming from indentation (also the reason many runs take 2 retries instead of 1). Maybe we can do something about that.

% python3 tests/runner-diff.py Python concatenate-str-to-int.py
============================ CWhy Test Runner ============================
LLM              : gpt-4
Timeout          : 180
Iterations       : 10
Max retries      : 5
Benchmark        : Python/concatenate-str-to-int.py
==========================================================================
Trying to repair concatenate-str-to-int.py (/tmp/tmpnvd_pwwh)...
Success in 2 retries! Total tokens used per retry: [321, 222]
Trying to repair concatenate-str-to-int.py (/tmp/tmp1g7zjm1w)...
Success in 2 retries! Total tokens used per retry: [329, 226]
Trying to repair concatenate-str-to-int.py (/tmp/tmp1zmrclxw)...
Failed after 5 retries. Total tokens used per retry: [325, 224, 211, 222, 222]
Trying to repair concatenate-str-to-int.py (/tmp/tmpjwsobezo)...
Success in 2 retries! Total tokens used per retry: [321, 222]
Trying to repair concatenate-str-to-int.py (/tmp/tmp1vykp2tb)...
Success in 2 retries! Total tokens used per retry: [321, 222]
Trying to repair concatenate-str-to-int.py (/tmp/tmpnsfmy39g)...
Success in 2 retries! Total tokens used per retry: [321, 222]
Trying to repair concatenate-str-to-int.py (/tmp/tmpbtprjz36)...
Success in 2 retries! Total tokens used per retry: [323, 226]
Trying to repair concatenate-str-to-int.py (/tmp/tmpv95awox3)...
Success in 2 retries! Total tokens used per retry: [321, 222]
Trying to repair concatenate-str-to-int.py (/tmp/tmpytv0fs6r)...
Success in 2 retries! Total tokens used per retry: [327, 228]
Trying to repair concatenate-str-to-int.py (/tmp/tmpz8ze9xf3)...
Success in 2 retries! Total tokens used per retry: [325, 224]

Python/concatenate-str-to-int.py success rate: 90.00%
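
One possible mitigation for the indentation failures mentioned above would be to re-indent each replacement to match the line it replaces before applying it. The sketch below is only an idea, not something this PR implements; `reindent` is a hypothetical helper and the sample lines are made up.

```python
import textwrap


def reindent(replacement, original_line):
    """Indent `replacement` like `original_line`, keeping relative indentation."""
    # Leading whitespace of the line being replaced.
    prefix = original_line[: len(original_line) - len(original_line.lstrip())]
    # Strip whatever common indentation GPT produced, then add the original one.
    return textwrap.indent(textwrap.dedent(replacement), prefix)


original = '    print("total: " + count)'
proposed = 'print("total: " + str(count))'
print(repr(reindent(proposed, original)))
```

This assumes the replaced line already sits at the correct nesting level, so it would get in the way of a fix that legitimately changes the indentation, such as wrapping a statement in a new block.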

plasma-umass / cwhy

Add diff command using OpenAI JSON function call #15