plasma-umass / cwhy

"See why!" Explains and suggests fixes for compile-time errors for C, C++, C#, Go, Java, LaTeX, PHP, Python, Ruby, Rust, and TypeScript
Apache License 2.0

Add diff command using OpenAI JSON function call #15

Closed nicovank closed 1 year ago

nicovank commented 1 year ago

This adds a diff function, which returns proposed changes in JSON format. In the example below, GPT suggested a single change; a second request would likely push it to also provide a pair_hash implementation.

% g++ -x c++ `./tests/anonymizer.py tests/c++/missing-hash.cpp` |& cwhy --llm gpt-4 --max-context=10 diff
{
  "diff": {
    "modifications": [
      {
        "filename": "/tmp/tmpxibfgyeh",
        "start-line-number": 13,
        "number-lines-remove": 1,
        "replacement": "std::unordered_set<std::pair<int, int>, pair_hash> visited;"
      }
    ]
  }
}
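For illustration, here is a minimal sketch of how a consumer might apply such a response to a file's lines. It is not the code from this PR: the helper name `apply_modifications` is hypothetical, and it assumes `start-line-number` is 1-indexed and that every modification targets the same file.

```python
import json


def apply_modifications(lines, modifications):
    """Return a new list of lines with every modification applied.

    Assumption (not confirmed by the PR): "start-line-number" is 1-indexed
    and all modifications refer to the same file.
    """
    result = list(lines)
    # Apply from the bottom of the file upwards so that earlier
    # line numbers are still valid after each replacement.
    ordered = sorted(
        modifications, key=lambda m: m["start-line-number"], reverse=True
    )
    for mod in ordered:
        start = mod["start-line-number"] - 1
        end = start + mod["number-lines-remove"]
        result[start:end] = mod["replacement"].split("\n")
    return result


# The response shown above, parsed from JSON.
response = json.loads("""
{
  "diff": {
    "modifications": [
      {
        "filename": "/tmp/tmpxibfgyeh",
        "start-line-number": 13,
        "number-lines-remove": 1,
        "replacement": "std::unordered_set<std::pair<int, int>, pair_hash> visited;"
      }
    ]
  }
}
""")

# Placeholder file contents, 20 lines long, standing in for the real source.
source = [f"// line {i}" for i in range(1, 21)]
patched = apply_modifications(source, response["diff"]["modifications"])
```

Applying modifications bottom-up avoids having to adjust line numbers when a replacement adds or removes lines.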

It also adds a test runner for this new command, which essentially tries to apply modifications recursively until the code compiles or some max tree depth is reached. Successive errors are not kept in context: each GPT request is sent as if it were the first one.
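The loop described above can be sketched roughly as follows. This is a simplified single-path version, not the actual `tests/runner-diff.py`: `repair`, `check`, and `suggest` are hypothetical names, where `check` stands in for the compiler (returning an error message or `None`) and `suggest` stands in for one stateless GPT request.

```python
from typing import Callable, Optional, Tuple


def repair(
    source: str,
    check: Callable[[str], Optional[str]],
    suggest: Callable[[str, str], str],
    depth: int = 0,
    max_depth: int = 5,
) -> Tuple[Optional[str], int]:
    """Return (repaired source or None, number of requests made)."""
    error = check(source)
    if error is None:
        return source, depth  # the code compiles: success
    if depth == max_depth:
        return None, depth  # give up on this path
    # Only the current source and current error are passed on;
    # earlier errors are deliberately not kept in context.
    return repair(suggest(source, error), check, suggest, depth + 1, max_depth)


# Toy stand-ins: "compiling" succeeds once the statement ends with ';'.
def toy_check(src: str) -> Optional[str]:
    return None if src.endswith(";") else "expected ';'"


def toy_suggest(src: str, error: str) -> str:
    return src + ";"


fixed, requests = repair("int x = 1", toy_check, toy_suggest)
```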

% python3 tests/runner-diff.py C++ missing-hash.cpp
============================ CWhy Test Runner ============================
LLM              : gpt-4
Timeout          : 180
Iterations       : 10
Max retries      : 5
Benchmark        : C++/missing-hash.cpp
==========================================================================
Trying to repair missing-hash.cpp (/tmp/tmpkbi0wfgg)...
Failed after 5 retries. Total tokens used per retry: [3493, 3492, 3223, 632, 864]
Trying to repair missing-hash.cpp (/tmp/tmpyg5mz2sh)...
Failed after 5 retries. Total tokens used per retry: [3483, 3042, 3578, 1001, 558]
Trying to repair missing-hash.cpp (/tmp/tmpdhmh9i1h)...
Failed after 5 retries. Total tokens used per retry: [3493, 689, 1060, 828, 1373]
Trying to repair missing-hash.cpp (/tmp/tmpn7x1prt2)...
Success in 1 retries! Total tokens used per retry: [3620]
Trying to repair missing-hash.cpp (/tmp/tmpfx17tava)...
Failed after 5 retries. Total tokens used per retry: [3467, 3622, 1181, 4928, 883]
Trying to repair missing-hash.cpp (/tmp/tmpgu5godhv)...
Failed after 5 retries. Total tokens used per retry: [3596, 1109, 1578, 1920, 553]
Trying to repair missing-hash.cpp (/tmp/tmplcl4a4f3)...
Success in 3 retries! Total tokens used per retry: [3645, 304, 259]
Trying to repair missing-hash.cpp (/tmp/tmp5q2kymdv)...
Success in 1 retries! Total tokens used per retry: [3592]
Trying to repair missing-hash.cpp (/tmp/tmproxu_zg7)...
Success in 1 retries! Total tokens used per retry: [3621]
Trying to repair missing-hash.cpp (/tmp/tmpea3cehav)...
Success in 1 retries! Total tokens used per retry: [3590]

C++/missing-hash.cpp success rate: 50.00%

The current default for max retries is 5; I've seen successful runs take up to 4 retries (3 in the sample output above). It is typically pretty clear to the human eye when GPT is lost for a given tree, but I'm not sure whether there is any way to detect these cases and exit early.
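One possible heuristic for exiting early, offered only as a sketch and not something this PR implements, is to stop when a path revisits a (source, error) state it has already seen, since a stateless request on the same input is unlikely to make progress. The `LoopDetector` class below is hypothetical.

```python
import hashlib


class LoopDetector:
    """Remembers (source, error) pairs seen so far on one repair path."""

    def __init__(self) -> None:
        self._seen = set()

    def is_repeat(self, source: str, error: str) -> bool:
        # Hash the pair so that large sources are cheap to remember.
        key = hashlib.sha256((source + "\0" + error).encode("utf-8")).hexdigest()
        if key in self._seen:
            return True
        self._seen.add(key)
        return False


detector = LoopDetector()
first = detector.is_repeat("int x = 1", "expected ';'")
second = detector.is_repeat("int x = 1", "expected ';'")
```

This only catches exact cycles; it would not detect a path that keeps producing new but equally broken code.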

For the missing-hash example that I've been experimenting with, failures typically happen because:

nicovank commented 1 year ago

It seems to do well with the Python example; the failures come from indentation (which is also why many runs take 2 retries instead of 1). Maybe we can do something about that.
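One option, sketched here under the assumption that a replacement should start at the same indentation level as the first line it replaces, is to re-indent the replacement before applying it. The helper `match_indentation` is hypothetical and is not part of this PR.

```python
import textwrap


def match_indentation(original_line: str, replacement: str) -> str:
    """Re-indent `replacement` to start at `original_line`'s indentation.

    Relative indentation between the replacement's own lines is preserved.
    """
    stripped = original_line.lstrip()
    indent = original_line[: len(original_line) - len(stripped)]
    # Remove whatever common indentation the model produced...
    dedented = textwrap.dedent(replacement)
    # ...then prefix every non-blank line with the original indentation.
    return textwrap.indent(dedented, indent)


# Placeholder lines for illustration, not taken from the benchmark file.
fixed_line = match_indentation("    print(a + b)", "print(a + str(b))")
```

This would not help when the correct fix genuinely changes the nesting level, so it is a mitigation at best.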

% python3 tests/runner-diff.py Python concatenate-str-to-int.py
============================ CWhy Test Runner ============================
LLM              : gpt-4
Timeout          : 180
Iterations       : 10
Max retries      : 5
Benchmark        : Python/concatenate-str-to-int.py
==========================================================================
Trying to repair concatenate-str-to-int.py (/tmp/tmpnvd_pwwh)...
Success in 2 retries! Total tokens used per retry: [321, 222]
Trying to repair concatenate-str-to-int.py (/tmp/tmp1g7zjm1w)...
Success in 2 retries! Total tokens used per retry: [329, 226]
Trying to repair concatenate-str-to-int.py (/tmp/tmp1zmrclxw)...
Failed after 5 retries. Total tokens used per retry: [325, 224, 211, 222, 222]
Trying to repair concatenate-str-to-int.py (/tmp/tmpjwsobezo)...
Success in 2 retries! Total tokens used per retry: [321, 222]
Trying to repair concatenate-str-to-int.py (/tmp/tmp1vykp2tb)...
Success in 2 retries! Total tokens used per retry: [321, 222]
Trying to repair concatenate-str-to-int.py (/tmp/tmpnsfmy39g)...
Success in 2 retries! Total tokens used per retry: [321, 222]
Trying to repair concatenate-str-to-int.py (/tmp/tmpbtprjz36)...
Success in 2 retries! Total tokens used per retry: [323, 226]
Trying to repair concatenate-str-to-int.py (/tmp/tmpv95awox3)...
Success in 2 retries! Total tokens used per retry: [321, 222]
Trying to repair concatenate-str-to-int.py (/tmp/tmpytv0fs6r)...
Success in 2 retries! Total tokens used per retry: [327, 228]
Trying to repair concatenate-str-to-int.py (/tmp/tmpz8ze9xf3)...
Success in 2 retries! Total tokens used per retry: [325, 224]

Python/concatenate-str-to-int.py success rate: 90.00%