noahshinn / reflexion

[NeurIPS 2023] Reflexion: Language Agents with Verbal Reinforcement Learning

Interpreting results files #3

Closed sachit-menon closed 1 year ago

sachit-menon commented 1 year ago

Hi, super interesting work here! I was wondering how to interpret the results files in the repo root. Initially I thought `is_solved` meant correct or not, and that would indeed give 87.8% for Reflexion with GPT-4; but the equivalent file without Reflexion gives 81.7%, when my impression is that it should be 67%. How should I be interpreting the columns, and if `is_solved` != correct, how do I check correctness? Also, I see the reflections for the entries where `is_solved` is False, but not the predicted (incorrect) solutions -- how can I see those?
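
For reference, a minimal sketch of tallying `is_solved` across one of these results files, assuming each file is JSON Lines with one record per problem; the path below is a placeholder rather than an actual filename from the repo:

    import json

    # Placeholder path; point this at one of the results files in the repo root.
    path = "results.jsonl"

    with open(path) as f:
        items = [json.loads(line) for line in f if line.strip()]

    solved = sum(1 for item in items if item.get("is_solved"))
    print(f"{solved}/{len(items)} solved ({solved / len(items):.1%})")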

drammen94 commented 1 year ago

@sachit-menon I know this is unrelated to your question, but did you change any specific parameters when running it? I get `is_solved = False` on all the tasks.

sachit-menon commented 1 year ago

I haven't even tried running it yet; I'm just looking at the output files already included in the repo 😅

drammen94 commented 1 year ago

Looks like the code doesn't log the incorrect solutions

From reflexion.py:

    if is_solved:
        item["is_solved"] = True
    else:
        item["is_solved"] = False
        item["solution"] = ""
    item["reflections"] = reflections
    write_jsonl(log_path, [item], append=True)

noahshinn commented 1 year ago

> Hi, super interesting work here! I was wondering how to interpret the results files in the repo root. Initially I thought `is_solved` meant correct or not, and that would indeed give 87.8% for Reflexion with GPT-4; but the equivalent file without Reflexion gives 81.7%, when my impression is that it should be 67%. How should I be interpreting the columns, and if `is_solved` != correct, how do I check correctness? Also, I see the reflections for the entries where `is_solved` is False, but not the predicted (incorrect) solutions -- how can I see those?

Hi, thanks for the note! I've pushed some changes that clean up some of the logic, along with a rerun of the GPT-4 results for those who do not want to rerun the entire benchmark. Use ./validate_py_results.py as a utility script to check new results if you choose to rerun.

noahshinn commented 1 year ago

> Looks like the code doesn't log the incorrect solutions
>
> From reflexion.py:
>
>     if is_solved:
>         item["is_solved"] = True
>     else:
>         item["is_solved"] = False
>         item["solution"] = ""
>     item["reflections"] = reflections
>     write_jsonl(log_path, [item], append=True)

Thanks for the note. I've cleaned up the code a bit to remove redundancy. Previously, I was only logging successful solutions.
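
As a rough illustration of the change being described (a sketch, not the repository's actual diff), logging every attempt could look like the snippet below; `solution` is a hypothetical name for whatever variable holds the last generated attempt:

    # Sketch only: record the outcome and the last attempt in every case,
    # so failed solutions stay inspectable alongside their reflections.
    item["is_solved"] = is_solved
    item["solution"] = solution  # hypothetical: the last generated attempt
    item["reflections"] = reflections
    write_jsonl(log_path, [item], append=True)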

sachit-menon commented 1 year ago

Thanks Noah! On line 30 of validate_results.py I think there's a typo -- `green_text_out = green_text(f"passes {num_tests}/{num_tests} test cases")` is what displays how many tests pass, but that's just the same variable in both slots. Where is the actual number of passing test cases computed, as opposed to just the total number present in `item["test"]`?

Edit: actually, I guess it's not a typo, my bad -- if there are no exceptions, all of the test cases passed, since they're plain asserts. So it doesn't measure how many passed, just all-or-nothing. (Still interested in the question below!)
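
For context, a minimal sketch of the kind of all-or-nothing check described above, assuming `item["solution"]` holds the candidate code and `item["test"]` holds a block of assert statements (field names taken from the snippets in this thread; the helper itself is hypothetical):

    def passes_all_tests(item: dict) -> bool:
        """Execute the candidate solution and its assert-based tests together.

        Because the tests are plain asserts, the first failure raises an
        exception, so the result is all-or-nothing rather than a per-test count.
        """
        namespace: dict = {}
        try:
            exec(item["solution"], namespace)
            exec(item["test"], namespace)
            return True
        except Exception:
            return False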

Also, is `is_solved` being False equivalent to no test cases passing? I interpreted your previous comment as saying that you didn't save the incorrect solutions before but that they're in the updated files; however, the `solution` field still looks blank for any `is_solved` False cases.