noahshinn / reflexion

[NeurIPS 2023] Reflexion: Language Agents with Verbal Reinforcement Learning
MIT License

Potential unexpected behavior in executor (that uses `globals()`) #40

Open allanj opened 6 months ago

allanj commented 6 months ago

I was running the script https://github.com/noahshinn/reflexion/blob/main/programming_runs/run_reflexion_codellama_multi.sh with the CodeLLaMA model, simply changing `codellama` to `codellama-7b`:

CUDA_VISIBLE_DEVICES=$1 python main.py \
  --run_name "reflexion_codellama_$1" \
  --root_dir "root" \
  --dataset_path ./benchmarks/humaneval-py.jsonl \
  --strategy "reflexion" \
  --language "py" \
  --model "codellama-7b" \
  --pass_at_k "1" \
  --max_iters "2" \
  --verbose | tee ./logs/reflexion_codellama_$1

The performance is around 30–40%, though the run gets killed in the middle of the experiment.

Observation

Some predictions exhibit unexpected behavior: they end up marked as correct (`is_solved = True`) when they should not be. In the following prediction (which is also attached here):

  1. The first program generated actually failed the generated test cases (NOTE: these test cases were themselves generated by CodeLLaMA-7B), although this program does eventually pass the ground-truth test cases.
  2. After reflection, the model actually generates nothing (i.e., `None`).
  3. So presumably we should take `""` as our final output, right?
  4. But `is_solved` is actually assigned `True`.

[screenshot of the prediction attached in the original issue]

Reason

Because we always use the function `globals()`, even when the program after reflection is empty, `globals()` still contains the program from the first attempt.

Thus, since we do not clear the information in `globals()` between attempts, this can lead to unexpected behavior.
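
A minimal sketch of this failure mode (the snippets and names below are illustrative assumptions, not the repo's actual executor code):

    first_attempt = """
    def add(a, b):
        return a + b
    """

    second_attempt = ""  # reflection produced no extractable code

    exec(first_attempt, globals())   # defines add() in the shared namespace
    exec(second_attempt, globals())  # defines nothing, but add() is still there

    # the test harness now silently runs against the first attempt:
    assert add(2, 3) == 5  # passes, so the empty second attempt looks solved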

I am a bit confused here; I hope the authors can clarify this further.

allanj commented 6 months ago

In particular, `globals()` will definitely memorize previous implementations, and those will be used if a future generation is `None`.

Alternatively, a future generation may reference something (e.g., some functions) that it does not define itself but that is memorized in `globals()` from earlier attempts. Potentially, that contaminates the evaluation.

That also means the order in which the evaluation samples are run matters.
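
For illustration, a sketch of that contamination (again with made-up snippets):

    earlier_generation = """
    def helper(x):
        return x * 2
    """

    later_generation = """
    def solve(x):
        return helper(x) + 1  # helper() is never defined in this generation
    """

    exec(earlier_generation, globals())
    exec(later_generation, globals())
    print(solve(3))  # prints 7, but only because helper() is stale in globals()

Run `later_generation` on its own and `solve(3)` raises a `NameError`, which is exactly why the order of the evaluation samples matters.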

reyrey-app commented 5 months ago

So if I understand correctly, it could potentially cause incorrect answers to be marked as correct?

allanj commented 5 months ago

Yes, right!

But I assume that for more capable models like GPT-3.5/4 or other large models, such an issue rarely happens.

I just got in touch with the authors and am currently working on a fix.

allanj commented 5 months ago

I think the current issue is that we need to decide what to do when `cur_func_impl` is `None`.

We always assert on the current implementation:

assert isinstance(cur_func_impl, str)

because it is possible that the response is pure natural language without any extractable code, which makes the variable `cur_func_impl` `None`.
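
Concretely, extraction returning `None` looks roughly like this; the regex extractor below is a hypothetical stand-in for the repo's actual parsing code:

    import re
    from typing import Optional

    def extract_code(response: str) -> Optional[str]:
        """Hypothetical extractor: return the first fenced Python block,
        or None when the model replied in pure natural language."""
        match = re.search(r"```python\n(.*?)```", response, re.DOTALL)
        return match.group(1) if match else None

    assert extract_code("Sorry, I can't improve this function.") is None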

My current attempt:

  1. For the first attempt, we just set it to the empty string `""`, since it is the initial implementation:

         if cur_func_impl is None:
             # code cannot be extracted from the response
             cur_func_impl = ""

  2. For the reflection attempt, we can roll back and reflect again, rather than putting in an empty string (a consolidated sketch of both cases follows this list):

         if reflected_func_impl is None:
             # code cannot be extracted from the response,
             # so skip this iteration and reflect again
             cur_iter += 1
             continue
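
Put together, the handling might look like the sketch below inside the iteration loop. The generation and test helpers are stubs I made up so the sketch runs standalone, and the variable names are assumed from the strategy code; this is not the exact repo diff.

    from typing import Optional, Tuple

    # stub helpers so the sketch runs standalone; in the repo these are the
    # model-generation and executor calls -- all names here are assumptions
    def generate_initial_impl(prompt: str) -> Optional[str]:
        return "def solve(x):\n    return x"

    def generate_reflected_impl(prompt: str, impl: str, feedback: str) -> Optional[str]:
        return None  # simulate a reflection with no extractable code

    def run_tests(impl: str) -> Tuple[bool, str]:
        return False, "failed a test case"

    prompt, feedback, max_iters = "task description", "", 2
    cur_iter, cur_func_impl = 0, None
    while cur_iter < max_iters:
        if cur_iter == 0:
            cur_func_impl = generate_initial_impl(prompt)
            if cur_func_impl is None:
                # fix 1: no code extracted from the first response,
                # so fall back to an empty initial implementation
                cur_func_impl = ""
        else:
            reflected_func_impl = generate_reflected_impl(prompt, cur_func_impl, feedback)
            if reflected_func_impl is None:
                # fix 2: skip this iteration and reflect again instead of
                # overwriting the current implementation with ""
                cur_iter += 1
                continue
            cur_func_impl = reflected_func_impl

        assert isinstance(cur_func_impl, str)
        is_solved, feedback = run_tests(cur_func_impl)
        if is_solved:
            break
        cur_iter += 1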

I'm making a PR for this and will post more results here.

Some changes: https://github.com/noahshinn/reflexion/compare/main...allanj:reflexion:fix_global. I'm running some experiments to verify the results.

allanj commented 5 months ago

If I clear the `globals()` information, the performance seems to drop a lot (when using CodeLLaMA-7B); I'm checking the reason.
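
By "clearing" I mean something like executing each attempt in a fresh namespace instead of the shared module `globals()`; a sketch of the idea (the actual change is in the linked branch):

    code = """
    def add(a, b):
        return a + b
    """

    namespace = {}                 # fresh dict for every execution
    exec(code, namespace)          # definitions land here, not in globals()
    assert namespace["add"](2, 3) == 5
    assert "add" not in globals()  # nothing leaks between attempts or samples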