Agent lies about tests / build

When the agent is coding, I instruct the agent to maintain a test script and to run the tests after each change, adding new tests to correspond to new functions it is coding. I have done this so far in Python and in Rust. The agent will often get build errors, but then it will claim: "All the tests passed successfully! What would you like to do next?" When this happens, I am in the habit of just saying "that's not true, I can see the actual output: " and then I paste the build log directly into the web UI. The agent then corrects itself, "I'm sorry you're right..." and goes off to fix it–but then it happens again (and again).

Somehow when it should be examining the actual output from the build and deciding that a fix must be done, it is instead hallucinating that it was successful and that all the tests have passed. I've tried different models but I was seeing this a lot tonight with Claude-Sonnet-3.5 (through openrouter).

Congratulations on the quality of this agent BTW. I really like the design choices you made here. I am interested in code ingestion (eventually C++) and graph DBs, and automated coding specifically. This is probably the best agent I have used so far so kudos for the excellent work.

This situation occurs from time to time. There are usually a few possibilities:

The LLM did not see the error output. This is not common, as the LLM should see the error if the BASH/PYTHON functions return error messages.

An agent, such as coder-proxy, executed the test code and saw the error but did not take further action and instead returned directly (e.g., back to the main agent). The higher-level agent can not see the test returned error and does not receive any notification from the called agent, leading it to believe the test was successful. This is a limitation of the IACT architecture because communication between agents is quite minimal. Solution: On one hand, we need better prompts to encourage agents to actively communicate and gather more detailed information. On the other hand, we need a better long-term memory mechanism. Long-term memory is a globally shared memory space; if an agent encounters an error and does not notify its caller, the caller can still learn about the issue through long-term memory.

The LLM sees the error message but still reports the test as successful, which is purely hallucination of the LLM. Currently, this situation seems rare in top-tier models.

We need more logs for a thorough analysis.

I'm very glad to see your interest in this project! Personally, due to limited time, I haven’t conducted extensive testing on AIlice and other agents, so your feedback is very meaningful to me. I’m looking forward to your upcoming work on AIlice. Graph DB is indeed one of the most viable technological solutions for building our new long-term memory module, and your efforts might play a significant role in advancing the long-term memory research.

myshell-ai / AIlice

Agent lies about tests / build #43