microsoft / autogen

A programming framework for agentic AI 🤖
https://microsoft.github.io/autogen/

[Bug]: CompressibleAgents read llm_config["model"] directly rather than from the config_list -- this is almost always the wrong thing to do #1073

Open afourney opened 8 months ago

afourney commented 8 months ago

Describe the bug

Throughout the code, CompressibleAgent assumes the model in use is llm_config["model"]. However, this is almost always wrong. Typically, the model is copied from the config_list before the OAI call, replacing llm_config["model"] (which is hardcoded to default to GPT-4): https://github.com/microsoft/autogen/blob/14a96720323b0452dbc5e146158d6e04b1186c6e/autogen/agentchat/conversable_agent.py#L46C1-L48C6

This bug was uncovered when trying to remove the (extremely dangerous) GPT-4 default in #1072

This means that compression is basically going to kick in after 8k tokens, regardless of what model is actually being used.

Fixing this bug will be exceptionally hard, since there's no way to know which model is being called in the config_list, and the choice is made call-to-call dynamically.
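
For illustration, here is a minimal sketch of the mismatch (the config values below are hypothetical):

```python
# Hypothetical config: llm_config["model"] defaults to "gpt-4", but the client
# picks the actual completion model from config_list at call time.
llm_config = {
    "model": "gpt-4",  # the hardcoded default that CompressibleAgent inspects
    "config_list": [
        {"model": "gpt-3.5-turbo", "api_key": "sk-..."},       # may actually be called
        {"model": "gpt-4-1106-preview", "api_key": "sk-..."},  # or this one
    ],
}

# What the agent effectively assumes today:
assumed_model = llm_config["model"]  # "gpt-4" -> an ~8k-token window is assumed,
# while the model chosen from config_list can have a much smaller or larger
# context window, so compression kicks in at the wrong threshold.
```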

Steps to reproduce

No response

Expected Behavior

No response

Screenshots and logs

No response

Additional Information

No response

afourney commented 8 months ago

@kevin666aa @sonichi @qingyun-wu @rickyloynd-microsoft I think this issue is exceptionally serious. I am not sure how to fix it, short of:

  1. Reading the config_list and picking the model with the smallest context window to be safe (see the sketch below), or
  2. Adding a max_context_length parameter that can be set separately

We cannot just read the model directly from the llm_config. It will almost always be wrong.
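
For option 1 (with option 2 as an explicit override), a minimal sketch under stated assumptions (the context-window table, the max_context_length key, and the helper name are all hypothetical, not an existing autogen API):

```python
# Hypothetical context-window table; real limits should come from a maintained source.
CONTEXT_WINDOWS = {
    "gpt-3.5-turbo": 4096,
    "gpt-4": 8192,
    "gpt-4-32k": 32768,
    "gpt-4-1106-preview": 128000,
}


def safe_context_limit(llm_config, default=4096):
    """Option 1: be safe by taking the smallest context window in the config_list.

    Option 2: honor an explicit llm_config["max_context_length"] if the user
    sets one (that key is a proposal here, not an existing option).
    """
    if "max_context_length" in llm_config:  # option 2: explicit override
        return llm_config["max_context_length"]
    config_list = llm_config.get("config_list") or [llm_config]
    limits = [CONTEXT_WINDOWS.get(c.get("model"), default) for c in config_list]
    return min(limits) if limits else default
```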

yiranwu0 commented 8 months ago

About compressible agent, maybe we should allow passing an extra llm_config to compress_config? This is done in #1004 and I can make a quick PR to get it fixed.

Fixing this bug will be exceptionally hard, since there's no way to know which model is being called in the config_list, and the choice is made call-to-call dynamically.

Regarding calling the models, in #847 I was trying to ensure that the model actually called matches the specified model. But this is based on a strong assumption: when a user passes in both a config_list and a model, they only want to use that specified model from the config_list, and anything else results in an error. This assumption might be unjustified, as pointed out by @sonichi.

I am thinking about whether it is really necessary to have several models in a config_list for users to call from. For example, using gpt-4-0613 and gpt-4-1106. These models could have quite different behaviors, and normally I wouldn't want gpt-4-1106 as a backup for gpt-4-0613. Thus I would only want to pass in one model in the config_list. Are there any other scenarios where we need automatic switching of models?

afourney commented 8 months ago

I agree that we likely want to pass a number of different models for different purposes. I do this in the WebSurferAgent:

https://github.com/microsoft/autogen/blob/97555f16992240388e634b878138b550b76d72ae/autogen/agentchat/contrib/web_surfer.py#L37C1-L38C1

We probably need a standard way of doing this.
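
For reference, the pattern looks roughly like this (the parameter names follow the WebSurferAgent constructor linked above, but treat the exact names and values as illustrative):

```python
# Two configs for two purposes: a stronger model for the main chat and a
# cheaper, larger-context model for page summarization.
config_list_main = [{"model": "gpt-4", "api_key": "sk-..."}]
config_list_summarizer = [{"model": "gpt-3.5-turbo-16k", "api_key": "sk-..."}]

llm_config = {"config_list": config_list_main}
summarizer_llm_config = {"config_list": config_list_summarizer}
# WebSurferAgent takes both llm_config and summarizer_llm_config; a standard
# convention would let any agent accept purpose-specific configs like this.
```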

But the problem remains that the CompressibleAgent is making an extremely strong, unjustified assumption: that users either aren't setting a config_list, or are setting both the config_list AND llm_config["model"]. We don't do this anywhere else in the codebase, it's undocumented, and even some test cases rely on little-known, dangerous defaults.

A stopgap measure right now is to collect all the models in the config_list, take the minimum of their context window sizes, and use that.

afourney commented 8 months ago

I am thinking about whether it is really necessary to have several models in a config_list for users to call from. For example, using gpt-4-0613 and gpt-4-1106. These models could have quite different behaviors, and normally I wouldn't want gpt-4-1106 as a backup for gpt-4-0613. Thus I would only want to pass in one model in the config_list. Are there any other scenarios where we need automatic switching of models?

Maybe we need to work on #970?

yiranwu0 commented 8 months ago

A stopgap measure right now is to collect all the models in the config_list, take the minimum of their context window sizes, and use that.

I've been trying to resolve this, but there is a problem. Since compressible_agent checks the token count before the completion, we need to know which model will be used so that we can select the corresponding tokenizer for the count. But we don't know that yet, due to the dynamic model selection in the client. I guess one way around this is to tell the user that we are assuming the completion model is the one with the min context window size.
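
A minimal sketch of that conservative check, assuming tiktoken is used for counting (autogen's own token-count utilities differ in detail) and assuming the tokenizer of the smallest-window model in the config_list is the one we pick:

```python
import tiktoken


def count_tokens_conservatively(messages, assumed_model="gpt-3.5-turbo"):
    """Approximate token count for [{'role': ..., 'content': ...}] messages,
    using the tokenizer of an assumed model (e.g. the smallest-window one)."""
    try:
        enc = tiktoken.encoding_for_model(assumed_model)
    except KeyError:
        enc = tiktoken.get_encoding("cl100k_base")  # fallback for unknown models
    # Ignores the few per-message overhead tokens; good enough for a threshold check.
    return sum(len(enc.encode(m.get("content") or "")) for m in messages)
```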

afourney commented 8 months ago

Yes, exactly. We have to use the smallest window in the list or else risk overflowing. I think that's fine for now.

sonichi commented 8 months ago

Can we trigger the compression only when the context overflow happens for all the configs in the list? And the compression function needs to have a correct abstraction to handle the config_list. @kevin666aa

yiranwu0 commented 8 months ago

Can we trigger the compression only when the context overflow happens for all the configs in the list? And the compression function needs to have a correct abstraction to handle the config_list. @kevin666aa

Yes, so we can just take the max context window of the models in the config_list instead of the min? As long as one model from the config_list works, there will be no context overflow error.
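
A sketch of that max-based trigger (window_of is a hypothetical helper mapping a model name to its context window size):

```python
def should_compress(current_tokens, config_list, window_of):
    """Trigger compression only when the prompt would overflow every model
    in the config_list, i.e. when it exceeds the largest available window."""
    largest_window = max(window_of(c["model"]) for c in config_list)
    return current_tokens >= largest_window
```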

sonichi commented 8 months ago

That's one possible implementation. It's important to make the abstraction right and provide flexibility to easily change the implementation without changing the abstraction.

afourney commented 8 months ago

Can we trigger the compression only when the context overflow happens for all the configs in the list? And the compression function needs to have a correct abstraction to handle the config_list. @kevin666aa

Yes, so we can just take the max context window of the models in the config_list instead of the min? As long as one model from the config_list works, there will be no context overflow error.

There are many reasons a model may fail (e.g., a timeout error). The min will work with every model and will uphold one of the intentions of the config_list (redundancy), whereas the max will only work with perhaps a single model, removing that redundancy. It's also potentially very inefficient, since it will basically make a call to OpenAI for every model as it works its way down the list, knowing full well that those calls will be rejected. Finally, it's probably a recipe for incurring unexpected costs, since it will prefer keeping more in the context window even when smaller models were specified (and more tokens generally means more $).

yiranwu0 commented 8 months ago

Got it. Maybe we can do the temporary fix of using the model with the minimum context window size. Then we can switch to using the max window size, but iterate over all the models, check the token count, and find those with a large enough context limit (without calling them and raising a context-overflow error). What do you think?
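
A sketch of that filtering step (again with a hypothetical window_of helper):

```python
def usable_configs(config_list, current_tokens, window_of):
    """Keep only the config_list entries whose context window can still fit the prompt."""
    return [c for c in config_list if window_of(c["model"]) > current_tokens]

# Compress (or raise) only when usable_configs(...) comes back empty; otherwise
# the client can restrict its call-time selection to the surviving entries.
```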

sonichi commented 8 months ago

I believe that we should rethink how we handle compression in a more principled way. It appears to me that this experiment has not been very successful so far, and the cost/benefit tradeoff of continuing without a convincing design looks unpromising. The work under contrib/ is supposed to quickly demonstrate a PoC and, over time, contribute to the core. We should fail fast if it doesn't work. We can take the lessons learned and apply them to new work.

@rickyloynd-microsoft has some good ideas in #1091. Maybe a similar, simple approach can be used to handle the compression. In either case, let's discuss the overall design first before executing this time.

afourney commented 8 months ago

@ekzhu and I discussed this earlier this week, and also at the Friday SCRUM meeting. I think we want some abstract data structure for storing message histories, similar to the current chat_messages dictionary, with the expectation that we can always losslessly recover every message. But we can also impose different "views" on this message store, and through those views retrieve versions of that history subject to different constraints or compression strategies, like a sliding window, summarization, compression, etc. If we properly design this interface, it could be used not just for agent conversations but anywhere we might be calling the LLMs (orchestration, etc.), and the same agent could use different views for different purposes.
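
A rough sketch of the idea (all names here are illustrative, not an autogen API):

```python
from typing import Callable, Dict, List

Message = Dict[str, str]                       # e.g. {"role": "user", "content": "..."}
View = Callable[[List[Message]], List[Message]]


class MessageStore:
    """Lossless message history plus pluggable, possibly lossy 'views' of it."""

    def __init__(self) -> None:
        self._messages: List[Message] = []     # full, lossless history

    def append(self, message: Message) -> None:
        self._messages.append(message)

    def all(self) -> List[Message]:
        return list(self._messages)            # every message is always recoverable

    def view(self, transform: View) -> List[Message]:
        return transform(self.all())           # derived view for a specific caller


# Example view: a sliding window keeping only the last n messages.
def sliding_window(n: int) -> View:
    return lambda msgs: msgs[-n:]

# store.view(sliding_window(10)) feeds the LLM call, while store.all() still
# returns the complete history; summarization or compression views plug in the
# same way.
```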

ekzhu commented 8 months ago

@ekzhu and I discussed this earlier this week, and also at the Friday SCRUM meeting. I think we want some abstract data structure for storing message histories, similar to the current chat_messages dictionary, with the expectation that we can always losslessly recover every message. But we can also impose different "views" on this message store, and through those views retrieve versions of that history subject to different constraints or compression strategies, like a sliding window, summarization, compression, etc. If we properly design this interface, it could be used not just for agent conversations but anywhere we might be calling the LLMs (orchestration, etc.), and the same agent could use different views for different purposes.

I believe this can be done as an incremental patch without breaking existing functionality, yet enable more functionality as described.