waleedkadous / ansari-backend

Ansari is a helper for you to become a better Muslim

Intelligent switching to larger context window version of GPT-4 #5

Closed waleedkadous closed 7 months ago

waleedkadous commented 10 months ago

Difficulty: Easy. Est. time: 4 hours.

GPT-4 has two context windows available: 8K tokens and 32K tokens. The vast majority of conversations on Ansari are < 8K tokens, so it doesn't make sense to pay double the per-token price for capacity that mostly goes unused.

Currently, Ansari crashes if the context exceeds 8K tokens.

This modification would use the 8K model at the beginning of conversations and switch to GPT-4 with the 32K context window once the conversation grew long enough to need it. If the content exceeded even 32K, it would then delete or summarize earlier conversation history.
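A minimal sketch of the intended behavior (the function names and the trim-oldest-first policy here are illustrative, not from the Ansari codebase):

    def choose_model(token_count: int) -> str:
        # Use the cheaper 8K model until the conversation outgrows it.
        return "gpt-4" if token_count <= 8_000 else "gpt-4-32k"

    def trim_history(messages: list, count_tokens, limit: int = 32_000) -> list:
        # Once even the 32K window is exceeded, drop the oldest turns.
        # Summarizing them instead would be the gentler variant.
        while count_tokens(messages) > limit and len(messages) > 1:
            messages.pop(0)
        return messages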

younos-anaga commented 9 months ago

I took a look. This will require a few changes:

  1. To know how many tokens the conversation has used so far, we need to read the token usage in the response from GPT-4. We can use LangChain's OpenAICallbackHandler. To do this, LangchainChatAgent needs to be updated to allow adding more callbacks: for example, always include the StreamingCBH, and add a method that subclasses can override to create more callbacks:

    # list.extend() returns None, so concatenate instead:
    'callbacks': [self.StreamingCBH(myq)] + self.create_callback_handlers()
  2. While we are at it, also update LangchainChatAgent so that the llm is a @property, not a directly accessed member. I also think it should be abstract, since there is no reasonable default for it (possibly making the whole class an ABC).

  3. In AnsariLangchain, do two things (see the sketch after this list):
     3.a. Override the new create_callback_handlers to add the OpenAICallbackHandler and also store it in a private member, _last_prediction_openai_callback_handler.
     3.b. Implement the new llm property to check _last_prediction_openai_callback_handler.total_tokens. If it exceeds a certain threshold (7000?), return an instance of ChatOpenAI(temperature=0, model_name="gpt-4-32k", streaming=True); otherwise return the 8K instance (both instances of ChatOpenAI would be created in the init).
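Putting the three changes together, a minimal sketch (the class and method names above come from this thread; everything else, including the process_input wiring and the StreamingCBH stub, is an assumption about the surrounding code):

    from abc import ABC, abstractmethod

    from langchain.callbacks import OpenAICallbackHandler
    from langchain.callbacks.base import BaseCallbackHandler
    from langchain.chat_models import ChatOpenAI
    from langchain.schema import HumanMessage

    TOKEN_THRESHOLD = 7000  # switch to the 32K model past this point

    class LangchainChatAgent(ABC):
        class StreamingCBH(BaseCallbackHandler):
            # Existing streaming callback; body elided in this sketch.
            def __init__(self, q):
                self.q = q

        def create_callback_handlers(self):
            # Change 1: subclasses override this to register extra callbacks.
            return []

        @property
        @abstractmethod
        def llm(self):
            # Change 2: abstract property, since there is no reasonable default.
            ...

        def process_input(self, text, myq):  # hypothetical entry point
            callbacks = [self.StreamingCBH(myq)] + self.create_callback_handlers()
            return self.llm([HumanMessage(content=text)], callbacks=callbacks)

    class AnsariLangchain(LangchainChatAgent):
        def __init__(self):
            # Both instances are created up front, as suggested in 3.b.
            self._llm_8k = ChatOpenAI(temperature=0, model_name="gpt-4", streaming=True)
            self._llm_32k = ChatOpenAI(temperature=0, model_name="gpt-4-32k", streaming=True)
            self._last_prediction_openai_callback_handler = None

        def create_callback_handlers(self):
            # 3.a: track token usage via the counts OpenAI reports back.
            self._last_prediction_openai_callback_handler = OpenAICallbackHandler()
            return [self._last_prediction_openai_callback_handler]

        @property
        def llm(self):
            # 3.b: pick the model based on tokens used so far.
            handler = self._last_prediction_openai_callback_handler
            if handler is not None and handler.total_tokens > TOKEN_THRESHOLD:
                return self._llm_32k
            return self._llm_8k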

What do you think?

waleedkadous commented 9 months ago

This seems like a very reasonable way to do it. (I've been working on Hermetic in a separate repo, but I can definitely modify it -- feel free to send me PRs to either repo.)

The only design change I would consider is whether, instead of relying on the callbacks from OpenAI, we could use the tiktoken library (https://github.com/openai/tiktoken -- also from OpenAI). That way we could avoid all the complex piping of callbacks and just compute the token count in the extension loop with a few calls. I've found the count is only approximately the same as the API's, but it's good enough for this type of work.
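For example, the count can be computed locally in a few lines (a sketch; the message format is an assumption):

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4")

    def count_tokens(messages):
        # Approximate: ignores the few per-message overhead tokens the
        # API adds, but close enough for choosing a context window.
        return sum(len(enc.encode(m["content"])) for m in messages)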

Either way, sounds good! When can you start? :)

younos-anaga commented 9 months ago

Thanks for pointing out tiktoken. So, to make sure I get what you mean: we will tokenize locally to estimate which API to call, then send the untokenized input to the API, right?
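Something like this, I imagine (a sketch; the 7000 threshold and the ChatOpenAI wiring are assumptions):

    import tiktoken
    from langchain.chat_models import ChatOpenAI
    from langchain.schema import HumanMessage

    enc = tiktoken.encoding_for_model("gpt-4")

    def answer(prompt: str) -> str:
        # Tokenize locally only to decide which model to call...
        model = "gpt-4-32k" if len(enc.encode(prompt)) > 7000 else "gpt-4"
        llm = ChatOpenAI(temperature=0, model_name=model, streaming=True)
        # ...then send the plain, untokenized text to the API.
        return llm([HumanMessage(content=prompt)]).content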

I already started :) I just do bits of contributing when there are gaps in the day-to-day work.


younos-anaga commented 9 months ago

I made a couple of PRs, which are still untested. Please let me know how to run them for testing, and let me know if you have any comments:

https://github.com/anyscale/hermetic/pull/1 https://github.com/waleedkadous/ansari/pull/10

PTAL. Thanks!

younos-anaga commented 9 months ago

I just found out about https://python.langchain.com/docs/modules/memory/types/token_buffer. Do you prefer using more commonly used OSS components?
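For reference, that component keeps the history trimmed to a token budget automatically (a sketch; the 7000-token limit is an assumption):

    from langchain.chat_models import ChatOpenAI
    from langchain.memory import ConversationTokenBufferMemory

    llm = ChatOpenAI(temperature=0, model_name="gpt-4", streaming=True)
    memory = ConversationTokenBufferMemory(llm=llm, max_token_limit=7000)

    # The oldest turns are pruned once the buffer exceeds max_token_limit.
    memory.save_context({"input": "Assalamu alaikum"}, {"output": "Wa alaikum assalam!"})
    print(memory.load_memory_variables({}))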

waleedkadous commented 7 months ago

OpenAI just released GPT-4-preview, which has a 128,000-token window. Since this makes the bug much less of an issue, I am closing it for now.