Closed waleedkadous closed 7 months ago
I took a look. This will require a few changes:

1. To know how many tokens the conversation has used so far, we need to read the `token_usage` in the response from GPT-4. We can use LangChain's `OpenAICallbackHandler`. To do this, `LangchainChatAgent` needs to be updated to allow adding more callbacks. For example, always include the `StreamingCBH` and add a method that subclasses can override to create more callbacks:

   `'callbacks': [self.StreamingCBH(myq)] + self.create_callback_handlers()`

   (Note: `list.extend()` returns `None`, so the lists have to be concatenated rather than extended inline.)
2. While we are at it, also update `LangchainChatAgent` so that `llm` is a `@property`, not a directly accessed member. I also think it should be abstract, since there is no reasonable default for it (possibly making the whole class an ABC).
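The base-class changes above might look like this minimal sketch (the `StreamingCBH` stub and the `build_callbacks` helper name are assumptions for illustration, not the actual Ansari code):

```python
from abc import ABC, abstractmethod

class StreamingCBH:
    """Placeholder for the existing streaming callback handler."""
    def __init__(self, queue):
        self.queue = queue

class LangchainChatAgent(ABC):
    def create_callback_handlers(self):
        # Hook for subclasses; the default contributes no extra callbacks.
        return []

    @property
    @abstractmethod
    def llm(self):
        """Subclasses must supply the LLM; there is no reasonable default."""

    def build_callbacks(self, myq):
        # list.extend() returns None, so concatenate the lists instead.
        return [StreamingCBH(myq)] + self.create_callback_handlers()
```

Making `llm` an abstract property means the base class cannot be instantiated until a subclass provides it.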
3. In `AnsariLangchain`, do two things:

   3.a. Override the new `create_callback_handlers` to add the `OpenAICallbackHandler`, and also store it in a private member `_last_prediction_openai_callback_handler`.

   3.b. Implement the new `llm` property to check `_last_prediction_openai_callback_handler.total_tokens`. If that count exceeds a certain threshold (7000?), return an instance of `ChatOpenAI(temperature=0, model_name="gpt-4-32k", streaming=True)` (two instances of `ChatOpenAI` will also be created in `__init__`).
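A minimal sketch of 3.a and 3.b; the stand-in `ChatOpenAI` and `OpenAICallbackHandler` classes here replace the real LangChain ones so the example is self-contained, and the 7000 threshold is the one floated above:

```python
TOKEN_THRESHOLD = 7000  # assumed cutoff for switching to the 32K model

class OpenAICallbackHandler:
    """Stand-in: LangChain's handler accumulates total_tokens per run."""
    def __init__(self):
        self.total_tokens = 0

class ChatOpenAI:
    """Stand-in for LangChain's ChatOpenAI."""
    def __init__(self, temperature, model_name, streaming):
        self.model_name = model_name

class AnsariLangchain:
    def __init__(self):
        # Two instances created up front, as proposed.
        self._llm_8k = ChatOpenAI(temperature=0, model_name="gpt-4", streaming=True)
        self._llm_32k = ChatOpenAI(temperature=0, model_name="gpt-4-32k", streaming=True)
        self._last_prediction_openai_callback_handler = OpenAICallbackHandler()

    # 3.a: contribute the token-counting callback and remember it.
    def create_callback_handlers(self):
        self._last_prediction_openai_callback_handler = OpenAICallbackHandler()
        return [self._last_prediction_openai_callback_handler]

    # 3.b: pick the model based on tokens used by the last prediction.
    @property
    def llm(self):
        if self._last_prediction_openai_callback_handler.total_tokens > TOKEN_THRESHOLD:
            return self._llm_32k
        return self._llm_8k
```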
What do you think?
This seems like a very reasonable way to do it (I've been working on Hermetic in a separate repo but I can definitely modify -- feel free to send me PRs to either repo).
Only design change I would consider is whether, instead of relying on the callbacks from OpenAI, we could use the tiktoken library (https://github.com/openai/tiktoken -- also from OpenAI). This way we could avoid all the complex piping of callbacks and just compute the token count in the extension loop in a few calls. I've found it's only approximately the same count, but it's good enough for this type of work.
Either way, sounds good! When can you start? :)
Thanks for pointing out Tiktoken. So to make sure I get what you mean, we will tokenize locally to estimate which API to call, then send the untokenized input to the API, right?
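That flow could look something like this sketch, assuming tiktoken is installed (`pip install tiktoken`); the function names and the 7000 threshold are illustrative, not from the actual code:

```python
def count_tokens(text, encoder=None):
    # Lazily load the GPT-4 encoder unless one is injected (handy for tests).
    if encoder is None:
        import tiktoken
        encoder = tiktoken.encoding_for_model("gpt-4")
    return len(encoder.encode(text))

def pick_model(conversation_text, threshold=7000, encoder=None):
    # A rough local estimate is fine: we only need to know which side of the
    # threshold we are on, not the exact billed token count. The raw,
    # untokenized text is what actually gets sent to the chosen API.
    if count_tokens(conversation_text, encoder) > threshold:
        return "gpt-4-32k"
    return "gpt-4"
```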
I already started :) I just do bits of contributing when there are gaps in the day-to-day work
I made a couple of PRs, which are still not tested. Please let me know how to run them for testing, and let me know if you have any comments:
https://github.com/anyscale/hermetic/pull/1 https://github.com/waleedkadous/ansari/pull/10
PTAL. Thanks!
I just found out about https://python.langchain.com/docs/modules/memory/types/token_buffer Do you prefer using more commonly used OSS components?
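For reference, the core idea behind that component -- prune the oldest messages until the buffer fits under a token limit -- can be sketched in a few lines (this toy counts whitespace-split words instead of real tokens):

```python
def trim_to_token_limit(messages, max_tokens, count=lambda m: len(m.split())):
    """Keep the most recent messages whose combined count fits under max_tokens."""
    kept, total = [], 0
    # Walk newest-to-oldest, keeping messages while they still fit.
    for msg in reversed(messages):
        total += count(msg)
        if total > max_tokens:
            break
        kept.append(msg)
    return list(reversed(kept))
```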
OpenAI just released GPT-4-preview which has a 128,000 token window. Since this makes this bug much less of an issue, I am closing this for now.
Difficulty: Easy Est time: 4 hours.
GPT-4 has two context windows available: 8K tokens and 32K tokens. The vast majority of conversations on Ansari are < 8K, so it doesn't make sense to pay double the price for unused features.
Currently Ansari crashes if the context exceeds 8K tokens.
This modification would switch to GPT-4 with the 32K context window when the conversation has gotten long enough to need it, but use 8K at the beginning of conversations. When the content exceeded 32K, it would then delete or summarize earlier conversation history.