Open mw19930312 opened 9 months ago
🤖
Hey @mw19930312, great to see you back diving into the depths of LlamaIndex! 🦙
Based on the information you've provided, it seems like you're using the `TokenCountingHandler` correctly. However, there are a few things that could be causing the `total_llm_token_count` to remain zero.
Firstly, the `on_event_end` method in the `TokenCountingHandler` is responsible for updating the `llm_token_counts` list, which is then used to calculate the `total_llm_token_count`. This method checks that the event type is `CBEventType.LLM`, that the event type is not in the `event_ends_to_ignore` list, and that the payload is not `None`. If all of these conditions are met, it calls the `get_llm_token_counts` function with the `token_counter` and `payload` as arguments and appends the result to the `llm_token_counts` list. If any of these conditions is not met, the `llm_token_counts` list will not be updated, and the `total_llm_token_count` will remain zero. You can find more details about this in the source code.
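A quick way to see whether any LLM events are actually reaching the handler is to inspect the counts directly after a call, using the `token_counter` from your snippet (the comments are just for illustration):

```python
# All of these values are derived from the llm_token_counts list described above.
print(len(token_counter.llm_token_counts))   # 0 means no LLM event was recorded
print(token_counter.total_llm_token_count)   # sum of prompt + completion tokens across events
print(token_counter.prompt_llm_token_count)
print(token_counter.completion_llm_token_count)
```

If `llm_token_counts` is empty, no `CBEventType.LLM` event has been recorded by the handler, which matches the conditions described above.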
Secondly, there have been changes in the way `TokenCountingHandler` works between versions 0.9.37.post1 and 0.9.45.post1 of LlamaIndex. In the newer version, token counting has moved into a callback, and you now have more options for how tokens are counted, the lifetime of the token counts, and even creating separate token counters for different indexes. You can find more details about this in the migration guide.
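For reference, a minimal sketch of the callback-based setup on a 0.9.x install, assuming you attach the handler through a global `ServiceContext` (adjust the import paths if your version differs):

```python
import tiktoken
from llama_index import ServiceContext, set_global_service_context
from llama_index.callbacks import CallbackManager, TokenCountingHandler

# One handler shared by everything that uses the global service context.
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-4").encode
)
set_global_service_context(
    ServiceContext.from_defaults(callback_manager=CallbackManager([token_counter]))
)
```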
Lastly, there have been similar issues reported in the past where the `TokenCountingHandler` was not working as expected. In one case, the issue was due to the `TokenCountingHandler` instance being overwritten, which led to incorrect results. The user resolved the issue by saving the instance of `TokenCountingHandler` in a session state object. You can find more details about this in issue #10482 and issue #9857.
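If your app re-runs its setup code on every interaction (the session-state workaround suggests a Streamlit app, though that's an assumption here), the fix could look roughly like this:

```python
import streamlit as st
import tiktoken
from llama_index.callbacks import CallbackManager, TokenCountingHandler

# Create the handler once and reuse it across reruns, instead of
# re-instantiating it (and losing its accumulated counts) each time.
if "token_counter" not in st.session_state:
    st.session_state.token_counter = TokenCountingHandler(
        tokenizer=tiktoken.encoding_for_model("gpt-4").encode
    )
token_counter = st.session_state.token_counter
callback_manager = CallbackManager([token_counter])
```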
I hope this helps! If you're still having issues, could you provide more details about your setup and how you're using the `TokenCountingHandler`? This will help me better understand the issue and provide a more accurate solution.
For the MM (multi-modal) model, we need to add support for `CallbackManager`:
https://github.com/run-llama/llama_index/blob/main/llama_index/multi_modal_llms/base.py#L154-L166
@mw19930312 feel free to open a PR adding this new feature.
I think you can follow https://github.com/run-llama/llama_index/blob/main/llama_index/llms/base.py#L31-L280 to add it.
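For readers who haven't seen that file: the LLM base class wraps `complete`/`chat` with decorators that emit `CBEventType.LLM` events through the attached callback manager, which is what `TokenCountingHandler` listens for. A hypothetical sketch of the same idea for the multi-modal `complete` path might look like this (the decorator name and payload keys are illustrative, not the actual implementation):

```python
from functools import wraps

from llama_index.callbacks.schema import CBEventType, EventPayload


def mm_completion_callback(fn):
    """Emit an LLM event around a multi-modal complete() call (sketch only)."""

    @wraps(fn)
    def wrapper(self, prompt, image_documents, **kwargs):
        with self.callback_manager.event(
            CBEventType.LLM, payload={EventPayload.PROMPT: prompt}
        ) as event:
            response = fn(self, prompt, image_documents, **kwargs)
            # TokenCountingHandler reads the prompt/completion from this payload.
            event.on_end(payload={EventPayload.COMPLETION: response})
        return response

    return wrapper
```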
Thanks for the reply @hatianzhang! Are you suggesting that we need to maintain the counter by ourselves? Is there any timeline for llama_index to officially support it?
I mean we should add this feature to the base class in the llama_index abstraction.
I see. Thanks for the explanation! Would you mind elaborating a bit on how I should change my code? I'm having trouble following the link you sent because I'm new to llama_index.
it would probably be easier if I make a pr from my branch for this. could you help verify it once the pr is there, by checking out my branch/pr and running your example against it?
thanks
Sure! Let me know when your pr is ready and I can test it on my side. Is there an expected timeline for your change?
@hatianzhang ^^
@mw19930312 sorry I am pretty busy with other tasks atm. will update asap
I am not sure if I am doing something wrong, but it seems it is broken in some other cases:
- Sync calls with `CondenseQuestionChatEngine`
- Async calls with `CondenseQuestionChatEngine`, `CondensePlusContextChatEngine`, and `ContextChatEngine`

All those calls are used with a streaming response, when possible.
I have the same code setting up the sync and async versions, so the fact that there are differences in results seems to suggest there are some issues linked to LlamaIndex.
For `CondenseQuestionChatEngine`, it seems it could be caused by conflicting use of `service_context` and `Settings.callback_manager` during initialization. But the underlying `llm` attribute has its callback manager set, so I don't understand why it is not used during generation.
EDIT: passing `service_context` again in any of the non-working calls above resolves the issue.
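For anyone hitting the same thing, a minimal sketch of that workaround on a 0.9.x-style install (the import paths, and the pre-existing `index` and `token_counter` objects, are assumptions for illustration):

```python
from llama_index import ServiceContext
from llama_index.callbacks import CallbackManager
from llama_index.chat_engine import CondenseQuestionChatEngine

# Carry the callback manager explicitly instead of relying on global settings.
service_context = ServiceContext.from_defaults(
    callback_manager=CallbackManager([token_counter])
)
chat_engine = CondenseQuestionChatEngine.from_defaults(
    query_engine=index.as_query_engine(service_context=service_context),
    service_context=service_context,
)
```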
bump
Bug Description
I'm trying to calculate the number of tokens consumed by my GPT-4 Vision call using the following code. However, the `total_llm_token_count` is always zero. Can anyone tell me how to configure the counter for GPT-4 Vision?
Version
0.9.37.post1
Steps to Reproduce
```python
import tiktoken
from llama_index.callbacks import CallbackManager, TokenCountingHandler
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.schema import ImageDocument

# `prompt`, `encoded_image`, and `settings` are defined elsewhere in my app.
document = ImageDocument(text="", image=encoded_image)
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-4-vision-preview").encode
)
llm = OpenAIMultiModal(
    model="gpt-4-vision-preview",
    api_key=settings.OPENAI_API_KEY,
    max_new_tokens=300,
    additional_kwargs={"seed": 42},
    callback_manager=CallbackManager([token_counter]),
)
response = llm.complete(prompt, image_documents=[document])
print(token_counter.total_llm_token_count)  # always prints 0
```
Relevant Logs/Tracebacks
No response