run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Token counter for GPT4 vision remains 0 when calling OpenAIMultiModal models #10533

Open mw19930312 opened 9 months ago

mw19930312 commented 9 months ago

Bug Description

I'm trying to calculate the number of tokens consumed by my GPT-4 Vision call using the code below. However, `total_llm_token_count` is always zero. Can anyone tell me how to configure the counter for GPT-4 Vision?

Version

0.9.37.post1

Steps to Reproduce

```python
import tiktoken
from llama_index.callbacks import CallbackManager, TokenCountingHandler
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.schema import ImageDocument

# `encoded_image`, `prompt`, and `settings` are defined elsewhere in my app.
document = ImageDocument(text="", image=encoded_image)
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-4-vision-preview").encode
)
llm = OpenAIMultiModal(
    model="gpt-4-vision-preview",
    api_key=settings.OPENAI_API_KEY,
    max_new_tokens=300,
    additional_kwargs={"seed": 42},
    callback_manager=CallbackManager([token_counter]),
)
response = llm.complete(prompt, image_documents=[document])
print(token_counter.total_llm_token_count)
```

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 9 months ago

🤖

Hey @mw19930312, great to see you back diving into the depths of LlamaIndex! 🦙

Based on the information you've provided, it seems like you're using the TokenCountingHandler correctly. However, there are a few things that could be causing the total_llm_token_count to remain zero.

Firstly, the on_event_end method in the TokenCountingHandler is responsible for updating the llm_token_counts list, which is then used to calculate the total_llm_token_count. This method checks if the event type is CBEventType.LLM and if the event type is not in the event_ends_to_ignore list. If these conditions are met and the payload is not None, it calls the get_llm_token_counts function with the token_counter and payload as arguments. The result is then appended to the llm_token_counts list. If these conditions are not met, the llm_token_counts list will not be updated, and the total_llm_token_count will remain zero. You can find more details about this in the source code.
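As a rough illustration of that gating logic, here is a simplified sketch based on the description above (not the library's exact source; the import paths and the `_token_counter` attribute name are assumptions):

```python
from llama_index.callbacks.schema import CBEventType
from llama_index.callbacks.token_counting import get_llm_token_counts


# Simplified sketch: an event only contributes to llm_token_counts when it is
# an LLM event, is not in the ignore list, and carries a payload.
def on_event_end_sketch(handler, event_type, payload=None, event_id=""):
    if (
        event_type == CBEventType.LLM
        and event_type not in handler.event_ends_to_ignore
        and payload is not None
    ):
        handler.llm_token_counts.append(
            get_llm_token_counts(handler._token_counter, payload)
        )
    # If the multimodal LLM never emits a CBEventType.LLM event (which is what
    # is reported here), this branch never runs and total_llm_token_count
    # stays at 0.
```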

Secondly, there have been changes in the way TokenCountingHandler works between the versions 0.9.37.post1 and 0.9.45.post1 of LlamaIndex. In the newer version, token counting has moved into a callback, and you now have more options for how tokens are counted, the lifetime of the token counts, and even creating separate token counters for different indexes. You can find more details about this in the migration guide.
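For reference, a minimal sketch of the callback-based setup the migration guide describes (import paths and the global service-context registration assume a 0.9.x-era install; this is one way to attach the handler, not the only one):

```python
import tiktoken
from llama_index import ServiceContext, set_global_service_context
from llama_index.callbacks import CallbackManager, TokenCountingHandler

# Create one handler and register it globally so every component that honours
# the callback manager reports into the same counters.
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-4").encode
)
service_context = ServiceContext.from_defaults(
    callback_manager=CallbackManager([token_counter])
)
set_global_service_context(service_context)

# After running queries or completions, inspect the accumulated counts.
print(token_counter.prompt_llm_token_count)
print(token_counter.completion_llm_token_count)
print(token_counter.total_llm_token_count)
token_counter.reset_counts()  # clear the counters between runs if needed
```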

Lastly, there have been similar issues reported in the past where the TokenCountingHandler was not working as expected. In one case, the issue was due to the overwriting of the TokenCountingHandler instance, which led to incorrect results. The user resolved the issue by saving the instance of TokenCountingHandler in a session state object. You can find more details about this in the issue #10482 and issue #9857.
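If the handler is being recreated on every rerun (the "session state" fix in those issues suggests a Streamlit app, which is an assumption here), a sketch of keeping a single instance alive looks like this (the `"token_counter"` key is just an illustrative name):

```python
import streamlit as st
import tiktoken
from llama_index.callbacks import TokenCountingHandler

# Create the handler once per session and reuse it, instead of instantiating
# a fresh (zeroed) counter on every script rerun.
if "token_counter" not in st.session_state:
    st.session_state.token_counter = TokenCountingHandler(
        tokenizer=tiktoken.encoding_for_model("gpt-4").encode
    )
token_counter = st.session_state.token_counter
```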

I hope this helps! If you're still having issues, could you provide more details about your setup and how you're using the TokenCountingHandler? This will help me better understand the issue and provide a more accurate solution.

Sources

#### About Dosu

This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.

hatianzhang commented 9 months ago

For the MM models, we need to add support for CallbackManager: https://github.com/run-llama/llama_index/blob/main/llama_index/multi_modal_llms/base.py#L154-L166. @mw19930312, feel free to open a PR adding this new feature. I think you can follow https://github.com/run-llama/llama_index/blob/main/llama_index/llms/base.py#L31-L280 to add it.
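For illustration only, a rough, hypothetical sketch of the direction being proposed here: reuse the `llm_completion_callback` wrapper from `llms/base.py` on the multimodal `complete` endpoint so that `CBEventType.LLM` events (and therefore `TokenCountingHandler`) fire for multimodal calls as well. This is not the actual patch.

```python
from llama_index.llms.base import llm_completion_callback
from llama_index.multi_modal_llms.base import MultiModalLLM


class CallbackAwareMultiModalLLM(MultiModalLLM):
    """Directional sketch: wrap the multimodal completion endpoint with the
    same callback decorator the text LLM base class uses, so the attached
    CallbackManager sees LLM start/end events and can count tokens."""

    @llm_completion_callback()
    def complete(self, prompt, image_documents, **kwargs):
        # Real implementations (e.g. OpenAIMultiModal) would make the API call
        # here; the decorator takes care of emitting the callback events.
        ...
```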

mw19930312 commented 9 months ago

> For the MM models, we need to add support for CallbackManager: https://github.com/run-llama/llama_index/blob/main/llama_index/multi_modal_llms/base.py#L154-L166. @mw19930312, feel free to open a PR adding this new feature. I think you can follow https://github.com/run-llama/llama_index/blob/main/llama_index/llms/base.py#L31-L280 to add it.

Thanks for the reply, @hatianzhang! Are you suggesting that we need to maintain the counter ourselves? Is there a timeline for llama_index to support this officially?
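Until callback support lands in the multimodal base class, one possible stopgap (a hedged sketch, not an official API: it counts only text tokens and ignores the per-image tokens that OpenAI bills for vision calls) is to count the prompt and completion text directly with tiktoken:

```python
import tiktoken

# gpt-4-vision-preview uses the same cl100k_base tokenizer family as gpt-4.
enc = tiktoken.encoding_for_model("gpt-4-vision-preview")


def count_text_tokens(prompt: str, completion_text: str) -> dict:
    """Rough text-only token tally; image tokens are not included."""
    prompt_tokens = len(enc.encode(prompt))
    completion_tokens = len(enc.encode(completion_text))
    return {
        "prompt": prompt_tokens,
        "completion": completion_tokens,
        "total": prompt_tokens + completion_tokens,
    }


# e.g. after `response = llm.complete(prompt, image_documents=[document])`:
# counts = count_text_tokens(prompt, response.text)
```

If the response object exposes the raw OpenAI payload, its `usage` field (when present) is a more accurate source than this estimate.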

hatianzhang commented 9 months ago

I mean we should add this feature to the base class in the llama_index abstraction.

mw19930312 commented 9 months ago

> I mean we should add this feature to the base class in the llama_index abstraction.

I see. Thanks for the explanation! Would you mind elaborating a bit on how I should change my code? I can't quite follow the link you sent because I'm new to llama_index.

hatianzhang commented 9 months ago

It would probably be easier if I make a PR for this from my branch. Could you help verify it once the PR is up, by checking out my branch/PR and running your example?

Thanks

mw19930312 commented 9 months ago

Sure! Let me know when your PR is ready and I can test it on my side. Is there an expected timeline for your change?

mw19930312 commented 9 months ago

@hatianzhang ^^

hatianzhang commented 9 months ago

@mw19930312 Sorry, I am pretty busy with other tasks at the moment. Will update ASAP.

benbenz commented 8 months ago

I am not sure if I am doing something wrong, but it seems to be broken in some other cases: sync calls with CondenseQuestionChatEngine, and async calls with CondenseQuestionChatEngine, CondensePlusContextChatEngine, and ContextChatEngine. All of these calls use a streaming response when possible.

I have the same code setting up the sync and async versions, so the fact that the results differ seems to suggest the issue lies within LlamaIndex.

For CondenseQuestionChatEngine, it seems it could be caused by conflicting use of service_context and Settings.callback_manager during initialization. But the underlying llm attribute does have its callback manager set, so I don't understand why it is not used during generation.

EDIT: passing service_context again in any of the non-working calls above resolves the issue
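For concreteness, a hedged sketch of the workaround described in that edit (parameter names follow the 0.9-era API; the exact `from_defaults` signature may differ by version, and the `./data` directory is just a placeholder):

```python
import tiktoken
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.callbacks import CallbackManager, TokenCountingHandler
from llama_index.chat_engine import CondenseQuestionChatEngine

token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-4").encode
)
service_context = ServiceContext.from_defaults(
    callback_manager=CallbackManager([token_counter])
)

documents = SimpleDirectoryReader("./data").load_data()  # placeholder data dir
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# Passing service_context explicitly to the chat engine (in addition to any
# global settings) is what reportedly makes the counter update here.
chat_engine = CondenseQuestionChatEngine.from_defaults(
    query_engine=index.as_query_engine(),
    service_context=service_context,
)
```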

talhalatifkhan commented 6 months ago

bump