Issue you'd like to raise.
When using traceable or wrappers.wrap_openai with a multimodal gpt-4o call, the number of input tokens seems to be tracked incorrectly.
This is the code I used to test (sketched minimally below; the file path, prompt text, and image URL are placeholders):
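```python
import base64

from langsmith import wrappers
from openai import OpenAI

# wrap_openai instruments the client so calls are traced in LangSmith.
client = wrappers.wrap_openai(OpenAI())

# Variant 1: send the image as a base64-encoded data URL.
with open("image.png", "rb") as f:  # placeholder path
    b64 = base64.b64encode(f.read()).decode("utf-8")

client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)

# Variant 2: send the same image by URL.
client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/image.png"}},
        ],
    }],
)
```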
When I check my LangSmith project, I see two traces. The one associated with the b64-encoded image reports roughly 0.5 million input tokens, while the URL method reports under 200 input tokens. After looking at the wrapper code, it appears that the text of the messages is simply concatenated and tokenized to estimate the input tokens. But this doesn't work for multimodal calls, since the tokenization of image content isn't known client-side: the b64 data URL gets counted as ordinary text (hence the inflated ~0.5M figure), while a plain URL contributes almost nothing (hence the undercount). The only accurate way to get input token usage appears to be to wait for the usage report OpenAI returns with the response.
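For reference, that usage report is already on every response object the wrapped client returns, for both variants; continuing the sketch above:

```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/image.png"}},
        ],
    }],
)

# Server-side counts with the image priced in -- the numbers the trace
# should show instead of the client-side text-tokenization estimate.
print(response.usage.prompt_tokens)
print(response.usage.completion_tokens)
print(response.usage.total_tokens)
```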
I have implemented my own workaround using the REST API, but native support for accurate multimodal token tracking would be more helpful.
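For anyone who needs something similar in the meantime, here is a rough sketch of that kind of workaround (not my exact code): read the usage OpenAI returns and attach it to the current run through LangSmith's run-update endpoint. Treat the extra.token_usage field as an assumption about where to store it, and note that the tracer's own end-of-run update may overwrite it.

```python
import os

import requests
from langsmith.run_helpers import get_current_run_tree, traceable

@traceable
def describe_image(messages):
    # `client` is the wrap_openai-wrapped client from the sketch above.
    response = client.chat.completions.create(model="gpt-4o", messages=messages)

    # Patch the run for this traced call with OpenAI's authoritative usage.
    run = get_current_run_tree()
    requests.patch(
        f"https://api.smith.langchain.com/runs/{run.id}",
        headers={"x-api-key": os.environ["LANGSMITH_API_KEY"]},
        json={"extra": {"token_usage": response.usage.model_dump()}},
    ).raise_for_status()
    return response.choices[0].message.content
```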
Suggestion:
No response
I second this. My image + text prompt in LangSmith shows as 800K tokens, while the OpenAI API "usage" field reports total_tokens as 2160. Please fix. I am using wrap_openai() with @traceable.