langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

[FEATURE REQUEST] langchain-openai - max_tokens (vs max_context?) ability to use full LLM contexts and account for user-messages automatically. #22778

Open ventz opened 3 weeks ago

ventz commented 3 weeks ago


Example Code

import dotenv
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.output_parsers import StrOutputParser

dotenv.load_dotenv()
llm = ChatOpenAI(
    model="gpt-4",
    temperature=0.2,

    # NOTE: setting max_tokens to 100 works; setting it to 8192 (or anything only slightly below that) does not.
    max_tokens=8160
)

output_parser = StrOutputParser()

prompt_template = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Answer all questions to the best of your ability."),
    MessagesPlaceholder(variable_name="messages"),
])

chain = prompt_template | llm | output_parser

response = chain.invoke({
    "messages": [
        HumanMessage(content="what llm are you 1? what llm are you 2? what llm are you 3? what llm are you 4? what llm are you 5? what llm are you 6?"),
    ],
})

print(response)

Error Message and Stack Trace (if applicable)

raise self._make_status_error_from_response(err.response) from None openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 8192 tokens. However, you requested 8235 tokens (75 in the messages, 8160 in the completion). Please reduce the length of the messages or completion.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}

Description

max_tokens does not account for the tokens already consumed by the user prompt.

If you specify a max_tokens of 100, for example, it "works" (not because it accounts for the prompt, but simply because there is enough spare room left in the context window to expand into). With any given prompt it will produce the expected result.

However, if you specify a max_tokens near the model's limit (for GPT-4, e.g. 8192 or 8100), it does not. This means max_tokens is effectively not implemented correctly.
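
For reference, this is just the arithmetic from the error message above: the prompt and the completion budget have to fit into the same 8192-token window.

context_window = 8192   # gpt-4: shared budget for prompt + completion
prompt_tokens = 75      # from the error message: tokens in the system + user messages
requested_completion = 8160  # the max_tokens value passed to ChatOpenAI

print(prompt_tokens + requested_completion)  # 8235 -> exceeds 8192, hence the 400
print(context_window - prompt_tokens)        # 8117 -> largest max_tokens that would fit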

System Info

langchain==0.1.20
langchain-aws==0.1.4
langchain-community==0.0.38
langchain-core==0.1.52
langchain-google-vertexai==1.0.3
langchain-openai==0.1.7
langchain-text-splitters==0.0.2

platform: mac
Python: 3.11.6

efriis commented 3 weeks ago

Howdy! I'll keep this open to prevent cycling through reopens of the same issue, but could you read through the OpenAI docs for the max_tokens param and see whether this behaves any differently from using OpenAI directly? https://platform.openai.com/docs/api-reference/chat/create#chat-create-max_tokens

Context windows limit the total number of tokens (input + generated), and the max_tokens parameter dictates the max generated tokens.

Here's a curl command that yields the same behavior:

curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4",
    "max_tokens": 8160,
    "temperature": 0.2,
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant. Answer all questions to the best of your ability."
      },
      {
        "role": "user",
        "content": "what llm are you 1? what llm are you 2? what llm are you 3? what llm are you 4? what llm are you 5? what llm are you 6?"
      }
    ]
  }'
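
The equivalent call through the bare openai Python SDK (a quick sketch assuming the v1-style client) fails the same way, which confirms the 400 is coming from the API itself rather than from langchain-openai:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

client.chat.completions.create(
    model="gpt-4",
    max_tokens=8160,
    temperature=0.2,
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Answer all questions to the best of your ability."},
        {"role": "user", "content": "what llm are you 1? what llm are you 2? what llm are you 3? what llm are you 4? what llm are you 5? what llm are you 6?"},
    ],
)
# -> openai.BadRequestError: ... 'code': 'context_length_exceeded'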

ventz commented 3 weeks ago

Hi @efriis - thank you.

I understand why it was designed this way if the goal was to mimic the OpenAI API directly. That makes sense for the API (imo at least), but since LangChain aims to be a much more useful wrapper/framework, this feels almost like a missing feature / "bug".

If the goal is for max_tokens to mirror the API, I can see why it maps directly onto OpenAI's max_tokens.

Would it possibly make sense to add an additional parameter (e.g. max_context) that would auto-calculate this?

I know the end user can calculate it (this is what we are doing currently: subtracting the system and user prompt tokens, and accounting for the 11 tokens of message metadata), but it seems silly that you can't simply pass a parameter that says "use as many tokens as you can". Maybe even make max_context a boolean and default it to False?

Ex:

import dotenv
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.output_parsers import StrOutputParser
import tiktoken

dotenv.load_dotenv()

def calculate_tokens(messages, model="gpt-4"):
    enc = tiktoken.encoding_for_model(model)
    tokens = 0
    for message in messages:
        encoded_message = enc.encode(message.content)
        print(f"Message: {message.content}\nTokens: {encoded_message}\nToken Count: {len(encoded_message)}\n")
        tokens += len(encoded_message)
    return tokens

# Define the system message and human messages
system_message = SystemMessage(content="You are a helpful assistant. Answer all questions to the best of your ability.")
human_messages = [
    HumanMessage(content="what llm are you 1? what llm are you 2? what llm are you 3? what llm are you 4? what llm are you 5? what llm are you 6? what llm are you 7? what llm are you 8? what llm are you 9? what llm are you 10? what llm are you 11?"),
]

# Combine all messages
all_messages = [system_message] + human_messages

# Calculate tokens in all messages
message_tokens = calculate_tokens(all_messages)
print(f"Total Message Tokens: {message_tokens}")

max_context_length = 8192

# Calculate max_tokens to pass to the model by subtracting the message tokens (plus overhead) from the context window.
# NOTE: There is a metadata/formatting overhead of 11 tokens added on top of the user+system prompts.
max_tokens = max_context_length - message_tokens - 11
print(f"Calculated Max Tokens: {max_tokens}")

# Initialize the ChatOpenAI model with the adjusted max_tokens
llm = ChatOpenAI(
    model="gpt-4",
    temperature=0.2,
    max_tokens=max_tokens
)

output_parser = StrOutputParser()

# Create the prompt template using the system message and placeholder for human messages
prompt_template = ChatPromptTemplate.from_messages([
    ("system", system_message.content),
    MessagesPlaceholder(variable_name="messages"),
])

chain = prompt_template | llm | output_parser

try:
    response = chain.invoke({
        "messages": human_messages
    })
    print(response)
except Exception as e:
    print(e)
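
As an aside, a shorter variant of the same calculation (an untested sketch) could lean on ChatOpenAI.get_num_tokens_from_messages(), which already includes the per-message formatting overhead, so the manual tiktoken loop and the hard-coded 11-token buffer shouldn't be needed:

# Untested sketch: let langchain-openai count the prompt tokens (content + message overhead)
counter = ChatOpenAI(model="gpt-4")
prompt_tokens = counter.get_num_tokens_from_messages(all_messages)

llm = ChatOpenAI(
    model="gpt-4",
    temperature=0.2,
    max_tokens=max_context_length - prompt_tokens,
)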

What do you think?