run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Not providing a model for the AzureOpenAI llm causes unexpected context size #13259

Open mve opened 6 months ago

mve commented 6 months ago

Bug Description

When no model name is specified when creating an instance of AzureOpenAI, a default is chosen that may differ from the model configured in the Azure deployment. This can cause unexpected behaviour because the context size is chosen based on the model name.

I think model should not have a default value, and if none is entered it should show a clear error explaining that the model must be provided and that it must have the same name as the one selected in the Azure deployment.

Also inside the class there is the following comment:

- model: Name of the model (e.g. `text-davinci-003`)
        This in only used to decide completion vs. chat endpoint.

I believe this is not accurate as it is also used for the context size.

Version

0.10.31

Steps to Reproduce

Create an instance of AzureOpenAI like so:

    return AzureOpenAI(
        engine=engine,
        temperature=temperature,
        timeout=timeout,
        # model was not added here.
        api_key=AZURE_OPENAI_API_KEY,
        azure_endpoint=AZURE_OPENAI_API_ENDPOINT,
        api_version=AZURE_OPENAI_API_VERSION,
        max_tokens=max_tokens,
    )

If no model is specified, the default of "gpt-35-turbo" is used. If your deployment uses a different model, this can cause unexpected behaviour, as the context size will be set to 4096 even though the model you selected in Azure may have a larger context size.

I think this context size gets chosen based on the model name in this code:

    @property
    def metadata(self) -> LLMMetadata:
        return LLMMetadata(
            context_window=openai_modelname_to_contextsize(self._get_model_name()),
            num_output=self.max_tokens or -1,
            is_chat_model=is_chat_model(model=self._get_model_name()),
            is_function_calling_model=is_function_calling_model(
                model=self._get_model_name()
            ),
            model_name=self.model,
        )
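
For illustration, the mapping can be checked directly with the same utility (a minimal sketch; it assumes `openai_modelname_to_contextsize` is importable from the llama-index-llms-openai integration and that these two Azure-style names are in its lookup table):

    from llama_index.llms.openai.utils import openai_modelname_to_contextsize

    # The default model name resolves to a 4,096-token window...
    print(openai_modelname_to_contextsize("gpt-35-turbo"))      # 4096
    # ...while the larger variant would resolve differently if it were passed in.
    print(openai_modelname_to_contextsize("gpt-35-turbo-16k"))  # 16384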

Relevant Logs/Tracebacks

    File "/usr/local/lib/python3.12/site-packages/llama_index/core/memory/chat_memory_buffer.py", line 109, in get
        raise ValueError("Initial token count exceeds token limit")
    ValueError: Initial token count exceeds token limit
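
The error originates in the chat memory buffer, whose default token limit is derived from the LLM's advertised context window. A rough sketch of the relationship (assuming `ChatMemoryBuffer.from_defaults` takes its limit from `llm.metadata.context_window`, where `llm` is an AzureOpenAI instance created as above):

    from llama_index.core.memory import ChatMemoryBuffer

    # token_limit defaults to a fraction of llm.metadata.context_window,
    # so the undersized default model name shrinks the memory buffer too.
    memory = ChatMemoryBuffer.from_defaults(llm=llm)
    print(memory.token_limit)
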
dosubot[bot] commented 6 months ago

To address the issue where not specifying a model when creating an instance of AzureOpenAI leads to unexpected behavior due to a mismatch in context size, it's crucial to explicitly specify the model name during instantiation. This ensures that the context size matches the model configured in your Azure deployment, preventing errors like the one you encountered. Here's how you can specify the model name:

    from llama_index.llms.azure_openai import AzureOpenAI

    azure_engine = AzureOpenAI(
        model="<your-model-name>",  # Replace <your-model-name> with your actual model name
        engine="<your-deployment-name>",
        # other parameters as needed
    )
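
To confirm the model name was picked up, the resolved context window can be inspected on the instance (a quick check; attribute names as in the metadata property quoted earlier):

    # Should now reflect the context size of the model you specified.
    print(azure_engine.metadata.context_window)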

This approach ensures that the AzureOpenAI class uses the correct model, aligning the context size with the model's capabilities and avoiding the ValueError related to token limits.

For dynamic handling based on the model's context size, you can use a function to determine the model's maximum token limit and adjust your application's behavior accordingly. This involves checking the input against the model's token limit before proceeding with operations that might exceed this limit. Here's a simplified example to illustrate this concept:

    import tiktoken  # used for a rough, model-agnostic token count

    # NOTE: import path assumed; this is the same utility the metadata property uses.
    from llama_index.llms.openai.utils import openai_modelname_to_contextsize

    def get_model_context_size(model_name):
        # Map the model name to its context window; None if the name is unknown.
        try:
            return openai_modelname_to_contextsize(model_name)
        except ValueError:
            return None

    def count_tokens(input_text):
        # Rough token count; cl100k_base is a reasonable default for recent models.
        return len(tiktoken.get_encoding("cl100k_base").encode(input_text))

    def ensure_input_within_limit(input_text, model_name):
        context_size = get_model_context_size(model_name)
        if context_size is None:
            # Unknown model name: refuse rather than guess a limit.
            return False

        token_count = count_tokens(input_text)
        if token_count > context_size:
            raise ValueError("Input exceeds the token limit for the specified model.")
        return True

This sketch assumes `openai_modelname_to_contextsize` can be imported from the llama-index-llms-openai integration and uses tiktoken's `cl100k_base` encoding as a rough, model-agnostic token counter; swap in a tokenizer that matches your model if you need exact counts. By dynamically adjusting to the model's token limit, you can prevent errors related to exceeding the model's context size.
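
For example (the text and model name here are purely illustrative):

    # Raises ValueError if the text does not fit within the model's context window.
    ensure_input_within_limit("Summarize the following report...", "gpt-35-turbo-16k")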

Implementing these strategies ensures that your use of the AzureOpenAI class is aligned with the specific model you've configured in Azure, mitigating unexpected behavior and errors.

#### Details

**Sources**

- [[Question]: How can I specify to use Azure openAI model for finetunning in `OpenAIFinetuneEngine`](https://github.com/run-llama/llama_index/issues/8004)
- [llama-index-core/llama_index/core/memory/chat_memory_buffer.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/memory/chat_memory_buffer.py)
- [llama-index-integrations/llms/llama-index-llms-portkey/llama_index/llms/portkey/utils.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/llms/llama-index-llms-portkey/llama_index/llms/portkey/utils.py)
- [llama-index-legacy/llama_index/legacy/llms/azure_openai.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/llms/azure_openai.py)
- [llama-index-legacy/llama_index/legacy/multi_modal_llms/azure_openai.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/multi_modal_llms/azure_openai.py)
- [llama-index-integrations/multi_modal_llms/llama-index-multi-modal-llms-azure-openai/llama_index/multi_modal_llms/azure_openai/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/multi_modal_llms/llama-index-multi-modal-llms-azure-openai/llama_index/multi_modal_llms/azure_openai/base.py)
- [llama-index-integrations/llms/llama-index-llms-azure-openai/llama_index/llms/azure_openai/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/llms/llama-index-llms-azure-openai/llama_index/llms/azure_openai/base.py)
- [llama-index-core/llama_index/core/callbacks/token_counting.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/callbacks/token_counting.py)
- [llama-index-core/llama_index/core/utilities/token_counting.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/utilities/token_counting.py)
- [llama-index-legacy/llama_index/legacy/callbacks/token_counting.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/callbacks/token_counting.py)
- [llama-index-legacy/llama_index/legacy/utilities/token_counting.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/utilities/token_counting.py)
- [llama-index-legacy/llama_index/legacy/embeddings/azure_openai.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/embeddings/azure_openai.py)
- [llama-index-integrations/embeddings/llama-index-embeddings-azure-openai/llama_index/embeddings/azure_openai/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/embeddings/llama-index-embeddings-azure-openai/llama_index/embeddings/azure_openai/base.py)
- [docs/docs/examples/callbacks/TokenCountingHandler.ipynb](https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/callbacks/TokenCountingHandler.ipynb)
- [llama-index-integrations/llms/llama-index-llms-azure-openai/pyproject.toml](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/llms/llama-index-llms-azure-openai/pyproject.toml)
- [llama-index-legacy/llama_index/legacy/llms/ai21_utils.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/llms/ai21_utils.py)

RussellLuo commented 6 months ago

It might be helpful to change the comment to something like:

- model: Name of the model (e.g. `text-davinci-003`)
    This is used to determine the context length, whether it's a chat model, and whether it supports function calling.

Additionally, I think it's challenging for users to correctly get the full model name, as Azure splits it into two fields: model name and model version. The models listed in the Azure docs lack standardized, unified names, which might confuse users.

For example, consider the table below (see GPT-3.5 models for the full table):

| Model ID | Max Request (tokens) | Training Data (up to) |
| --- | --- | --- |
| gpt-35-turbo (0301) | 4,096 | Sep 2021 |
| gpt-35-turbo (0613) | 4,096 | Sep 2021 |
| ... | ... | ... |
| gpt-35-turbo (1106) | Input: 16,385<br>Output: 4,096 | Sep 2021 |

I'm not sure which model the default name "gpt-35-turbo" refers to. Likewise, I am uncertain which name, "gpt-35-turbo" or "gpt-35-turbo-1106", should be used if I want to use model gpt-35-turbo (1106).

I think it might be easier and more intuitive if we can just configure the settings such as context_window, is_chat_model and is_function_calling_model directly.
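
As a rough sketch of what that could look like today, one workaround is to subclass and override the metadata property (the field names are taken from the snippet quoted in the issue; the hard-coded values below are placeholders for whatever the deployed model actually supports):

    from llama_index.core.base.llms.types import LLMMetadata
    from llama_index.llms.azure_openai import AzureOpenAI

    class ConfiguredAzureOpenAI(AzureOpenAI):
        @property
        def metadata(self) -> LLMMetadata:
            # Hard-code the values that match the deployed model instead of
            # deriving them from the model name.
            return LLMMetadata(
                context_window=16385,
                num_output=self.max_tokens or -1,
                is_chat_model=True,
                is_function_calling_model=True,
                model_name=self.model,
            )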

dosubot[bot] commented 6 months ago

To address the concerns about model name mismatches and their impact on context size and other settings when using AzureOpenAI with the LlamaIndex framework, it's crucial to ensure that the engine parameter accurately reflects your Azure OpenAI deployment name. This parameter is key to aligning the model's behavior with your expectations, particularly regarding context size and endpoint selection.

When initializing the AzureOpenAIMultiModal class, you should specify both the model and engine parameters. The model parameter is used to decide whether to use the completion or chat endpoint, while the engine parameter should match your Azure OpenAI deployment name to ensure correct model behavior:

    from llama_index.multi_modal_llms.azure_openai.base import AzureOpenAIMultiModal

    llm = AzureOpenAIMultiModal(
        model="YOUR_MODEL_NAME",  # Determines completion vs. chat endpoint.
        engine="YOUR_AZURE_OPENAI_DEPLOYMENT_NAME",  # Should match your Azure deployment.
        api_key=aoai_api_key,
        azure_endpoint=aoai_endpoint,
        api_version=aoai_api_version,
        use_azure_ad=False,  # Set to True to authenticate with a Microsoft Entra ID token.
    )

This setup helps prevent issues like the ValueError: Initial token count exceeds token limit error, which occurs when the input exceeds the model's token limit. It's also worth noting that direct configuration of settings such as context_window, is_chat_model, and is_function_calling_model is possible and can be adjusted to fit specific use cases within the LlamaIndex framework.

By ensuring the engine parameter correctly matches your Azure deployment and adjusting settings directly as needed, you can mitigate potential mismatches and utilize AzureOpenAI more effectively within the LlamaIndex framework.

#### Details

**Sources**

- [gpt-35-turbo fails with openai.error.InvalidRequestError](https://github.com/run-llama/llama_index/issues/1314)
- [llama-index-integrations/llms/llama-index-llms-azure-openai/llama_index/llms/azure_openai/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/llms/llama-index-llms-azure-openai/llama_index/llms/azure_openai/base.py)
- [llama-index-legacy/llama_index/legacy/llms/azure_openai.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/llms/azure_openai.py)
- [llama-index-integrations/llms/llama-index-llms-openai/llama_index/llms/openai/utils.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/llms/llama-index-llms-openai/llama_index/llms/openai/utils.py)
- [llama-index-integrations/multi_modal_llms/llama-index-multi-modal-llms-azure-openai/llama_index/multi_modal_llms/azure_openai/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/multi_modal_llms/llama-index-multi-modal-llms-azure-openai/llama_index/multi_modal_llms/azure_openai/base.py)
- [llama-index-legacy/llama_index/legacy/prompts/utils.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/prompts/utils.py)
- [llama-index-core/llama_index/core/base/llms/types.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/base/llms/types.py)
- [llama-index-legacy/llama_index/legacy/core/llms/types.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/core/llms/types.py)

mve commented 6 months ago

Yes, the naming of models in Azure can be a bit tricky. But at the same time, wouldn't it be even more tedious for people to look up the correct context windows for different models? Finding the correct context window would still require users to figure out which model they are using. I'm also not sure that changing the approach for AzureOpenAI vs. OpenAI is a good idea; it would make those two implementations work quite differently, right?

For those reasons I personally think it would be better to stick with using the model name. It just needs to be documented better, and IMO the default model value needs to be removed to prevent accidentally selecting the wrong model.
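
A rough sketch of the kind of check being proposed (hypothetical; this is not how the class currently behaves):

    from typing import Optional

    # Hypothetical validation mirroring the proposal: fail loudly instead of
    # silently falling back to "gpt-35-turbo".
    def validate_model_name(model: Optional[str]) -> str:
        if not model:
            raise ValueError(
                "`model` must be provided and must match the model selected "
                "in your Azure deployment; there is no safe default."
            )
        return model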

logan-markewich commented 2 months ago

Feel free to make a PR to the azure class @mve