Tokeniser Issue - Githubissues

KishenKumar27 commented 9 months ago

**The bug**
KeyError: 'Could not automatically map gpt-35-turbo to a tokeniser. Please use tiktoken.get_encoding to explicitly get the tokeniser you expect.'

**To Reproduce**
 - python==3.11.5
 - openai==1.12.0
 - guidance==0.1.10
 - model_name=gpt-35-turbo


from guidance import assistant, gen, models, system, user

azureai_model = models.AzureOpenAIChat(
    model="gpt-35-turbo",
    azure_endpoint=azure_endpoint,
    api_key=api_key,
    api_version=api_version,
    temperature=0.1,
)

with system():
    lm = azureai_model + "You are an intent classifier assistant from the sentences user provides."

with user():
    lm += f'''
        "inquire or many tables" could potentially denote a scenario involving the retrieval or investigation of information from multiple sources or datasets.
        This could manifest in various contexts, such as database querying where one seeks data from numerous tables within a database, conducting research or investigations across diverse sources to gather insights, or in the realm of natural language processing or conversational AI where a user expresses a desire to access information from different categories or sources of knowledge. 
        The term encapsulates the idea of exploring and gathering insights from a variety of sources or datasets, whether they be structured data tables in a database or broader sources of information in a research or conversational context.

        "Not an inquiry" typically refers to a request or statement that is not seeking information or clarification on a particular topic. 
        In natural language processing or conversational AI contexts, this directive indicates that the user's input does not pertain to querying or requesting information. 
        Instead, it might involve commands, statements, or expressions of intent that do not require a response containing factual information or guidance. 
        Users employing this directive may be engaging in tasks such as giving commands, expressing opinions, making statements, or engaging in conversation for purposes other than seeking information. 
        The system recognizes this directive as a cue to refrain from treating the input as a query and instead respond appropriately based on the nature of the user's statement or request.

        "show list of tables" intent refers to a user's request to view a comprehensive list of tables within a database or similar data storage system. 
         In the context of database management or data querying, this intent prompts the system to retrieve and display a structured inventory of all available tables, providing users with a clear overview of the data organization. 
         This intent is commonly encountered in database administration tasks or when users need to navigate and understand the structure of a database, facilitating efficient data management, exploration, and analysis. 
         The system's response typically presents the table names along with any relevant metadata, enabling users to identify and select the specific tables they need for further actions or inquiries within the database environment.
        '''

with assistant():
    lm += f'''
        assistant: Given the user input: {{input}}
        Select ONLY one of the following intents: {{intents}}'''
    lm += gen('answer')

# `return` is only valid inside a function; print the result at top level.
print(lm['answer'])

**System info**
 - OS: macOS
 - Guidance Version: `0.1.10`

Harsha-Nori commented 9 months ago

Hi @KishenKumar27, thanks for pointing this out. I'm noticing a few issues in our AzureOpenAI classes, but I think the principal one is that Azure deployments can have any user-specified name, which means we can't inspect the name string and detect the appropriate tokenizer to initialize from tiktoken. In this case, for example, tiktoken's name for the model is gpt-3.5-turbo, not gpt-35-turbo (note the extra '.'). However, I think this problem is bigger than just detecting '.' characters, since you could have named your deployment anything.

I'm going to continue investigating this -- and leave this issue open in the interim -- but for now, you should pass in a tokenizer manually:

from guidance import user, assistant, models
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

azureai_model = models.AzureOpenAIChat(
    model="gpt-35-turbo",
    azure_endpoint=azure_endpoint,
    api_key=api_key,
    api_version=api_version,
    temperature=0.1,
    tokenizer=enc,  # Manually pass in tokenizer for the model corresponding to your deployment. 
)

I'm hopeful that we can use some library function to detect the underlying model from an AzureOpenAI deployment (@riedgar-ms, any thoughts?), but haven't found a way to do so upon a quick search. Hopefully I'm just missing something simple! Thanks again for reporting this!
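Until the underlying model can be detected automatically, one possible stopgap is to keep the deployment-to-model mapping explicit on the caller's side. The sketch below is only an illustration under that assumption: `DEPLOYMENT_TO_MODEL` and `resolve_model_name` are hypothetical names, not part of guidance or tiktoken, and the entries are examples.

```python
# Hypothetical, user-maintained mapping: Azure deployment name -> underlying model name.
# Deployment names are arbitrary, so this table has to be filled in by hand.
DEPLOYMENT_TO_MODEL = {
    "gpt-35-turbo": "gpt-3.5-turbo",
    "my-chat-deployment": "gpt-3.5-turbo",
}


def resolve_model_name(deployment: str) -> str:
    """Return the underlying model name registered for an Azure deployment."""
    try:
        return DEPLOYMENT_TO_MODEL[deployment]
    except KeyError:
        raise KeyError(
            f"No model registered for deployment {deployment!r}; "
            "add it to DEPLOYMENT_TO_MODEL."
        ) from None


model_name = resolve_model_name("gpt-35-turbo")
print(model_name)
```

The resolved name could then be given to `tiktoken.encoding_for_model(model_name)` and the resulting encoding passed as `tokenizer=` in the snippet above.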

imarquart commented 9 months ago

Just FYI @Harsha-Nori, there is a bug: when you initialize AzureOpenAIChat, it ignores the passed tokenizer because it hardcodes a call to tiktoken.

The trivial fix is to change line 109 in the AzureOpenAI file to `tokenizer=tokenizer or tiktoken.encoding_for_model(model),`.

Here is a PR https://github.com/guidance-ai/guidance/pull/641
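To make the difference concrete, here is a minimal, dependency-free sketch of the two patterns. It is not the guidance source: `lookup_tokenizer`, `init_hardcoded`, and `init_with_fallback` are hypothetical stand-ins, and plain strings stand in for tokenizer objects.

```python
def lookup_tokenizer(model: str) -> str:
    """Stand-in for tiktoken.encoding_for_model: fails on unknown names."""
    known = {"gpt-3.5-turbo": "cl100k_base"}
    return known[model]  # raises KeyError for arbitrary deployment names


def init_hardcoded(model: str, tokenizer=None) -> str:
    # Bug pattern: the caller's tokenizer is ignored and the lookup always runs.
    return lookup_tokenizer(model)


def init_with_fallback(model: str, tokenizer=None) -> str:
    # Fix pattern: `or` short-circuits, so the lookup only runs
    # when no tokenizer was passed in.
    return tokenizer or lookup_tokenizer(model)


# With an explicit tokenizer, the deployment name no longer matters.
print(init_with_fallback("gpt-35-turbo", tokenizer="my-encoding"))
```

Note that `or` treats any falsy value as "not provided"; an explicit `tokenizer is None` check would be the stricter variant of the same idea.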

btw, one could also use AzureOpenAI function calling to mirror the use of grammars, are y'all interested in that?

Harsha-Nori commented 9 months ago

Good catch -- merged the PR in :).

> btw, one could also use AzureOpenAI function calling to mirror the use of grammars, are y'all interested in that?

Do you mind expanding on this? I don't think we can leverage it for totally arbitrary grammars, but I can see how we could leverage it for e.g. JSON grammars.

guidance-ai / guidance

Tokeniser Issue #635