langflow-ai / langflow

Langflow is a low-code app builder for RAG and multi-agent AI applications. It’s Python-based and agnostic to any model, API, or database.
http://www.langflow.org
MIT License
30.06k stars 3.79k forks

Get Openai model list from api #1919

Closed · itaybar closed this issue 3 weeks ago

itaybar commented 4 months ago

Instead of hardcoding the model list, allow fetching it from the SDK:

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.getenv('OPENAI_API_KEY')
)

models = client.models.list()

print(models)

This would also allow fetching custom models from OpenAI-compatible APIs such as vLLM.

ogabrielluiz commented 4 months ago

Hey @itaybar, how are you?

We'd love to have that, but we'll probably need a solution that works well both with and without an API key.

We'll see what we can do. Please let us know if you have any component suggestions.
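One way to satisfy the with/without-key requirement might be to fall back to a static list whenever the API call cannot be made. This is only a sketch, not the actual Langflow component API; `DEFAULT_MODELS`, `list_models`, and the injected `fetch` callable are hypothetical names:

```python
import os

# Hypothetical fallback list; Langflow's real hardcoded list may differ.
DEFAULT_MODELS = ["gpt-4o", "gpt-4-turbo", "gpt-3.5-turbo"]

def list_models(fetch=None):
    """Return model ids from the API when possible, else a static default.

    `fetch` is a zero-argument callable returning a list of model ids
    (e.g. a thin wrapper around client.models.list()); it is injected
    here so the fallback logic can be tested without network access.
    """
    if fetch is None or not os.getenv("OPENAI_API_KEY"):
        return DEFAULT_MODELS
    try:
        return fetch()
    except Exception:
        # Network error, bad key, etc. -- degrade gracefully.
        return DEFAULT_MODELS
```

With this shape, a missing key, a failed request, or an unset fetcher all degrade to the same hardcoded behavior Langflow has today.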

ogabrielluiz commented 4 months ago

This will list all models, including ones that do not support text generation.
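Since `client.models.list()` returns every model (embeddings, audio, image, etc.), the result would need filtering. A crude sketch that keeps only chat-style ids by prefix — the prefixes are an assumed heuristic, not an official OpenAI taxonomy:

```python
# Assumed prefixes for text/chat models; not an official classification.
CHAT_PREFIXES = ("gpt-4", "gpt-3.5")

def filter_chat_models(model_ids):
    """Keep only ids that look like text/chat models."""
    return [m for m in model_ids if m.startswith(CHAT_PREFIXES)]
```

A prefix filter would silently drop custom model names served by OpenAI-compatible backends, which is exactly the use case raised above — so any such filter would probably need to be optional.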

YamonBot commented 4 months ago

@ogabrielluiz

Hello, how are you?

What do you think about parsing the OpenAI official documentation (https://platform.openai.com/docs/models) using BeautifulSoup and pyppeteer?

In my code, I fetch the list of models from the anchor at:

https://platform.openai.com/docs/models/models

From there, I parse every model table on the page whose contents include "token" and "CONTEXT WINDOW".

For text models, the expression "token" is essential, so it seems appropriate to use it as a filtering condition.

I also tried using requests, but due to a 403 error caused by Cloudflare, I had to resort to using pyppeteer.

OpenAI has consistently updated their documentation alongside their announcements, and they will likely continue to do so. Although the structure of the site may change, the format of the documentation has remained relatively stable, so this approach should be sufficient to handle occasional user requests.

Since pyppeteer mimics an actual browser, integrating it directly into the OpenAI components could result in frequent unnecessary connections. It would therefore be best to implement this as a custom component that can be docked to the model name field.

Additionally, since the context-window size can be scraped as well, it would be beneficial to populate the max token value automatically if the component is unified.

import asyncio
from pyppeteer import launch
from bs4 import BeautifulSoup

async def fetch_page(url):
    """
    Fetches the HTML page from the given URL.

    Args:
        url (str): URL of the page

    Returns:
        str: HTML page content
    """
    browser = await launch(headless=True, args=['--no-sandbox'])
    page = await browser.newPage()
    await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36')
    await page.goto(url)
    # Wait until the table is loaded
    await page.waitForSelector('.models-table table')
    content = await page.content()
    await browser.close()
    return content

def parse_models(url):
    """
    Extracts model names, descriptions, and token counts from the page.

    Args:
        url (str): URL of the model overview page

    Returns:
        dict: Dictionary containing model names, sub-model names, and token counts
    """
    html_content = asyncio.run(fetch_page(url))
    soup = BeautifulSoup(html_content, 'html.parser')

    models = {}

    # Find model type sections
    sections = soup.select('.anchor-heading-root')

    for section in sections:
        type_name = section.text.strip()
        table = section.find_next_sibling('div').select_one('table')
        if table:
            rows = table.find_all('tr')[1:]  # skip the header row

            for row in rows:
                cells = row.find_all('td')
                # Ensure the row contains Model, Description, Context window, Training data
                if len(cells) >= 4:
                    model_name = cells[0].text.strip()
                    context_window = int(cells[2].text.strip().replace(
                        ' tokens', '').replace(',', ''))

                    # Add sub-model names and token counts under the model type key
                    models.setdefault(type_name, {})[model_name] = context_window

    return models

def main():
    overview_url = "https://platform.openai.com/docs/models/models-overview"

    # Parse model information
    model_details_dict = parse_models(overview_url)

    # Print results
    for type_name, details in model_details_dict.items():
        print(f"{type_name}: {details}")

    return model_details_dict

if __name__ == "__main__":
    main()

The output results are as follows:

GPT-4o: {'gpt-4o': 128000, 'gpt-4o-2024-05-13': 128000}
GPT-4 Turbo and GPT-4: {'gpt-4-turbo': 128000, 'gpt-4-turbo-2024-04-09': 128000, 'gpt-4-turbo-preview': 128000, 'gpt-4-0125-preview': 128000, 'gpt-4-1106-preview': 128000, 'gpt-4-vision-preview': 128000, 'gpt-4-1106-vision-preview': 128000, 'gpt-4': 8192, 'gpt-4-0613': 8192, 'gpt-4-32k': 32768, 'gpt-4-32k-0613': 32768}
GPT-3.5 Turbo: {'gpt-3.5-turbo-0125': 16385, 'gpt-3.5-turbo': 16385, 'gpt-3.5-turbo-1106': 16385, 'gpt-3.5-turbo-instruct': 4096, 'gpt-3.5-turbo-16k': 16385, 'gpt-3.5-turbo-0613': 4096, 'gpt-3.5-turbo-16k-0613': 16385}
GPT base: {'babbage-002': 16384, 'davinci-002': 16384}

YamonBot commented 4 months ago

In fact, I would prefer to see Selenium Hub integrated into Langflow rather than pyppeteer. However, I currently lack the capability to make such changes to the Langflow core code.
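Whichever browser automation is used, the "frequent unnecessary connections" concern raised above could be addressed by wrapping the scrape (or API call) in a simple time-based cache. A stdlib-only sketch; the names and the 24-hour TTL are illustrative choices, not part of Langflow:

```python
import time

_cache = {"models": None, "fetched_at": 0.0}
CACHE_TTL = 24 * 3600  # re-scrape at most once a day (illustrative value)

def get_models_cached(fetch, now=time.time):
    """Return cached model data, refreshing via `fetch()` when the TTL expires.

    `fetch` is a zero-argument callable (e.g. the parse_models scraper above);
    `now` is injected so the expiry logic can be tested without sleeping.
    """
    if _cache["models"] is None or now() - _cache["fetched_at"] > CACHE_TTL:
        _cache["models"] = fetch()
        _cache["fetched_at"] = now()
    return _cache["models"]
```

This keeps the heavyweight browser launch off the hot path: a component docked to the model name field would hit the cache on every render and only trigger a real scrape once per TTL window.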