Closed: itaybar closed this issue 3 weeks ago
Hey @itaybar How are you?
We'd love to have that, but we'll probably need a solution that works well both with and without an API key.
We'll see what we can do. Please tell us if you have any component suggestions.
This will list all models, including the ones that do not support text.
@ogabrielluiz
Hello, how are you?
What do you think about parsing the OpenAI official documentation (https://platform.openai.com/docs/models) using BeautifulSoup and pyppeteer?
In my code, I fetch the list of models from the anchor at:
https://platform.openai.com/docs/models/models
From there, I parse the model tables across the entire page that contain "token" and "CONTEXT WINDOW" entries.
For text models, the word "token" is always present, so it seems appropriate to use it as a filtering condition.
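As a rough illustration of that filtering condition, here is a minimal BeautifulSoup sketch over a hypothetical table snippet (the real markup on the docs page differs, so the HTML below is an assumption for demonstration only):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the docs' model tables; the real markup differs.
HTML = """
<table>
  <tr><th>Model</th><th>Description</th><th>Context window</th></tr>
  <tr><td>gpt-4o</td><td>Flagship model</td><td>128,000 tokens</td></tr>
  <tr><td>whisper-1</td><td>Speech to text</td><td>N/A</td></tr>
</table>
"""

def text_model_context_windows(html):
    """Keep only rows whose context-window cell mentions 'token'."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for row in soup.find_all("tr")[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) >= 3 and "token" in cells[2]:
            result[cells[0]] = int(cells[2].replace(" tokens", "").replace(",", ""))
    return result

print(text_model_context_windows(HTML))  # {'gpt-4o': 128000}
```

The same "token" test then generalizes to the full page: non-text rows (speech, images) never carry a token count, so they fall out naturally.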
I also tried using requests, but it was blocked with a 403 error by Cloudflare, so I had to resort to pyppeteer.
OpenAI has consistently updated its documentation alongside its announcements, and it is likely to continue doing so. Although the structure of the site may change, the format of the documentation has remained relatively stable, so this approach should be sufficient to handle occasional user requests.

Since pyppeteer mimics an actual browser, integrating it directly into the OpenAI components could result in frequent unnecessary connections. It would therefore be best to implement this as a custom component that can be docked to the model name field.
Additionally, since the max token value can also be retrieved, it would be useful to fill in that value automatically if the component is unified.
import asyncio

from bs4 import BeautifulSoup
from pyppeteer import launch


async def fetch_page(url):
    """
    Fetches the rendered HTML page from the given URL.

    Args:
        url (str): URL of the page

    Returns:
        str: HTML page content
    """
    browser = await launch(headless=True, args=['--no-sandbox'])
    try:
        page = await browser.newPage()
        await page.setUserAgent(
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
            '(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        )
        await page.goto(url)
        # Wait until the model table is loaded
        await page.waitForSelector('.models-table table')
        content = await page.content()
    finally:
        await browser.close()
    return content


def parse_models(url):
    """
    Extracts model names and context-window sizes from the page.

    Args:
        url (str): URL of the model overview page

    Returns:
        dict: {model type: {model name: context window in tokens}}
    """
    html_content = asyncio.get_event_loop().run_until_complete(fetch_page(url))
    soup = BeautifulSoup(html_content, 'html.parser')

    models = {}
    # Each model-type section starts with an anchor heading
    for section in soup.select('.anchor-heading-root'):
        type_name = section.text.strip()
        sibling = section.find_next_sibling('div')
        table = sibling.select_one('table') if sibling else None
        if not table:
            continue
        for row in table.find_all('tr')[1:]:  # skip the header row
            cells = row.find_all('td')
            # Expect at least Model, Description, Context window, Training data
            if len(cells) >= 4:
                model_name = cells[0].text.strip()
                context_window = int(
                    cells[2].text.strip().replace(' tokens', '').replace(',', '')
                )
                models.setdefault(type_name, {})[model_name] = context_window
    return models


def main():
    overview_url = "https://platform.openai.com/docs/models/models-overview"
    model_details_dict = parse_models(overview_url)
    # Print results in a readable format
    for type_name, details in model_details_dict.items():
        print(f"{type_name}: {details}")
    return model_details_dict


if __name__ == "__main__":
    main()
The output results are as follows:
GPT-4o: {'gpt-4o': 128000, 'gpt-4o-2024-05-13': 128000}
GPT-4 Turbo and GPT-4: {'gpt-4-turbo': 128000, 'gpt-4-turbo-2024-04-09': 128000, 'gpt-4-turbo-preview': 128000, 'gpt-4-0125-preview': 128000, 'gpt-4-1106-preview': 128000, 'gpt-4-vision-preview': 128000, 'gpt-4-1106-vision-preview': 128000, 'gpt-4': 8192, 'gpt-4-0613': 8192, 'gpt-4-32k': 32768, 'gpt-4-32k-0613': 32768}
GPT-3.5 Turbo: {'gpt-3.5-turbo-0125': 16385, 'gpt-3.5-turbo': 16385, 'gpt-3.5-turbo-1106': 16385, 'gpt-3.5-turbo-instruct': 4096, 'gpt-3.5-turbo-16k': 16385, 'gpt-3.5-turbo-0613': 4096, 'gpt-3.5-turbo-16k-0613': 16385}
GPT base: {'babbage-002': 16384, 'davinci-002': 16384}
In fact, I would prefer to see Selenium Hub integrated into Langflow rather than pyppeteer. However, I currently lack the capability to make such modifications to the Langflow main code.
Instead of hardcoding the models, allow fetching them from the SDK.
This would also make it possible to get custom models from OpenAI-compatible APIs such as vLLM.
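A minimal sketch of that idea, assuming the official `openai` Python SDK: `client.models.list()` returns every model the endpoint exposes, and a hypothetical `filter_text_models` heuristic (the keyword list is my assumption, not part of any existing component) narrows it down to text models. The `base_url` value is likewise only an example:

```python
def list_model_ids(base_url=None, api_key="EMPTY"):
    """List model ids from api.openai.com or any OpenAI-compatible
    server (e.g. a vLLM deployment reached via base_url)."""
    # Imported lazily so the filter below works without the SDK installed.
    from openai import OpenAI
    client = OpenAI(base_url=base_url, api_key=api_key)
    return [model.id for model in client.models.list()]


def filter_text_models(model_ids):
    """Hypothetical heuristic: drop ids that are clearly not text models."""
    non_text = ("whisper", "tts", "dall-e", "embedding", "moderation")
    return [m for m in model_ids if not any(tok in m for tok in non_text)]


# Example against a local vLLM server (hypothetical URL):
# filter_text_models(list_model_ids(base_url="http://localhost:8000/v1"))
```

The endpoint does not report which models support text, hence the keyword filter; this is the trade-off versus scraping the docs, which does expose that distinction along with the context window.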