unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper

Facing error in using open source LLM #209

Open · RiteshKB opened this issue 4 weeks ago

RiteshKB commented 4 weeks ago

Hello,

I attempted to use the LLMExtractionStrategy code provided in the documentation for OpenAI and adapted it to work with Hugging Face. However, I encountered the following error:

    Provider List: https://docs.litellm.ai/docs/providers
    Error in thread execution: litellm.BadRequestError: GetLLMProvider Exception - list index out of range

Can you help me resolve this?

Python code:

    import asyncio
    import nest_asyncio
    from crawl4ai import AsyncWebCrawler
    from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy
    import json
    import os
    from pydantic import BaseModel, Field
    from langchain_huggingface import HuggingFacePipeline, HuggingFaceEndpoint

    nest_asyncio.apply()

    repo_id = "meta-llama/Llama-3.1-8B"

    llm = HuggingFaceEndpoint(
        repo_id=repo_id,
        temperature=0.5,
        huggingfacehub_api_token="",  # provided the api_token
    )

    class OpenAIModelFee(BaseModel):
        model_name: str = Field(..., description="Name of the OpenAI model.")
        input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
        output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

    async def extract_openai_fees(hf_model):
        async with AsyncWebCrawler(verbose=True) as crawler:
            result = await crawler.arun(
                url='https://openai.com/api/pricing/',
                word_count_threshold=1,
                extraction_strategy=LLMExtractionStrategy(
                    provider="huggingface",
                    model=hf_model,
                    schema=OpenAIModelFee.schema(),
                    extraction_type="schema",
                    instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
                    Do not miss any models in the entire content. One extracted model JSON format should look like this:
                    {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
                ),
                bypass_cache=True,
            )
            print(len(result.extracted_content))

    # Run the async function using asyncio.run()
    if __name__ == "__main__":
        asyncio.run(extract_openai_fees(llm))

unclecode commented 3 weeks ago

@RiteshKB Sure, let me check your code and get back to you.

unclecode commented 3 weeks ago

@RiteshKB Your code has some issues; consider the code below instead. When using a Hugging Face model, I suggest downloading it first, as it works better, and now that Ollama supports Hugging Face models, it's best to run it through Ollama. Look at the following code, and see the sketch after it for the command-line side. Just pay attention that what you pass to `provider` is a combination of the provider name and the model name, and don't forget to pass your Hugging Face token. To ensure you're passing it correctly, check the litellm provider list (https://docs.litellm.ai/docs/providers); you won't need to pass a separate `model` argument the way you did.

async def extract_openai_fees():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url='https://openai.com/api/pricing',
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider="huggingface/meta-llama/Meta-Llama-3.1-8B",
                api_token=os.environ["HUGGINGFACE_API_TOKEN"],
                schema=OpenAIModelFee.schema(),
                extraction_type="schema",
                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
                Do not miss any models in the entire content. One extracted model JSON format should look like this:
                {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}""",
            ),
            bypass_cache=True,
        )
        print(len(result.extracted_content))
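For reference, here is a minimal sketch of the Ollama route mentioned above. It assumes Ollama is installed and running locally; the `llama3.1` model tag, the `extract_openai_fees_ollama` function name, and the `api_token="no-token"` placeholder are illustrative assumptions, not part of the original thread, and `OpenAIModelFee` is the same Pydantic schema defined earlier:

    # Pull a model through Ollama first (shell command). Ollama can also pull
    # GGUF builds straight from Hugging Face via the hf.co/ prefix, e.g.
    # `ollama pull hf.co/<user>/<repo>` -- substitute whichever build you want:
    #
    #   ollama pull llama3.1

    import asyncio
    from pydantic import BaseModel, Field
    from crawl4ai import AsyncWebCrawler
    from crawl4ai.extraction_strategy import LLMExtractionStrategy

    class OpenAIModelFee(BaseModel):  # same schema as earlier in the thread
        model_name: str = Field(..., description="Name of the OpenAI model.")
        input_fee: str = Field(..., description="Fee for input token.")
        output_fee: str = Field(..., description="Fee for output token.")

    async def extract_openai_fees_ollama():
        async with AsyncWebCrawler(verbose=True) as crawler:
            result = await crawler.arun(
                url='https://openai.com/api/pricing',
                word_count_threshold=1,
                extraction_strategy=LLMExtractionStrategy(
                    # provider is "<provider name>/<model name>", the same
                    # pattern as the huggingface example above
                    provider="ollama/llama3.1",
                    # a local Ollama model needs no real key; a placeholder
                    # may still be required depending on your crawl4ai version
                    api_token="no-token",
                    schema=OpenAIModelFee.schema(),
                    extraction_type="schema",
                    instruction="Extract every model name with its input and output token fees.",
                ),
                bypass_cache=True,
            )
            print(len(result.extracted_content))

    if __name__ == "__main__":
        asyncio.run(extract_openai_fees_ollama())

The only substantive change from the Hugging Face version is the `provider` string: litellm splits it on the first slash to pick the backend, so everything after `ollama/` must match a model tag you have already pulled locally.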