
Unable to do LLM extraction with azure openai #174

Closed: MeghanaSrinath closed this issue 2 weeks ago

MeghanaSrinath commented 1 month ago

Hi, we are trying to do LLM extraction using the sample code provided here. This is how we have added the LLM details:

import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def extract_tech_content():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o",
                base_url="https://xxx.openai.azure.com/openai/deployments/xx/chat/completions?api-version=xx",
                api_token="xxxx",
                instruction="Extract only content related to technology"
            ),
            bypass_cache=True,
        )

asyncio.run(extract_tech_content())

These same credentials work in other code we have for other use cases. However, when we run the sample code, we get the error below.

[LOG] 🌀️  Warming up the AsyncWebCrawler
[LOG] 🌞 AsyncWebCrawler is ready to crawl
[LOG] πŸ•ΈοΈ Crawling https://www.nbcnews.com/business using AsyncPlaywrightCrawlerStrategy...
[LOG] βœ… Crawled https://www.nbcnews.com/business successfully!
[LOG] πŸš€ Crawling done for https://www.nbcnews.com/business, success: True, time taken: 8.29 seconds
[LOG] πŸš€ Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.34 seconds
[LOG] πŸ”₯ Extracting semantic blocks for https://www.nbcnews.com/business, Strategy: AsyncWebCrawler
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 0
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 1
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 2
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 3

Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm.set_verbose=True'.

[LOG] Call LLM for https://www.nbcnews.com/business - block index: 4
Error in thread execution: litellm.NotFoundError: NotFoundError: OpenAIException - Error code: 404 - {'error': {'code': '404', 'message': 'Resource not found'}}
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 5
Error in thread execution: litellm.NotFoundError: NotFoundError: OpenAIException - Error code: 404 - {'error': {'code': '404', 'message': 'Resource not found'}}
Error in thread execution: litellm.NotFoundError: NotFoundError: OpenAIException - Error code: 404 - {'error': {'code': '404', 'message': 'Resource not found'}}
[LOG] πŸš€ Extraction done for https://www.nbcnews.com/business, time taken: 33.02 seconds.
Number of tech-related items extracted: 6
Traceback (most recent call last):
  File "C:\test.py", line 31, in <module>
    asyncio.run(extract_tech_content())
  File "C:\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "C:\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\AppData\Local\Programs\Python\Python312\Lib\asyncio\base_events.py", line 687, in run_until_complete     
    return future.result()
           ^^^^^^^^^^^^^^^
  File "C:\test.py", line 28, in extract_tech_content
    with open(".data/tech_content.json", "w", encoding="utf-8") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '.data/tech_content.json'
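
A side note on the final FileNotFoundError in that traceback: it is unrelated to the LLM calls. open() does not create missing directories, so the .data folder has to exist before the script writes to it. A minimal sketch of the fix, assuming the same output path as in the traceback:

import os

os.makedirs(".data", exist_ok=True)  # create the output directory if it is missing
with open(".data/tech_content.json", "w", encoding="utf-8") as f:
    f.write(result.extracted_content)  # result from crawler.arun(...) above
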
unclecode commented 1 month ago

@MeghanaSrinath Thanks for using Crawl4AI. The error message comes from the litellm library, which we use to communicate with the language model. It seems it cannot find a standard OpenAI interface at the base URL you passed. One thing we can do is try the standard OpenAI base URL (do not pass anything) and make sure that works. If it does, there must be something wrong with the base URL you are passing. In the worst case, you can create a temporary API token for me, and I'll test it on my end to figure out why it doesn't work and fix it for you. Also, please share the full code with me, including the part where you save the data into tech_content.json.

mobyds commented 1 month ago

I use a .env file with the following, and I don't set base_url in the LLMExtractionStrategy:

AZURE_API_BASE=https://xxxxx.openai.azure.com/
AZURE_DEPLOYMENT=gpt4o-mini
AZURE_API_VERSION="2024-06-01"
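
For reference, a minimal sketch of how that .env setup could be wired up. The python-dotenv loader, the azure/gpt4o-mini provider string, and an AZURE_API_KEY entry are assumptions here, not spelled out above:

import os
from dotenv import load_dotenv  # assumes python-dotenv is installed
from crawl4ai.extraction_strategy import LLMExtractionStrategy

load_dotenv()  # makes the AZURE_* entries from .env visible to litellm

strategy = LLMExtractionStrategy(
    provider="azure/gpt4o-mini",           # "azure/" prefix + your deployment name
    api_token=os.getenv("AZURE_API_KEY"),  # assumes the key is also in .env
    instruction="Extract only content related to technology",
)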

unclecode commented 1 month ago

@mobyds Follow the explanation at this link: https://docs.litellm.ai/docs/providers/azure

(screenshot of the LiteLLM Azure provider documentation)
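
The short version of that page: litellm routes a request to Azure only when the model string carries the azure/ prefix. A minimal sanity check along those lines, with placeholder values, to verify the credentials outside of Crawl4AI:

import os
from litellm import completion

# "azure/<deployment_name>" tells litellm to use the Azure OpenAI endpoint
response = completion(
    model="azure/gpt4o-mini",
    api_base="https://xxxxx.openai.azure.com/",
    api_version="2024-06-01",
    api_key=os.environ["AZURE_API_KEY"],
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)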

mobyds commented 1 month ago

Sorry, I wasn't clear. I have no problem accessing Azure OpenAI on my side (I used the LiteLLM documentation). I was just sharing my configuration to help @MeghanaSrinath.

MeghanaSrinath commented 1 month ago

Apologies for the delayed response. Below is the complete sample code we tried from the documentation. Since our project data is confidential, we cannot share a temporary token, @unclecode.

import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

os.environ["AZURE_API_KEY"] = "xx"  
os.environ["AZURE_API_BASE"] = "https://xx.openai.azure.com/"
os.environ["AZURE_API_VERSION"] = "xx"

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

url = 'https://openai.com/api/pricing/'
crawler = WebCrawler()
crawler.warmup()

result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy=LLMExtractionStrategy(
        api_token=os.getenv('AZURE_API_KEY'),
        schema=OpenAIModelFee.schema(),
        extraction_type="schema",
        instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
        Do not miss any models in the entire content. One extracted model JSON format should look like this:
        {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
    ),
    bypass_cache=True,
)

print(result.extracted_content)

We also added the Azure details as per @mobyds's suggestion. But even so, after executing the above snippet, this is the error we get. It looks like it does not recognize the Azure OpenAI details and always treats the endpoint as https://platform.openai.com rather than the AZURE_API_BASE that we set.

[LOG] πŸš€ Initializing LocalSeleniumCrawlerStrategy
DevTools listening on ws://127.0.0.1:52379/devtools/browser/bcxxxx
[LOG] 🌀️  Warming up the WebCrawler
[LOG] 🌞 WebCrawler is ready to crawl
[LOG] πŸš€ Crawling done for https://openai.com/api/pricing/, success: True, time taken: 2.84 seconds
[LOG] πŸš€ Content extracted for https://openai.com/api/pricing/, success: True, time taken: 0.06 seconds
[LOG] πŸ”₯ Extracting semantic blocks for https://openai.com/api/pricing/, Strategy: LLMExtractionStrategy
[LOG] Call LLM for https://openai.com/api/pricing/ - block index: 0
[LOG] Call LLM for https://openai.com/api/pricing/ - block index: 1
[LOG] Call LLM for https://openai.com/api/pricing/ - block index: 2
[LOG] Call LLM for https://openai.com/api/pricing/ - block index: 3
[LOG] Call LLM for https://openai.com/api/pricing/ - block index: 4
Error in thread execution: litellm.AuthenticationError: AuthenticationError: OpenAIException - Error code: 401 - {'error': {'message': 'Incorrect API key provided: xxx********************xx. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}
[LOG] πŸš€ Extraction done for https://openai.com/api/pricing/, time taken: 9.78 seconds.

Also, the explanation provided in that link covers litellm completions, but our goal is to use the crawl4ai tool without having to worry about the underlying implementation.

While trying LLMExtractionStrategy, we also tried the JS execution examples from here. But even there, we get the error below. In the crawl4ai documentation, we did not see any details on where the model details have to be added.

content": "litellm.NotFoundError: NotFoundError: OpenAIException - Error code: 404 - {'error': {'code': '404', 'message': 'Resource not found'}}"

Please let me know if any other details are needed.
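
For the JS-execution examples, the model details go in the same place as everywhere else: on the LLMExtractionStrategy itself, not on the JS side. A minimal sketch under that assumption, combining js_code with the strategy (the scroll snippet and the azure/gpt4o-mini deployment name are placeholders):

import asyncio
import os
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            js_code=["window.scrollTo(0, document.body.scrollHeight);"],  # runs in the page before extraction
            extraction_strategy=LLMExtractionStrategy(
                provider="azure/gpt4o-mini",           # the model details live here
                api_token=os.getenv("AZURE_API_KEY"),
                instruction="Extract only content related to technology",
            ),
            bypass_cache=True,
        )
        print(result.extracted_content)

asyncio.run(main())
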

unclecode commented 1 month ago

@MeghanaSrinath I notice you haven't passed the provider. Please look at the code below; for this example, I created a deployment for gpt-4o-mini in my Azure account, and it works well.

In the following code, I have an example of creating a knowledge graph from one of Paul Graham's essays. Just make sure you pass the correct form of the API base. Remember, your api_base should look like this: https://<YOUR_ORG_NAME>.openai.azure.com/openai/deployments/gpt-4o-mini.

import os
import asyncio
from typing import List

from pydantic import BaseModel
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

os.environ["AZURE_API_KEY"] = "YOUR_AZURE_API_KEY"
os.environ["AZURE_API_BASE"] = "YOUR_AZURE_API_BASE"
os.environ["AZURE_API_VERSION"] = "2024-02-15-preview"  # This is just an example, please replace with the correct version

async def main():
    class Entity(BaseModel):
        name: str
        description: str

    class Relationship(BaseModel):
        entity1: Entity
        entity2: Entity
        description: str
        relation_type: str

    class KnowledgeGraph(BaseModel):
        entities: List[Entity]
        relationships: List[Relationship]

    extraction_strategy = LLMExtractionStrategy(
            provider = "azure/gpt-4o-mini", 
            api_base=os.environ["AZURE_API_BASE"],
            api_token=os.environ["AZURE_API_KEY"],
            schema=KnowledgeGraph.model_json_schema(),
            extraction_type="schema",
            instruction="""Extract entities and relationships from the given text."""
    )
    async with AsyncWebCrawler() as crawler:
        url = "https://paulgraham.com/love.html"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            extraction_strategy=extraction_strategy,
        )
        # print(result.extracted_content)
        # NOTE: __data__ is not defined in this snippet; it is assumed to be an output directory path
        with open(os.path.join(__data__, "kb_test.json"), "w") as f:
            f.write(result.extracted_content)

    print("Done")

if __name__ == "__main__":
    asyncio.run(main())

Just one very important thing: there was a mistake in previous versions of Crawl4AI, where the parameter was called base_url instead of api_base. In the new version, we support both. So if you want to run the code above on the current version, change api_base to base_url. Once you upgrade to 0.3.72, which we're going to release soon, api_base will work. I hope this resolves your issue.
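
If one script has to run on both versions, a small compatibility shim is an option. This is just a sketch, assuming the constructor rejects the unknown keyword with a TypeError:

import os
from crawl4ai.extraction_strategy import LLMExtractionStrategy

def make_strategy(endpoint, **kwargs):
    """Pass the Azure endpoint under whichever keyword this version accepts."""
    try:
        return LLMExtractionStrategy(api_base=endpoint, **kwargs)   # 0.3.72 and later
    except TypeError:
        return LLMExtractionStrategy(base_url=endpoint, **kwargs)   # earlier releases

strategy = make_strategy(
    os.environ["AZURE_API_BASE"],
    provider="azure/gpt-4o-mini",
    api_token=os.environ["AZURE_API_KEY"],
)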

MeghanaSrinath commented 2 weeks ago

Thanks for the help. We will try the suggestions.

unclecode commented 1 week ago

@MeghanaSrinath I'd appreciate it if you let me know whether you've managed to get it working. Thanks!