Closed MeghanaSrinath closed 2 weeks ago
@MeghanaSrinath Thanks for using Crawl4AI. The error message is coming from the litellm library that we use to communicate with the language model. It seems litellm cannot find the standard OpenAI interface at the base URL you passed. One thing we can do is try the standard OpenAI base URL (do not pass anything) and make sure that works. If it does, there must be something wrong with the base URL you are passing. In the worst case, you could create a temporary API token for me, and I'll test it on my end, figure out why it doesn't work, and fix it for you. Also, please share the full code with me, including the part where you save the data into tech_content.json.
I use a .env file with the following, and I don't pass base_url in the LLMExtractionStrategy:

```
AZURE_API_BASE=https://xxxxx.openai.azure.com/
AZURE_DEPLOYMENT=gpt4o-mini
AZURE_API_VERSION="2024-06-01"
```
@mobyds Follow the explanation in this link: https://docs.litellm.ai/docs/providers/azure
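For reference, the Azure setup described on that litellm page boils down to routing via the `azure/` model prefix plus a base URL, key, and API version. Here is a minimal sketch of assembling those arguments from environment variables (the helper name and the deployment name are just illustrative, not part of litellm or Crawl4AI):

```python
import os

def azure_llm_kwargs(deployment: str) -> dict:
    """Assemble the keyword arguments litellm expects for an Azure
    OpenAI deployment. Env var names follow the litellm Azure docs."""
    return {
        "model": f"azure/{deployment}",  # the "azure/" prefix selects the Azure provider
        "api_key": os.environ["AZURE_API_KEY"],
        "api_base": os.environ["AZURE_API_BASE"],
        "api_version": os.environ["AZURE_API_VERSION"],
    }

# Usage (requires valid credentials):
# from litellm import completion
# resp = completion(messages=[{"role": "user", "content": "ping"}],
#                   **azure_llm_kwargs("gpt4o-mini"))
```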
Sorry, I wasn't clear. I have no problem accessing Azure OpenAI on my side (I used the LiteLLM documentation). I was just sharing my configuration to help @MeghanaSrinath.
Apologies for the delayed response. Below is the complete sample code we tried from the documentation. Since our project data is confidential, we cannot share a temporary token. @unclecode
```python
import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

os.environ["AZURE_API_KEY"] = "xx"
os.environ["AZURE_API_BASE"] = "https://xx.openai.azure.com/"
os.environ["AZURE_API_VERSION"] = "xx"

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

url = 'https://openai.com/api/pricing/'
crawler = WebCrawler()
crawler.warmup()

result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy=LLMExtractionStrategy(
        api_token=os.getenv('AZURE_API_KEY'),
        schema=OpenAIModelFee.schema(),
        extraction_type="schema",
        instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
        Do not miss any models in the entire content. One extracted model JSON format should look like this:
        {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
    ),
    bypass_cache=True,
)
print(result.extracted_content)
```
We also added the Azure details as per @mobyds's suggestion. But even so, after executing the above snippet, this is the error we are getting. It looks like it is not able to recognize the Azure OpenAI details and always falls back to https://platform.openai.com rather than the AZURE_API_BASE we have set.
```
[LOG] Initializing LocalSeleniumCrawlerStrategy
DevTools listening on ws://127.0.0.1:52379/devtools/browser/bcxxxx
[LOG] Warming up the WebCrawler
[LOG] WebCrawler is ready to crawl
[LOG] Crawling done for https://openai.com/api/pricing/, success: True, time taken: 2.84 seconds
[LOG] Content extracted for https://openai.com/api/pricing/, success: True, time taken: 0.06 seconds
[LOG] Extracting semantic blocks for https://openai.com/api/pricing/, Strategy: LLMExtractionStrategy
[LOG] Call LLM for https://openai.com/api/pricing/ - block index: 0
[LOG] Call LLM for https://openai.com/api/pricing/ - block index: 1
[LOG] Call LLM for https://openai.com/api/pricing/ - block index: 2
[LOG] Call LLM for https://openai.com/api/pricing/ - block index: 3
[LOG] Call LLM for https://openai.com/api/pricing/ - block index: 4
Error in thread execution: litellm.AuthenticationError: AuthenticationError: OpenAIException - Error code: 401 - {'error': {'message': 'Incorrect API key provided: xxx********************xx. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}
[LOG] Extraction done for https://openai.com/api/pricing/, time taken: 9.78 seconds.
```
Also, the explanation in that link covers litellm completions, but our goal is to use the crawl4ai tool without having to worry about the underlying implementation.
While trying LLMExtractionStrategy, we also tried the JS execution examples from here. But even there, we get the error below. As per the crawl4ai documentation, we did not see any details on where the model details have to be added.
```
"content": "litellm.NotFoundError: NotFoundError: OpenAIException - Error code: 404 - {'error': {'code': '404', 'message': 'Resource not found'}}"
```
Please let me know if any other details are needed.
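For context on why the 401 above points at platform.openai.com: litellm chooses the provider from the prefix of the model string, and a bare model name with no prefix defaults to OpenAI, so the Azure key gets sent to the wrong endpoint. A deliberately simplified sketch of that routing rule (not litellm's actual code):

```python
def provider_of(model: str) -> str:
    """Simplified illustration of litellm-style routing: the text before
    the first '/' names the provider; no prefix means plain OpenAI."""
    if "/" in model:
        return model.split("/", 1)[0]
    return "openai"

# provider_of("azure/gpt-4o-mini") → "azure"  (sent to the Azure endpoint)
# provider_of("gpt-4")             → "openai" (sent to api.openai.com)
```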
@MeghanaSrinath I notice you haven't passed the provider. Please look at the following code; it works well. For this example, in my Azure account, I created a deployment for gpt-4o-mini. The code builds a knowledge graph from one of Paul Graham's essays. Just make sure that you are passing the correct form of the API base. Your `api_base` should look like this: https://<YOUR_ORG_NAME>.openai.azure.com/openai/deployments/gpt-4o-mini
```python
import os
import asyncio
from typing import List

from pydantic import BaseModel
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

os.environ["AZURE_API_KEY"] = "YOUR_AZURE_API_KEY"
os.environ["AZURE_API_BASE"] = "YOUR_AZURE_API_BASE"
os.environ["AZURE_API_VERSION"] = "2024-02-15-preview"  # Just an example, please replace with the correct version

async def main():
    class Entity(BaseModel):
        name: str
        description: str

    class Relationship(BaseModel):
        entity1: Entity
        entity2: Entity
        description: str
        relation_type: str

    class KnowledgeGraph(BaseModel):
        entities: List[Entity]
        relationships: List[Relationship]

    extraction_strategy = LLMExtractionStrategy(
        provider="azure/gpt-4o-mini",
        api_base=os.environ["AZURE_API_BASE"],
        api_token=os.environ["AZURE_API_KEY"],
        schema=KnowledgeGraph.model_json_schema(),
        extraction_type="schema",
        instruction="Extract entities and relationships from the given text.",
    )

    async with AsyncWebCrawler() as crawler:
        url = "https://paulgraham.com/love.html"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            extraction_strategy=extraction_strategy,
        )
        # print(result.extracted_content)
        # __data__ is a directory path defined elsewhere in this environment
        with open(os.path.join(__data__, "kb_test.json"), "w") as f:
            f.write(result.extracted_content)
        print("Done")

if __name__ == "__main__":
    asyncio.run(main())
```
Just one very important thing: previous versions of Crawl4AI mistakenly used `url_base` instead of `api_base`. The new version supports both. So if you want to run the code above on the current version, change `api_base` to `url_base`. When you upgrade to 0.3.72, which we're going to release soon, `api_base` will work. I hope this resolves your issue.
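As a quick sanity check on the API base format, the expected deployment-scoped URL shape can be expressed as a tiny helper (a hypothetical function for illustration, not part of Crawl4AI or litellm):

```python
def azure_api_base(org: str, deployment: str) -> str:
    """Build a deployment-scoped Azure OpenAI base URL of the form
    https://<org>.openai.azure.com/openai/deployments/<deployment>."""
    return f"https://{org}.openai.azure.com/openai/deployments/{deployment}"

# azure_api_base("myorg", "gpt-4o-mini")
# → "https://myorg.openai.azure.com/openai/deployments/gpt-4o-mini"
```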
Thanks for the help. We will try the suggestions.
@MeghanaSrinath I'd appreciate it if you could let me know whether you've been able to resolve it. Thx
Hi, we are trying to do the LLM extraction using the sample code provided here. This is how we have added the LLM details.
The same credentials work in other code we have for other use cases. However, when we run the sample code, we get the error below.