Closed arcontechnologies closed 2 months ago
I hit the same issue. I'm trying with Ollama, but it's the same with OpenAI as well.
(crawlAI) PS G:\Git_2\crawl_ai_test> python .\app.py
G:\Git_2\crawl_ai_test\crawlAI\Lib\site-packages\pydantic\_internal\_fields.py:160: UserWarning: Field "model_name" has conflict with protected namespace "model_".
You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
warnings.warn(
[LOG] Initializing LocalSeleniumCrawlerStrategy
DevTools listening on ws://127.0.0.1:62719/devtools/browser/e53a9093-39f2-48ab-88b1-3642be8df631
[LOG] Warming up the WebCrawler
Error retrieving cached URL: no such column: links
[LOG] WebCrawler is ready to crawl
Checking page load: 0
Checking page load: 1
Checking page load: 2
Checking page load: 3
Checking page load: 4
Checking page load: 5
[ERROR] Failed to crawl https://openai.com/api/pricing/, error: Failed to crawl https://openai.com/api/pricing/: 'charmap' codec can't encode character '\U0001f47b' in position 73050: character maps to <undefined>
I only changed the OpenAI model in the example code
import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

url = 'https://openai.com/api/pricing/'

crawler = WebCrawler()
crawler.warmup()

result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy=LLMExtractionStrategy(
        provider="ollama/llama3", api_token=os.getenv(''),
        schema=OpenAIModelFee.schema(),
        extraction_type="schema",
        instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
        Do not miss any models in the entire content. One extracted model JSON format should look like this:
        {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
    ),
    bypass_cache=True,
)
print(result.extracted_content)
Python 3.12.1 Windows 11
I am stuck with the same error as well. Kindly advise.
(venv) PS C:\Users\Tringapps\Documents\webcrawler> python .\crawler.py
C:\Users\Tringapps\Documents\webcrawler\venv\Lib\site-packages\pydantic\_internal\_fields.py:160: UserWarning: Field "model_name" has conflict with protected namespace "model_".
You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
warnings.warn(
[LOG] Initializing LocalSeleniumCrawlerStrategy
DevTools listening on ws://127.0.0.1:58565/devtools/browser/309edc10-bfc9-4266-970e-7c71d1d478b8
[LOG] Warming up the WebCrawler
Error retrieving cached URL: no such column: links
[LOG] WebCrawler is ready to crawl
Checking page load: 0
Checking page load: 1
Checking page load: 2
Checking page load: 3
Checking page load: 4
Checking page load: 5
[ERROR] Failed to crawl https://openai.com/api/pricing/, error: Failed to crawl https://openai.com/api/pricing/: 'charmap' codec can't encode character '\U0001f47b' in position 71832: character maps to <undefined>
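For context, this 'charmap' failure is Windows-specific: when a file is opened for writing without an explicit encoding, Python falls back to the locale codec (cp1252 on most Windows setups), which cannot represent characters like the ghost emoji `'\U0001f47b'` named in the traceback. A minimal sketch of the failure mode and the usual fix; `write_page_text` is a hypothetical helper, not part of crawl4ai:

```python
import tempfile
from pathlib import Path

def write_page_text(text: str, path: Path) -> None:
    """Write crawled text with an explicit codec so the locale default
    (cp1252 / 'charmap' on Windows) is never used."""
    path.write_text(text, encoding="utf-8")

# '\U0001f47b' is the ghost emoji from the traceback; cp1252 cannot encode it.
sample = "OpenAI pricing page \U0001f47b"
try:
    sample.encode("cp1252")          # reproduces the crawler's error anywhere
except UnicodeEncodeError as exc:
    print("reproduced:", exc.reason)  # prints: reproduced: character maps to <undefined>

out = Path(tempfile.mkdtemp()) / "page.html"
write_page_text(sample, out)          # explicit UTF-8 round-trips fine
assert out.read_text(encoding="utf-8") == sample
```

The same applies to console output: older Windows terminals also default to a legacy codec, which is why the error surfaces even when only printing the page.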
Hi,
Can someone explain why I'm experiencing this? I tried different URLs, same outcome:
Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun!
First Step: Create an instance of WebCrawler and call the `warmup()` function. If this is the first time you're running Crawl4ai, this might take a few seconds to load required model files.
[LOG] Initializing LocalSeleniumCrawlerStrategy
DevTools listening on ws://127.0.0.1:61566/devtools/browser/ff15fd50-5eab-447a-97bb-1ca9d57126bb
[LOG] Warming up the WebCrawler
Error retrieving cached URL: no such column: links
[LOG] WebCrawler is ready to crawl
Basic Usage: Simply provide a URL and let Crawl4ai do the magic!
[LOG] Crawling https://www.morningstar.fr/fr using LocalSeleniumCrawlerStrategy...
Checking page load: 0
Checking page load: 1
Checking page load: 2
Checking page load: 3
Checking page load: 4
Checking page load: 5
[ERROR] Failed to crawl https://www.morningstar.fr/fr, error: Failed to crawl https://www.morningstar.fr/fr: 'charmap' codec can't encode character '\uff1a' in position 237304: character maps to <undefined>
[LOG] Basic crawl result:
Result:
url: https://www.mornings...
error_message: Failed to crawl http...
Windows 11 Python 3.11
Any insight ?
Got the same error while scraping the URL. If you found the solution, please share it.
@arcontechnologies @Nikky000 This has been fixed and will be part of version 0.2.73 soon.
@Ritha24 @Udara-Sampath The issue with llama3 is its inability to generate correct JSON, which causes trouble when converting the output to standard JSON. We are working on an additional layer to assist smaller or inadequately trained models in producing JSON output. It's important to note that we can't guarantee complete accuracy, especially with smaller models; we are considering implementing a fallback model. Nevertheless, the upcoming helper layer is expected to address the majority of these issues, so please stay tuned.
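For readers hitting the llama3 JSON problem in the meantime, a rough sketch of the kind of post-processing such a helper layer might do (`coerce_json` is a hypothetical name, not a crawl4ai API; this only handles the common cases of markdown fences and chatty preambles):

```python
import json
import re

def coerce_json(raw: str):
    """Best-effort extraction of the first JSON value from an LLM reply."""
    # Strip markdown code fences that chat models often wrap around JSON.
    cleaned = re.sub(r"^```(?:json)?|```$", "", raw.strip(), flags=re.MULTILINE).strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} or [...] span, if any.
        match = re.search(r"[\[{].*[\]}]", cleaned, flags=re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise
```

For example, `coerce_json('Sure! Here it is: [{"x": 2}]')` returns `[{"x": 2}]`. A truly malformed reply (unbalanced braces, bare prose) still raises, which is where a fallback model would come in.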
I am also getting the same error
[LOG] Initializing LocalSeleniumCrawlerStrategy
DevTools listening on ws://127.0.0.1:55270/devtools/browser/047fe112-fa77-4808-a19d-4e8330e43b76
[LOG] Warming up the WebCrawler
Error retrieving cached URL: no such column: links
[LOG] WebCrawler is ready to crawl
Error retrieving cached URL: no such column: links
[ERROR] Failed to crawl https://openai.com/api/pricing/, error: Failed to crawl https://openai.com/api/pricing/: 'charmap' codec can't encode character '\U0001f47b' in position 71832: character maps to <undefined>
None
@xarchangel12 Please remove the "~/.crawl4ai" folder, as you're using the old cache database version. My bad for not removing it during the new version install. I'm pushing 0.2.74, so you can reinstall it. In the meantime, remove the folder and then install the library again.
@unclecode Hi, I'm still dealing with the same error after upgrading to 0.2.73 and removing the .crawl4ai cache.
Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun!
First Step: Create an instance of WebCrawler and call the `warmup()` function.
If this is the first time you're running Crawl4ai, this might take a few seconds to load required model files.
[LOG] Initializing LocalSeleniumCrawlerStrategy
DevTools listening on ws://127.0.0.1:65485/devtools/browser/4abcfd66-2a6b-4e93-aee4-3bf423bec128
[LOG] Warming up the WebCrawler
Error retrieving cached URL: no such column: links
[LOG] WebCrawler is ready to crawl
Basic Usage: Simply provide a URL and let Crawl4ai do the magic!
[LOG] Crawling https://openai.com/api/pricing/ using LocalSeleniumCrawlerStrategy...
[ERROR] Failed to crawl https://openai.com/api/pricing/, error: Failed to crawl https://openai.com/api/pricing/: 'charmap' codec can't encode character '\U0001f47b' in position 74044: character maps to <undefined>
[LOG] Basic crawl result:
Result:
url: https://openai.com/a...
error_message: Failed to crawl http...
My environment: Windows 11, Python 3.11 (venv).
@arcontechnologies Please clone the latest version from this link "https://github.com/unclecode/crawl4ai/tree/v0.2.74" and make sure to remove the ".crawl4ai" folder. Afterwards, put the code below in a separate file and, instead of running quickstart.py, run that file with your current version. Please let me know.
import os
import sys
from pathlib import Path
import numpy as np
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.crawler_strategy import LocalSeleniumCrawlerStrategy
from crawl4ai.web_crawler import WebCrawler
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

crawler = WebCrawler()
crawler.warmup()

url = r'https://openai.com/api/pricing/'

# Fetch a single page
result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'),
        schema=OpenAIModelFee.model_json_schema(),
        extraction_type="schema",
        instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
        Do not miss any models in the entire content. One extracted model JSON format should look like this:
        {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
    ),
    bypass_cache=True,
)
I deleted the "~/.crawl4ai" folder and I am still facing the same error
[LOG] Initializing LocalSeleniumCrawlerStrategy
DevTools listening on ws://127.0.0.1:51441/devtools/browser/2ced0dd3-a5e7-41a0-91c9-e58b1ea06d8a
[LOG] Warming up the WebCrawler
Error retrieving cached URL: no such column: links
[LOG] WebCrawler is ready to crawl
Error retrieving cached URL: no such column: links
[ERROR] Failed to crawl https://openai.com/api/pricing/, error: Failed to crawl https://openai.com/api/pricing/: 'charmap' codec can't encode character '\U0001f47b' in position 72375: character maps to <undefined>
None
And here is the test code I used, as in the video:
crawler = WebCrawler()
crawler.warmup()
result = crawler.run(url="https://openai.com/api/pricing/")
print(result.markdown)
@xarchangel12 Found it at last! I had been searching for this for a few days. Could you kindly clean everything up, then proceed with these steps and test your code again? The commands below show how I tested it on my Windows system; it is now working after resolving the problem.
mkdir crawl4ai-v74
git clone -b v0.2.74 https://github.com/unclecode/crawl4ai .\crawl4ai-v74\
py -3.11 -m venv myenv
myenv\Scripts\activate
pip install -e .
@xarchangel12 I have already pushed the latest version, so you can clone the main branch; no need for "v0.2.74". Thx!
@unclecode Thanks for your feedback. That worked for me, with a small update regarding the subprocess call in setup.py. Your code:
if os.path.exists(f"{crawl4ai_folder}/cache"):
    subprocess.run(["rm", "-rf", f"{crawl4ai_folder}/cache"])
is referring to Unix-style commands, whereas on Windows it should look something like this:
import shutil

cache_path = os.path.join(crawl4ai_folder, 'cache')
if os.path.exists(cache_path):
    shutil.rmtree(cache_path)
That being said, I would like to thank you for all the hard work you're doing to make this package a successful one.
@arcontechnologies Thx for your nice words and you are absolutely right. I will update it soon.