unclecode / crawl4ai

πŸ”₯πŸ•·οΈ Crawl4AI: Open-source LLM Friendly Web Crawler & Scrapper
Apache License 2.0

Error on 'charmap' codec can't encode character #40

Closed arcontechnologies closed 2 months ago

arcontechnologies commented 3 months ago

Hi,

Can someone explain why I'm experiencing this? I tried different URLs, with the same outcome:

🌟 Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun! 🌐
⛳️ First Step: Create an instance of WebCrawler and call the `warmup()` function.
If this is the first time you're running Crawl4ai, this might take a few seconds to load required model files.
[LOG] πŸš€ Initializing LocalSeleniumCrawlerStrategy

DevTools listening on ws://127.0.0.1:61566/devtools/browser/ff15fd50-5eab-447a-97bb-1ca9d57126bb
[LOG] 🌀️  Warming up the WebCrawler
Error retrieving cached URL: no such column: links
[LOG] 🌞 WebCrawler is ready to crawl
πŸ› οΈ  Basic Usage: Simply provide a URL and let Crawl4ai do the magic!
[LOG] πŸ•ΈοΈ Crawling https://www.morningstar.fr/fr using LocalSeleniumCrawlerStrategy...
Checking page load: 0
Checking page load: 1
Checking page load: 2
Checking page load: 3
Checking page load: 4
Checking page load: 5
[ERROR] 🚫 Failed to crawl https://www.morningstar.fr/fr, error: Failed to crawl https://www.morningstar.fr/fr: 'charmap' codec can't encode character '\uff1a' in position 237304: character maps to <undefined>
[LOG] πŸ“¦ Basic crawl result:
        Result:
        url: https://www.mornings...
        error_message: Failed to crawl http... 

Windows 11 Python 3.11

Any insight?
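For context, the error itself is easy to reproduce outside the library: on Windows, text is often encoded with the locale's "charmap" codec (typically cp1252), which cannot represent characters such as the fullwidth colon '\uff1a' in the log above. A minimal sketch (illustrative, not crawl4ai code):

```python
# Minimal reproduction of the 'charmap' error (illustrative, not crawl4ai code).
# cp1252 is a typical Windows locale codec; it is implemented via "charmap",
# which is why the traceback names that codec.
text = "Fullwidth colon: \uff1a"

try:
    text.encode("cp1252")
except UnicodeEncodeError as exc:
    # e.g. 'charmap' codec can't encode character '\uff1a' ... <undefined>
    print(exc)

# Encoding (or opening files) with an explicit UTF-8 codec avoids the error:
encoded = text.encode("utf-8")
# with open("page.html", "w", encoding="utf-8") as f:
#     f.write(text)
```

This is why the crash only shows up on Windows: on Linux/macOS the default encoding is already UTF-8.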

Udara-Sampath commented 3 months ago

I hit the same issue. I'm trying with Ollama, but it's the same with OpenAI as well.

(crawlAI) PS G:\Git_2\crawl_ai_test> python .\app.py
G:\Git_2\crawl_ai_test\crawlAI\Lib\site-packages\pydantic\_internal\_fields.py:160: UserWarning: Field "model_name" has conflict with protected namespace "model_".

You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
  warnings.warn(
[LOG] πŸš€ Initializing LocalSeleniumCrawlerStrategy

DevTools listening on ws://127.0.0.1:62719/devtools/browser/e53a9093-39f2-48ab-88b1-3642be8df631
[LOG] 🌀️  Warming up the WebCrawler
Error retrieving cached URL: no such column: links
[LOG] 🌞 WebCrawler is ready to crawl
Checking page load: 0
Checking page load: 1
Checking page load: 2
Checking page load: 3
Checking page load: 4
Checking page load: 5
[ERROR] 🚫 Failed to crawl https://openai.com/api/pricing/, error: Failed to crawl https://openai.com/api/pricing/: 'charmap' codec can't encode character '\U0001f47b' in position 73050: character maps to <undefined>

I only changed the model in the example code:

import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(...,
                           description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(...,
                            description="Fee for output token for the OpenAI model.")

url = 'https://openai.com/api/pricing/'
crawler = WebCrawler()
crawler.warmup()

result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy=LLMExtractionStrategy(
        provider="ollama/llama3", api_token=os.getenv(''),
        schema=OpenAIModelFee.schema(),
        extraction_type="schema",
        instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
            Do not miss any models in the entire content. One extracted model JSON format should look like this: 
            {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
    ),
    bypass_cache=True,
)

print(result.extracted_content)

Python 3.12.1 Windows 11

Ritha24 commented 3 months ago

I am stuck with the same error as well. Kindly advise.

(venv) PS C:\Users\Tringapps\Documents\webcrawler> python .\crawler.py
C:\Users\Tringapps\Documents\webcrawler\venv\Lib\site-packages\pydantic\_internal\_fields.py:160: UserWarning: Field "model_name" has conflict with protected namespace "model_".

You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
  warnings.warn(
[LOG] 🚀 Initializing LocalSeleniumCrawlerStrategy

DevTools listening on ws://127.0.0.1:58565/devtools/browser/309edc10-bfc9-4266-970e-7c71d1d478b8
[LOG] 🌤️  Warming up the WebCrawler
Error retrieving cached URL: no such column: links
[LOG] 🌞 WebCrawler is ready to crawl
Checking page load: 0
Checking page load: 1
Checking page load: 2
Checking page load: 3
Checking page load: 4
Checking page load: 5
[ERROR] 🚫 Failed to crawl https://openai.com/api/pricing/, error: Failed to crawl https://openai.com/api/pricing/: 'charmap' codec can't encode character '\U0001f47b' in position 71832: character maps to <undefined>

Nikky000 commented 3 months ago

I got the same error when scraping the URL. If you have found a solution, please share it.

unclecode commented 3 months ago

@arcontechnologies @Nikky000 This has been fixed and will be part of version 0.2.73 soon.

(screenshot attached)

unclecode commented 3 months ago

@Ritha24 @Udara-Sampath The issue with llama3 is its inability to generate correct JSON, which makes it hard to convert the output to standard JSON. We are working on an additional layer to help smaller or inadequately trained models produce valid JSON output. Note that we can't guarantee complete accuracy, especially with smaller models, and we are considering a fallback model. Nevertheless, the upcoming helper layer should address the majority of these issues, so please stay tuned.
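As an illustration of what such a helper layer might do (a hypothetical sketch, not the library's actual implementation), one common approach is to scan the model's chatty output for the first parseable JSON fragment:

```python
import json
import re

def extract_json_block(text: str):
    """Hypothetical helper: pull the first parseable JSON object or
    array out of free-form model output. Not crawl4ai's actual code."""
    # Best case: the whole reply is already valid JSON.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fallback: greedily grab the outermost {...} or [...] span.
    match = re.search(r"\{.*\}|\[.*\]", text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            return None
    return None

messy = 'Sure! Here is the JSON:\n{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens"}\nHope that helps!'
print(extract_json_block(messy)["model_name"])  # GPT-4
```

A production layer would likely go further (repairing trailing commas, unquoted keys, etc.), but even this fallback rescues the typical "Sure, here is your JSON:" preamble that small models emit.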

jasonjos111 commented 3 months ago

I am also getting the same error

[LOG] πŸš€ Initializing LocalSeleniumCrawlerStrategy

DevTools listening on ws://127.0.0.1:55270/devtools/browser/047fe112-fa77-4808-a19d-4e8330e43b76
[LOG] 🌀️  Warming up the WebCrawler
Error retrieving cached URL: no such column: links
[LOG] 🌞 WebCrawler is ready to crawl
Error retrieving cached URL: no such column: links
[ERROR] 🚫 Failed to crawl https://openai.com/api/pricing/, error: Failed to crawl https://openai.com/api/pricing/: 'charmap' codec can't encode character '\U0001f47b' in position 71832: character maps to <undefined>
None
unclecode commented 3 months ago

@xarchangel12 Please remove the "~/.crawl4ai" folder; you're using the old cache database version. My bad for not removing it during the new-version install. I'm pushing 0.2.74 so you can reinstall. In the meantime, remove the folder and then install the library again.
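For anyone on Windows without `rm -rf`, the cleanup step can be done cross-platform; a small sketch (the "~/.crawl4ai" path comes from the comment above; the helper itself is illustrative, not part of the library):

```python
import shutil
from pathlib import Path

def clear_crawl4ai_cache(home: Path = Path.home()) -> bool:
    """Delete the stale "~/.crawl4ai" folder if present.

    Illustrative helper, not part of the library. Returns True if a
    folder was removed, False if there was nothing to delete.
    """
    cache_dir = home / ".crawl4ai"
    if cache_dir.exists():
        shutil.rmtree(cache_dir)
        return True
    return False
```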

arcontechnologies commented 3 months ago

@unclecode Hi, still seeing the same error after upgrading to 0.2.73 and removing the .crawl4ai cache.

Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun! 🌐
⛳️ First Step: Create an instance of WebCrawler and call the `warmup()` function.
If this is the first time you're running Crawl4ai, this might take a few seconds to load required model files.
[LOG] πŸš€ Initializing LocalSeleniumCrawlerStrategy

DevTools listening on ws://127.0.0.1:65485/devtools/browser/4abcfd66-2a6b-4e93-aee4-3bf423bec128
[LOG] 🌀️  Warming up the WebCrawler
Error retrieving cached URL: no such column: links
[LOG] 🌞 WebCrawler is ready to crawl
πŸ› οΈ  Basic Usage: Simply provide a URL and let Crawl4ai do the magic!
[LOG] πŸ•ΈοΈ Crawling https://openai.com/api/pricing/ using LocalSeleniumCrawlerStrategy...
[ERROR] 🚫 Failed to crawl https://openai.com/api/pricing/, error: Failed to crawl https://openai.com/api/pricing/: 'charmap' codec can't encode character '\U0001f47b' in position 74044: character maps to <undefined>
[LOG] πŸ“¦ Basic crawl result:
        Result:
        url: https://openai.com/a...
        error_message: Failed to crawl http...

My environment: Windows 11, Python 3.11 (venv).

unclecode commented 3 months ago

@arcontechnologies Please clone the latest version from "https://github.com/unclecode/crawl4ai/tree/v0.2.74" and make sure to remove the ".crawl4ai" folder. Afterwards, put the code below in a separate file and, instead of running quickstart.py, run that file with your current version. Please let me know.

import os
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.web_crawler import WebCrawler
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

crawler = WebCrawler()
crawler.warmup()

url = r'https://openai.com/api/pricing/'

# Fetch a single page
result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy= LLMExtractionStrategy(
        provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'), 
        schema=OpenAIModelFee.model_json_schema(),
        extraction_type="schema",
        instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
        Do not miss any models in the entire content. One extracted model JSON format should look like this: 
        {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
    ),            
    bypass_cache=True,
)

print(result.extracted_content)

jasonjos111 commented 3 months ago

https://github.com/unclecode/crawl4ai/tree/v0.2.74

I deleted the "~/.crawl4ai" folder and I am still facing the same error

[LOG] πŸš€ Initializing LocalSeleniumCrawlerStrategy

DevTools listening on ws://127.0.0.1:51441/devtools/browser/2ced0dd3-a5e7-41a0-91c9-e58b1ea06d8a
[LOG] 🌀️  Warming up the WebCrawler
Error retrieving cached URL: no such column: links
[LOG] 🌞 WebCrawler is ready to crawl
Error retrieving cached URL: no such column: links
[ERROR] 🚫 Failed to crawl https://openai.com/api/pricing/, error: Failed to crawl https://openai.com/api/pricing/: 'charmap' codec can't encode character '\U0001f47b' in position 72375: character maps to <undefined>
None

and here is the test code I used, following the video:


from crawl4ai import WebCrawler

crawler = WebCrawler()
crawler.warmup()

result = crawler.run(url="https://openai.com/api/pricing/")
print(result.markdown)

unclecode commented 2 months ago

@xarchangel12 Found it at last! I had been searching for this for a few days. Could you please clean up everything, then follow these steps and test your code again? The commands below show how I tested it on my Windows system; everything works after the fix.

mkdir crawl4ai-v74
git clone -b v0.2.74 https://github.com/unclecode/crawl4ai .\crawl4ai-v74\
py -3.11 -m venv myenv
myenv\Scripts\activate
pip install -e .\crawl4ai-v74\

unclecode commented 2 months ago

@xarchangel12 I have already pushed the latest version, so you can clone the main branch; there is no need for "v0.2.74". Thx!

arcontechnologies commented 2 months ago

@unclecode Thanks for your feedback. That worked for me, with a small update to the subprocess call in setup.py. Your code:

if os.path.exists(f"{crawl4ai_folder}/cache"):
    subprocess.run(["rm", "-rf", f"{crawl4ai_folder}/cache"])

uses a Unix-style command; on Windows it should look something like this:

cache_path = os.path.join(crawl4ai_folder, 'cache')
if os.path.exists(cache_path):
    shutil.rmtree(cache_path)

That being said, I would like to thank you for all the hard work you're putting into making this package a successful one.

unclecode commented 2 months ago

@arcontechnologies Thx for your kind words, and you are absolutely right. I will update it soon.