unclecode / crawl4ai

πŸ”₯πŸ•·οΈ Crawl4AI: Open-source LLM Friendly Web Crawler & Scrapper
Apache License 2.0
15.49k stars · 1.11k forks

'charmap' codec can't encode characters in position 15540-15544: character maps to <undefined> #42

Closed · Nikky000 closed 4 months ago

Nikky000 commented 4 months ago

Here is the code:

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai import WebCrawler
import together
import litellm
from litellm import completion

# Initialize and warm up the crawler
crawler = WebCrawler()
crawler.warmup()

# Define the extraction strategy
strategy = LLMExtractionStrategy(
    provider='together_ai/togethercomputer/llama-2-70b-chat',
    api_token='api_key',
    instruction="extract all the reviews"
)

# Sample URL
url = "https://www.myntra.com/trolley-bag/skybags/skybags-printed-soft-sided-large-trolley-suitcase/23933456/buy"

# Run the crawler with the extraction strategy
try:
    result = crawler.run(url=url, extraction_strategy=strategy)

    # Ensure the content is in UTF-8
    if result.extracted_content is None:
        raise ValueError("No content extracted")
    extracted_content = result.extracted_content.encode('utf-8', errors='replace').decode('utf-8')
    print(extracted_content)

except Exception as e:
    print(f"[ERROR] Failed to crawl {url}, error: {e}")
```

Here is the error:

```
[LOG] 🚀 Initializing LocalSeleniumCrawlerStrategy
[LOG] 🌤️ Warming up the WebCrawler
Error retrieving cached URL: no such column: links
[LOG] 🌞 WebCrawler is ready to crawl
Error retrieving cached URL: no such column: links
Checking page load: 0
Checking page load: 1
Checking page load: 2
Checking page load: 3
Checking page load: 4
Checking page load: 5
[ERROR] 🚫 Failed to crawl https://www.myntra.com/trolley-bag/skybags/skybags-printed-soft-sided-large-trolley-suitcase/23933456/buy, error: Failed to crawl https://www.myntra.com/trolley-bag/skybags/skybags-printed-soft-sided-large-trolley-suitcase/23933456/buy: 'charmap' codec can't encode characters in position 15540-15544: character maps to <undefined>
[ERROR] 🚫 Failed to crawl https://www.myntra.com/trolley-bag/skybags/skybags-printed-soft-sided-large-trolley-suitcase/23933456/buy, error: No content extracted
```
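For context: this error is Windows-specific. When text is written to a file without an explicit encoding, Python falls back to the locale codec (cp1252, which reports itself as 'charmap'), and that codec cannot represent characters such as emoji or '\u2713'. Below is a minimal sketch of the failure mode and the usual fix; the file name is hypothetical, and exactly where crawl4ai performs the offending write is an assumption:

```python
text = "checkmark: \u2713"  # '\u2713' is the character from a later report in this thread

try:
    # On Windows, open() without an encoding uses the locale codec (cp1252,
    # reported as 'charmap'), which cannot represent this character.
    with open("page_cache.txt", "w") as f:
        f.write(text)
except UnicodeEncodeError as e:
    print(e)  # 'charmap' codec can't encode character '\u2713' ...

# The usual fix is to pass the encoding explicitly:
with open("page_cache.txt", "w", encoding="utf-8") as f:
    f.write(text)
```

Until a fixed build is installed, running the script in Python's UTF-8 mode (`python -X utf8 crawler.py`) usually sidesteps the error, since it makes UTF-8 the default encoding for `open()`.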

unclecode commented 4 months ago

@Nikky000 This is already resolved and will be available in v0.2.73.

[image]

See if you can spot what the issue was. 😉

Ritha24 commented 4 months ago

I tried the new v0.2.73 version, but I'm still facing the same problem during extraction.

```
(venv) PS C:\Users\Tringapps\Documents\webcrawler> python .\crawler.py
[LOG] 🚀 Initializing LocalSeleniumCrawlerStrategy

DevTools listening on ws://127.0.0.1:61955/devtools/browser/b634b0e7-4863-4dd4-a1db-42ebc1cc67ba
[LOG] 🌤️ Warming up the WebCrawler
Error retrieving cached URL: no such column: links
[LOG] 🌞 WebCrawler is ready to crawl
Error retrieving cached URL: no such column: links
[ERROR] 🚫 Failed to crawl https://www.dell.com/support/home/en-hk/product-support/product/alienware-alpha/overview, error: Failed to crawl https://www.dell.com/support/home/en-hk/product-support/product/alienware-alpha/overview: 'charmap' codec can't encode character '\u2713' in position 39308: character maps to <undefined>
None
```

I updated to the latest version:

```
(venv) PS C:\Users\Tringapps\Documents\webcrawler> pip show crawl4ai
Name: Crawl4AI
Version: 0.2.73
Summary: 🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scrapper
Home-page: https://github.com/unclecode/crawl4ai
Author: Unclecode
Author-email: unclecode@kidocode.com
License: MIT
Location: C:\Users\Tringapps\Documents\webcrawler\venv\Lib\site-packages
Requires: aiohttp, aiosqlite, beautifulsoup4, chromedriver-autoinstaller, fastapi, html2text, httpx, litellm, pillow, pydantic, python-dotenv, requests, rich, selenium, uvicorn, webdriver-manager
```

unclecode commented 4 months ago

@Ritha24, could you please share the code snippet? I am able to crawl the link but am unable to replicate your error. Additionally, could you debug the source code and indicate which line caused this issue?

[image]

Ritha24 commented 4 months ago

I am just importing crawl4ai as a library. One question: will crawl4ai also look for sub-URLs and crawl through those? Below is the code snippet.

```python
import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

url = 'https://openai.com/api/pricing/'
crawler = WebCrawler()
crawler.warmup()

result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY'),
        schema=OpenAIModelFee.schema(),
        extraction_type="schema",
        instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. Do not miss any models in the entire content. One extracted model JSON format should look like this:
        {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
    ),
    bypass_cache=True,
)

print(result.extracted_content)
```


unclecode commented 4 months ago

We already have a backlog item to add deep crawling. The plan is to extract a website sitemap and then extract all the links. Another version involves using a graph search algorithm to extract links based on a specific depth and a threshold to determine if enough information is collected. We'll release these in future versions.
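In the meantime, a depth-limited crawl can be approximated by hand on top of the current API. This is a rough sketch, not the planned feature; the `links` attribute on the result object is an assumption, so inspect what the result actually exposes before relying on it:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

from crawl4ai import WebCrawler

def deep_crawl(start_url: str, max_depth: int = 2, max_pages: int = 50):
    """Breadth-first crawl up to max_depth, staying on the start domain."""
    crawler = WebCrawler()
    crawler.warmup()

    domain = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])   # (url, depth) pairs still to visit
    seen = {start_url}
    results = []

    while queue and len(results) < max_pages:
        url, depth = queue.popleft()
        result = crawler.run(url=url)
        results.append(result)
        if depth >= max_depth:
            continue
        # `result.links` is assumed here; adjust to however the result object
        # actually exposes the extracted links.
        for link in getattr(result, "links", []) or []:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return results
```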

Regarding the code, it looks okay and runs on different machines, so I'm not sure why it's failing on your side. Do me a favor: in your debug configuration file, set the "justMyCode" flag to false to allow debugging of third-party libraries, then run the code and look at the error traceback. Share the traceback with me so I can identify which part of the library is causing the issue and work backward from there.
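If setting up the debugger is a hassle, the same information can be captured with a plain traceback. A minimal sketch (the extraction strategy is omitted here just to keep it short):

```python
import traceback

from crawl4ai import WebCrawler

crawler = WebCrawler()
crawler.warmup()

try:
    result = crawler.run(url='https://openai.com/api/pricing/', bypass_cache=True)
    print(result.extracted_content)
except Exception:
    # Unlike the one-line error message, print_exc() shows the full stack,
    # including the frames inside crawl4ai that raised the UnicodeEncodeError.
    traceback.print_exc()
```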

Right now, when I execute your code, it works:

```python
import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

url = 'https://openai.com/api/pricing/'
crawler = WebCrawler()
crawler.warmup()

result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY'),
        schema=OpenAIModelFee.schema(),
        extraction_type="schema",
        instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. Do not miss any models in the entire content. One extracted model JSON format should look like this:
        {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
    ),
    bypass_cache=True,
)

print(result.extracted_content)
```

unclecode commented 4 months ago

@Ritha24 I'm closing this issue, but you are welcome to continue your questions here.