Closed: Nikky000 closed this issue 4 months ago.
@Nikky000 This is already resolved and will be available in version v0.2.73.
See if you can spot what the issue was?
I tried the new v0.2.73 version, but I'm still facing the same problem with extraction.
```
(venv) PS C:\Users\Tringapps\Documents\webcrawler> python .\crawler.py
[LOG] Initializing LocalSeleniumCrawlerStrategy

DevTools listening on ws://127.0.0.1:61955/devtools/browser/b634b0e7-4863-4dd4-a1db-42ebc1cc67ba

[LOG] Warming up the WebCrawler
Error retrieving cached URL: no such column: links
[LOG] WebCrawler is ready to crawl
Error retrieving cached URL: no such column: links
[ERROR] Failed to crawl https://www.dell.com/support/home/en-hk/product-support/product/alienware-alpha/overview, error: Failed to crawl https://www.dell.com/support/home/en-hk/product-support/product/alienware-alpha/overview: 'charmap' codec can't encode character '\u2713' in position 39308: character maps to <undefined>
None
```
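For context on the error itself: `'charmap' codec can't encode character` is a Windows-specific symptom. Without an explicit encoding, Python falls back to the legacy code page (usually cp1252, whose codec reports itself as `'charmap'`), and characters such as U+2713 (✓) on the Dell page have no mapping there. A minimal sketch of the failure mode and the usual workaround, independent of crawl4ai (the assumption being that the library writes the fetched page text with the platform default encoding):

```python
# Sketch of the 'charmap' failure mode on Windows.
text = "Availability: \u2713"  # U+2713 CHECK MARK, present in the crawled page

try:
    # cp1252 is the usual Windows default; its codec is named 'charmap'
    # and has no mapping for U+2713, so encoding raises.
    text.encode("cp1252")
except UnicodeEncodeError as e:
    print(e)  # 'charmap' codec can't encode character '\u2713' ... maps to <undefined>

# Encoding explicitly as UTF-8 (or opening files with encoding="utf-8") works:
data = text.encode("utf-8")
```

Setting `PYTHONUTF8=1` (UTF-8 mode, Python 3.7+) or `PYTHONIOENCODING=utf-8` in the environment forces UTF-8 defaults process-wide, which usually sidesteps this class of error.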
I updated to the latest version:
```
(venv) PS C:\Users\Tringapps\Documents\webcrawler> pip show crawl4ai
Name: Crawl4AI
Version: 0.2.73
Summary: Crawl4AI: Open-source LLM Friendly Web Crawler & Scrapper
Home-page: https://github.com/unclecode/crawl4ai
Author: Unclecode
Author-email: unclecode@kidocode.com
License: MIT
Location: C:\Users\Tringapps\Documents\webcrawler\venv\Lib\site-packages
Requires: aiohttp, aiosqlite, beautifulsoup4, chromedriver-autoinstaller, fastapi, html2text, httpx, litellm, pillow, pydantic, python-dotenv, requests, rich, selenium, uvicorn, webdriver-manager
```
@Ritha24, could you please share the code snippet? I am able to crawl the link but am unable to replicate your error. Additionally, could you debug the source code and indicate which line caused this issue?
I am just trying to import crawl4ai as a library. One question: will crawl4ai look for sub-URLs and crawl through those as well? Below is the code snippet.
```python
import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

url = 'https://openai.com/api/pricing/'
crawler = WebCrawler()
crawler.warmup()

result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY'),
        schema=OpenAIModelFee.schema(),
        extraction_type="schema",
        instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. Do not miss any models in the entire content. One extracted model JSON format should look like this:
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
    ),
    bypass_cache=True,
)

print(result.extracted_content)
```
We already have a backlog item to add deep crawling. The plan is to extract a website's sitemap and then crawl all of its links. Another version uses a graph-search algorithm that follows links to a specific depth, with a threshold to determine whether enough information has been collected; a rough sketch of that approach follows below. We'll release these in future versions.
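As an illustration of that second, depth-limited approach (a sketch, not the library's implementation), using `requests` and `beautifulsoup4`, both already in crawl4ai's dependency list:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def collect_links(start_url: str, max_depth: int = 2) -> set[str]:
    """Breadth-first crawl that gathers same-domain links up to max_depth."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    domain = urlparse(start_url).netloc

    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip unreachable pages, keep crawling the rest
        for anchor in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, anchor["href"]).split("#")[0]
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen

# Each collected URL could then be fed to crawler.run() individually.
```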
Regarding the code, it seems okay and runs on different machines; I'm not sure why it's not working on your side. Do me a favor: in your debug configuration file, set the "justMyCode" flag to false to allow debugging of third-party libraries. Then run the code, look at the error traceback, and share it with me so I can identify which part of the library is causing the issue and backtrack it.
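If the debugger route is awkward, the full traceback can also be captured in code. A minimal sketch, reusing the snippet above without the extraction strategy for brevity:

```python
import traceback

from crawl4ai import WebCrawler

url = "https://openai.com/api/pricing/"
crawler = WebCrawler()
crawler.warmup()

try:
    result = crawler.run(url=url, word_count_threshold=1, bypass_cache=True)
    print(result.extracted_content)
except Exception:
    # Print the complete stack, including frames inside third-party code,
    # so the failing line inside the library can be identified and shared.
    traceback.print_exc()
```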
Right now, when I execute your code, it works:
```python
import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

url = 'https://openai.com/api/pricing/'
crawler = WebCrawler()
crawler.warmup()

result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY'),
        schema=OpenAIModelFee.schema(),
        extraction_type="schema",
        instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. Do not miss any models in the entire content. One extracted model JSON format should look like this:
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
    ),
    bypass_cache=True,
)

print(result.extracted_content)
```
@Ritha24 I'm closing this issue, but you're welcome to continue your questions here.
Here is the code:
```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai import WebCrawler
import together
import litellm
from litellm import completion

# Initialize and warm up the crawler
crawler = WebCrawler()
crawler.warmup()

# Define the extraction strategy
strategy = LLMExtractionStrategy(
    provider='together_ai/togethercomputer/llama-2-70b-chat',
    api_token='api_key',
    instruction="extract all the reviews"
)

# Sample URL
url = "https://www.myntra.com/trolley-bag/skybags/skybags-printed-soft-sided-large-trolley-suitcase/23933456/buy"

# Run the crawler with the extraction strategy
try:
    result = crawler.run(url=url, extraction_strategy=strategy)
    # Ensure the content is in UTF-8
except Exception as e:
    print(f"[ERROR] Failed to crawl {url}, error: {e}")
```
Here is the error:
```
[LOG] Initializing LocalSeleniumCrawlerStrategy
[LOG] Warming up the WebCrawler
Error retrieving cached URL: no such column: links
[LOG] WebCrawler is ready to crawl
Error retrieving cached URL: no such column: links
Checking page load: 0
Checking page load: 1
Checking page load: 2
Checking page load: 3
Checking page load: 4
Checking page load: 5
[ERROR] Failed to crawl https://www.myntra.com/trolley-bag/skybags/skybags-printed-soft-sided-large-trolley-suitcase/23933456/buy, error: Failed to crawl https://www.myntra.com/trolley-bag/skybags/skybags-printed-soft-sided-large-trolley-suitcase/23933456/buy: 'charmap' codec can't encode characters in position 15540-15544: character maps to <undefined>
[ERROR] Failed to crawl https://www.myntra.com/trolley-bag/skybags/skybags-printed-soft-sided-large-trolley-suitcase/23933456/buy, error: No content extracted
```
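This is the same cp1252/'charmap' failure as in the Dell log above. Assuming the library writes the fetched HTML with the platform default encoding, forcing UTF-8 for the whole process is a quick test (PowerShell syntax):

```
(venv) PS C:\Users\Tringapps\Documents\webcrawler> $env:PYTHONUTF8 = "1"
(venv) PS C:\Users\Tringapps\Documents\webcrawler> python .\crawler.py
```

If the error disappears under UTF-8 mode, that confirms a default-encoding issue rather than a problem with the extraction strategy itself.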