unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scrapper
Apache License 2.0
3.04k stars 250 forks

Unable to Scrape PDF URL #58

Closed nagendrakumar02 closed 2 months ago

nagendrakumar02 commented 2 months ago

I'm unable to scrape a PDF URL using crawl4ai. The URL in question is https://www.myelectric.coop/wp-content/uploads/Electric-Vehicle-Charging-Equipment-Rebates.pdf.

Also, is there an example of using crawl4ai with Azure OpenAI?

Steps to Reproduce:

1. Attempt to scrape the PDF URL using crawl4ai.
2. Observe that the scraping process fails or returns an error.

Expected Behavior:

crawl4ai should be able to scrape the PDF URL and return its contents.

Actual Behavior:

crawl4ai is unable to scrape the PDF URL and returns an error or fails to complete the scraping process.

Error Message: Failed to crawl https://www.myelectric.coop/wp-content/uploads/Electric-Vehicle-Charging-Equipment-Rebates.pdf, error: can only concatenate str (not "NoneType") to str
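For context, the text after "error:" is Python's standard TypeError message for joining a str and None with "+". The snippet below is a minimal sketch that reproduces the same message outside crawl4ai. The names build_output and safe_build_output are hypothetical, and this is not a claim about where the error originates inside the library.

```python
def build_output(title, body):
    # Hypothetical helper: joins two pieces of text with "+".
    return title + body


def safe_build_output(title, body):
    # Replace a missing (None) piece with an empty string before joining.
    return (title or "") + (body or "")


try:
    build_output("Heading\n", None)
except TypeError as exc:
    message = str(exc)
    print(message)

print(repr(safe_build_output("Heading\n", None)))
```

A guard like safe_build_output avoids the exception, though it hides the fact that a value was missing, so logging the None case may be preferable in practice.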

Reproduction Code:

from crawl4ai import WebCrawler


def fetch_with_crawl(url):
    # Create an instance of WebCrawler
    crawler = WebCrawler()

    # Warm up the crawler (load necessary models)
    crawler.warmup()

    # Run the crawler on a URL
    result = crawler.run(url=url)

    # Return the extracted content as Markdown
    # print(result.markdown)
    return result.markdown

Let me know if you'd like me to add anything else to the issue!

unclecode commented 2 months ago

@nagendrakumar02 Currently, we do not support PDF, but it is on our backlog and will be available soon. Thank you for using the library.
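Until PDF support lands, one possible workaround is to detect PDF URLs before handing them to the crawler and route them elsewhere. The following is a rough sketch using only the standard library. The function name looks_like_pdf and its optional content_type parameter are hypothetical and not part of crawl4ai.

```python
from urllib.parse import urlparse


def looks_like_pdf(url, content_type=None):
    """Return True if the URL path or a known Content-Type suggests a PDF."""
    path = urlparse(url).path.lower()
    if path.endswith(".pdf"):
        return True
    # content_type is optional, e.g. the Content-Type header of a response.
    return bool(content_type) and content_type.lower().startswith("application/pdf")


url = (
    "https://www.myelectric.coop/wp-content/uploads/"
    "Electric-Vehicle-Charging-Equipment-Rebates.pdf"
)
print(looks_like_pdf(url))
```

The suffix check is only a heuristic: some servers deliver PDFs from URLs without a .pdf extension, which is why the sketch also accepts a Content-Type value when one is available.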

NicoNicoNico123 commented 2 weeks ago

I have found that some websites return "error: can only concatenate str (not "NoneType") to str", and I have no clue why. The page works in a browser. Here is the URL I tested: 'https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/'
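When a crawl fails this way, it can help to read the result defensively so that a missing field does not cause a second error in calling code. The sketch below assumes the result object exposes success, markdown, and error_message attributes. That is an assumption about the result shape, so the code uses getattr with defaults, and a SimpleNamespace stands in for a real result. The helper name markdown_or_error is hypothetical.

```python
from types import SimpleNamespace


def markdown_or_error(result):
    """Return (markdown, error) without assuming either field is set."""
    # Assumption: the result has `success`, `markdown`, and `error_message`.
    success = getattr(result, "success", False)
    markdown = getattr(result, "markdown", None)
    if success and markdown:
        return markdown, None
    error = getattr(result, "error_message", None) or "no content returned"
    return "", error


# Stand-in objects for illustration only; not real crawl results.
failed = SimpleNamespace(success=False, markdown=None, error_message="crawl failed")
ok = SimpleNamespace(success=True, markdown="# Title", error_message=None)

print(markdown_or_error(failed))
print(markdown_or_error(ok))
```

Returning an empty string together with an error description keeps downstream string handling safe while still surfacing that the crawl produced nothing.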