langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Bug: FireCrawlLoader - Got exception due to failed crawl job but it was indeed a success #27063

Open bytrangle opened 2 weeks ago

bytrangle commented 2 weeks ago

Example Code

from langchain_community.document_loaders import FireCrawlLoader
loader = FireCrawlLoader(
  url="https://firecrawl.dev",
  mode="crawl",
)
docs = loader.load()
docs[0]

I have set the FIRECRAWL_API_KEY environment variable.

Error Message and Stack Trace (if applicable)

Traceback (most recent call last):
  File "/home/thoa/Documents/dev/demos/firecrawl/chat-with-website.py", line 23, in <module>
    for doc in docs_lazy:
  File "/home/thoa/.local/lib/python3.10/site-packages/langchain_community/document_loaders/firecrawl.py", line 112, in lazy_load
    firecrawl_docs = self.firecrawl.crawl_url(self.url, params=self.params)
  File "/home/thoa/.local/lib/python3.10/site-packages/firecrawl/firecrawl.py", line 133, in crawl_url
    return self._monitor_job_status(id, headers, poll_interval)
  File "/home/thoa/.local/lib/python3.10/site-packages/firecrawl/firecrawl.py", line 360, in _monitor_job_status
    raise Exception(f'Crawl job failed or was stopped. Status: {status_data["status"]}')
Exception: Crawl job failed or was stopped. Status: failed

Description

I'm trying to use FireCrawlLoader to crawl a website. I expect printed output like:

Document(metadata={'ogUrl': 'https://www.firecrawl.dev/', 'title': 'Home - Firecrawl', 'robots': 'follow, index', 'ogImage': 'https://www.firecrawl.dev/og.png?123', 'ogTitle': 'Firecrawl', 'sitemap': {'lastmod': '2024-08-12T00:28:16.681Z', 'changefreq': 'weekly'}, 'keywords': 'Firecrawl,Markdown,Data,Mendable,Langchain', 'sourceURL': 'https://www.firecrawl.dev/', 'ogSiteName': 'Firecrawl', 'description': 'Firecrawl crawls and converts any website into clean markdown.' ...)

Instead, I got an error saying that the crawl job failed or was stopped, but when I checked the Activity Logs in FireCrawl, the crawl had succeeded.

The error can be traced to the function _monitor_job_status in FireCrawl's Python SDK. I'm not sure whether the bug is in LangChain's FireCrawl integration or in FireCrawl's Python SDK.
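To help narrow down which layer raises the error, here is a minimal sketch of a wrapper that iterates lazy_load() and keeps the exception for inspection instead of letting it end the script. The helper name load_with_diagnostics is hypothetical and not part of LangChain or the FireCrawl SDK; it only assumes the loader exposes lazy_load(), as the traceback above shows.

```python
from typing import Any, List, Optional, Tuple


def load_with_diagnostics(loader: Any) -> Tuple[List[Any], Optional[Exception]]:
    """Collect documents from loader.lazy_load() and capture any exception.

    Hypothetical helper: `loader` is any object with a lazy_load() method,
    such as the FireCrawlLoader instance from the example code.
    """
    docs: List[Any] = []
    try:
        # Iterate lazily so documents yielded before a failure are kept.
        for doc in loader.lazy_load():
            docs.append(doc)
    except Exception as exc:
        # The traceback above shows the SDK raising a bare Exception,
        # so a narrower exception type cannot be caught here.
        return docs, exc
    return docs, None
```

Calling docs, error = load_with_diagnostics(loader) and printing error shows the SDK's status message, which can then be compared against the Activity Logs in FireCrawl.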

System Info

System Information

OS: Linux
OS Version: #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022
Python Version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]

Package Information

langchain_core: 0.3.6
langchain: 0.3.1
langchain_community: 0.3.1
langsmith: 0.1.129
langchain_text_splitters: 0.3.0

Optional packages not installed

langgraph
langserve

Other Dependencies

aiohttp: 3.10.6
async-timeout: 4.0.3
dataclasses-json: 0.6.7
httpx: 0.27.2
jsonpatch: 1.33
numpy: 1.26.4
orjson: 3.10.7
packaging: 24.1
pydantic: 2.9.2
pydantic-settings: 2.5.2
PyYAML: 5.4.1
requests: 2.25.1
SQLAlchemy: 2.0.35
tenacity: 8.5.0
typing-extensions: 4.12.2

FarhanChowdhury248 commented 7 hours ago

@bytrangle I believe this issue is due to https://github.com/mendableai/firecrawl/issues/720. You should be able to resolve it by using the workaround provided in the discussion or updating to include the associated PR fix. The former works in my case.