Open cxycxm opened 4 days ago
To resolve the issue where firecrawl_reader cannot read web page content correctly due to a Baidu security verification page, you can use the params argument to pass specific parameters to the Firecrawl API that might help bypass the security verification. For example, you can set user-agent headers or other options that mimic a real browser.
Here is an example of how you can modify the params to include a user-agent:
from llama_index.readers.web.firecrawl_web.base import FireCrawlWebReader

# Initialize the FireCrawlWebReader with your API key and custom parameters
reader = FireCrawlWebReader(
    api_key="your_api_key",
    mode="scrape",  # or "crawl" depending on your needs
    params={
        "headers": {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
        }
    },
)

# Load data from the URL
documents = reader.load_data(url="https://baijiahao.baidu.com/s?id=1802435789639076917&wfr=spider&for=pc")
print(documents)
This example sets a custom User-Agent header to mimic a real browser, which might help in bypassing the security verification page. Adjust the params as needed based on the specific requirements of the Baidu security verification [1].
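Since a custom header may or may not get past the check, it can help to confirm whether the text that came back is the article or the verification page. Below is a minimal sketch of such a check. The marker phrases and the helper name looks_like_verification_page are assumptions for illustration, not taken from Baidu or Firecrawl documentation, so adjust them to whatever the blocked page actually contains.

```python
from typing import Iterable

# Assumed marker phrases for a verification page (hypothetical; adjust to
# whatever text the blocked page actually returns).
ASSUMED_MARKERS = ("安全验证", "security verification")


def looks_like_verification_page(
    text: str, markers: Iterable[str] = ASSUMED_MARKERS
) -> bool:
    """Return True if any marker phrase appears in the scraped text."""
    lowered = text.lower()
    return any(marker.lower() in lowered for marker in markers)
```

Running this over the text of each returned document makes it easier to tell a blocked scrape apart from a reader bug.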
To continue talking to Dosu, mention @dosu.
documents.append(
    Document(
        page_content=firecrawl_docs.get("markdown", ""),
        metadata=firecrawl_docs.get("metadata", {}),
    )
)
Should it be changed to the following so that it can be used normally?
documents.append(
    Document(
        text=firecrawl_docs.get("markdown", ""),
        metadata=firecrawl_docs.get("metadata", {}),
    )
)
Yes, you should change the page_content parameter to text in the Document constructor. The LlamaIndex Document stores its content in the text attribute, so passing page_content does not populate the document content.
Here is the corrected code:
documents.append(
    Document(
        text=firecrawl_docs.get("markdown", ""),
        metadata=firecrawl_docs.get("metadata", {}),
    )
)
This change should resolve any issues and ensure that the documents are appended correctly [1][2].
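To illustrate the mapping on its own, here is a minimal sketch that turns one Firecrawl-style result dict into a document. The Document class below is a hypothetical stand-in so the snippet runs without llama_index installed, and to_document is an illustrative helper name, not part of the reader.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Document:
    """Hypothetical stand-in for the LlamaIndex Document (content lives in `text`)."""

    text: str = ""
    metadata: Dict[str, Any] = field(default_factory=dict)


def to_document(firecrawl_docs: Dict[str, Any]) -> Document:
    """Map a Firecrawl-style result dict onto a Document using the `text` field."""
    return Document(
        text=firecrawl_docs.get("markdown", ""),
        metadata=firecrawl_docs.get("metadata", {}),
    )


# Sample input; the keys mirror the snippet above, the values are made up.
sample = {"markdown": "# Title", "metadata": {"source": "https://example.com"}}
doc = to_document(sample)
```

With the real library, the only change is to import Document from llama_index instead of defining the stand-in.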
Question
firecrawl_reader cannot be used normally and cannot read the web page content correctly. Example:
firecrawl_reader = FireCrawlWebReader(
    api_key="api_key",  # Replace with your actual API key from https://www.firecrawl.dev/
    mode="scrape",  # Choose between "crawl" and "scrape" for single page scraping
    params={"additional": "parameters"},  # Optional additional parameters
)
# Load documents from a single page URL
documents = firecrawl_reader.load_data(url="https://baijiahao.baidu.com/s?id=1802435789639076917&wfr=spider&for=pc")
print(documents)