run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: firecrawl_reader error #14349

Open cxycxm opened 4 days ago

cxycxm commented 4 days ago

Question

firecrawl_reader does not work as expected: it fails to read the web page content correctly. Example:

firecrawl_reader = FireCrawlWebReader(
    api_key="api_key",  # Replace with your actual API key from https://www.firecrawl.dev/
    mode="scrape",  # Choose between "crawl" and "scrape" for single page scraping
    params={"additional": "parameters"},  # Optional additional parameters
)

# Load documents from a single page URL
documents = firecrawl_reader.load_data(url="https://baijiahao.baidu.com/s?id=1802435789639076917&wfr=spider&for=pc")
print(documents)

dosubot[bot] commented 4 days ago

To resolve the issue where firecrawl_reader cannot read web page content correctly due to a Baidu security verification page, you can use the params argument to pass specific parameters to the Firecrawl API that might help bypass the security verification. For example, you can set user-agent headers or other options that mimic a real browser.

Here is an example of how you can modify the params to include a user-agent:

from llama_index.readers.web.firecrawl_web.base import FireCrawlWebReader

# Initialize the FireCrawlWebReader with your API key and custom parameters
reader = FireCrawlWebReader(
    api_key="your_api_key",
    mode="scrape",  # or "crawl" depending on your needs
    params={
        "headers": {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
        }
    }
)

# Load data from the URL
documents = reader.load_data(url="https://baijiahao.baidu.com/s?id=1802435789639076917&wfr=spider&for=pc")
print(documents)

This example sets a custom User-Agent header to mimic a real browser, which may help get past the security verification page. Adjust the params as needed for the specific requirements of the Baidu security verification [1].
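
Before tuning request parameters, it can help to confirm what the reader actually returned. The following is a minimal sketch, not part of FireCrawlWebReader: the helper name summarize_documents is hypothetical, and it assumes each returned document exposes a text attribute, as LlamaIndex Document objects do. Stand-in objects replace a real load_data call so the snippet runs without an API key.

```python
from types import SimpleNamespace


def summarize_documents(documents, preview_chars=80):
    """Return (index, text_length, preview) for each document.

    Hypothetical helper: assumes each document has a `.text` attribute.
    """
    summary = []
    for i, doc in enumerate(documents):
        # Fall back to an empty string if `.text` is missing or None.
        text = getattr(doc, "text", "") or ""
        summary.append((i, len(text), text[:preview_chars]))
    return summary


# Stand-in objects (assumption) in place of reader.load_data(...) results.
fake_documents = [
    SimpleNamespace(text="# Title\n\nBody of the article."),
    SimpleNamespace(text=""),  # an empty document signals a failed read
]

for index, length, preview in summarize_documents(fake_documents):
    print(index, length, repr(preview))
```

A document with a text length of zero suggests the content never reached the Document object, which is a different problem from a blocked request.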

To continue talking to Dosu, mention @dosu.

cxycxm commented 4 days ago

documents.append(
    Document(
        page_content=firecrawl_docs.get("markdown", ""),
        metadata=firecrawl_docs.get("metadata", {}),
    )
)

Should it be changed to the following so that it can be used normally?

documents.append(
    Document(
        text=firecrawl_docs.get("markdown", ""),
        metadata=firecrawl_docs.get("metadata", {}),
    )
)

dosubot[bot] commented 4 days ago

Yes. Change the page_content parameter to text in the Document constructor. The LlamaIndex Document class stores its content in the text field; page_content is the field name used by LangChain's Document class, so the scraped markdown passed under that name does not end up in the document text.

Here is the corrected code:

documents.append(
    Document(
        text=firecrawl_docs.get("markdown", ""),
        metadata=firecrawl_docs.get("metadata", {}),
    )
)

With this change, the scraped markdown is stored in each document's text field and the documents can be used normally [1][2].
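
The mapping from a Firecrawl response to the Document arguments can also be sketched on its own. This is an illustration under assumptions, not the reader's actual implementation: firecrawl_to_document_kwargs is a hypothetical helper, and the sample dictionary is made-up data shaped like the markdown and metadata keys used above.

```python
def firecrawl_to_document_kwargs(firecrawl_docs):
    """Build keyword arguments for a LlamaIndex Document (hypothetical helper).

    The content goes under `text`, not `page_content`.
    """
    return {
        "text": firecrawl_docs.get("markdown", ""),
        "metadata": firecrawl_docs.get("metadata", {}),
    }


# Made-up response data for illustration only.
sample_response = {
    "markdown": "# Hello",
    "metadata": {"source": "https://example.com"},
}

kwargs = firecrawl_to_document_kwargs(sample_response)
print(kwargs)
```

With llama_index installed, the result could be passed on as Document(**kwargs), using Document imported from llama_index.core.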
