mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
https://firecrawl.dev
GNU Affero General Public License v3.0

No data returned for some JavaScript-rendered websites. #543

Open Z-ZHHH opened 2 months ago

Z-ZHHH commented 2 months ago

Could someone help with this? I called the v0 crawl API from Python (using requests) as follows:

import requests

url = "http://localhost:3002/v0/crawl"

payload = {
    "url": "https://news.qq.com/ch/world",
    "crawlerOptions": {
        "generateImgAltText": False,
        "returnOnlyUrls": False,
        "maxDepth": 3,
        "mode": "default",
        "limit": 999,
        "allowBackwardCrawling": True,
        "allowExternalContentLinks": True
    },
    "pageOptions": {
        "headers": {},
        "includeHtml": True,
        "includeRawHtml": True,
        "replaceAllPathsWithAbsolutePaths": True,
        "waitFor": 300  # milliseconds
    }
}
headers = {
    "Authorization": "Bearer <token>",
    "Content-Type": "application/json"
}

# Submit the crawl job.
response = requests.post(url, json=payload, headers=headers)
print(response.text)

jobId = response.json()["jobId"]

# Check the job status once, right after submission.
status_url = f"http://localhost:3002/v0/crawl/status/{jobId}"

response = requests.get(status_url, headers=headers)
print(response.text)

It returns

{"jobId":"e87c4cf6-fd4a-4bb0-88c2-501de2d160c3"}
{"status":"completed","current":1,"total":1,"data":[],"partial_data":[]}

But when I tried another page, such as https://new.qq.com/rain/a/20240814A086V800, the HTML was returned correctly. Did I miss a setting? Thanks a lot for your help.
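
The script above checks the job status only once, immediately after submitting the crawl. In this thread the job already reports "completed", so polling would probably not change the empty result, but in general a single early check can race the crawl. A minimal polling sketch follows. It assumes that "failed" is a terminal status alongside "completed" (only "completed" appears in the output above), and `wait_for_crawl` and `fetch_status` are hypothetical names, not part of Firecrawl.

```python
import time
from typing import Callable

# Assumed terminal states; only "completed" is confirmed by the output above.
TERMINAL_STATES = {"completed", "failed"}


def wait_for_crawl(
    fetch_status: Callable[[], dict],
    interval: float = 2.0,
    timeout: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status() until the job is terminal or the timeout elapses."""
    waited = 0.0
    while True:
        body = fetch_status()
        if body.get("status") in TERMINAL_STATES:
            return body
        if waited >= timeout:
            raise TimeoutError(
                f"crawl still {body.get('status')!r} after {waited:.0f}s"
            )
        sleep(interval)
        waited += interval
```

With the script above, `fetch_status` could be something like `lambda: requests.get(status_url, headers=headers).json()`, where `status_url` is the status endpoint built from the job ID. Passing the fetcher in as a callable keeps the loop independent of the HTTP client and easy to exercise without a running server.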

rafaelsideguide commented 2 months ago

@Z-ZHHH can you try a higher waitFor, like 5000? JavaScript-rendered pages usually take more than 3 seconds to load.
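
One way to act on this suggestion is to repeat the crawl with increasing waitFor values and stop at the first one that yields data. The sketch below is only an illustration: `build_payload` and `first_wait_with_data` are hypothetical helpers, `run_crawl` stands for whatever code submits the payload and returns the job's `data` list, and the payload keeps just a few of the keys from the original request.

```python
from typing import Callable, Iterable, Optional, Tuple


def build_payload(url: str, wait_for_ms: int) -> dict:
    """Build a reduced crawl payload; waitFor is given in milliseconds."""
    return {
        "url": url,
        "crawlerOptions": {"limit": 1},
        "pageOptions": {"includeHtml": True, "waitFor": wait_for_ms},
    }


def first_wait_with_data(
    run_crawl: Callable[[dict], list],
    url: str,
    waits_ms: Iterable[int] = (300, 5000, 15000),
) -> Optional[Tuple[int, list]]:
    """Return (waitFor, data) for the first wait that yields data, else None."""
    for wait_ms in waits_ms:
        data = run_crawl(build_payload(url, wait_ms))
        if data:  # a non-empty list means the page produced content
            return wait_ms, data
    return None
```

If every value returns an empty list, the render delay is probably not the cause and something else is keeping the page from being scraped.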

Z-ZHHH commented 2 months ago

Thanks for your quick reply. I changed waitFor to 15000 ms, but it still returns:

{'status': 'completed', 'current': 1, 'total': 1, 'data': [], 'partial_data': []}

Could this be because the site https://news.qq.com/ch/world is blocking the crawler?
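
One check that can be made independently of Firecrawl is whether the site's robots.txt disallows the path. The sketch below uses Python's standard `urllib.robotparser`. Whether Firecrawl consults robots.txt for this request, and under which user agent, is an assumption here; the robots.txt text and the agent name "ExampleCrawler" are made up for illustration and are not the real rules of news.qq.com.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, NOT the site's actual robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /ch/
"""

blocked = is_allowed(SAMPLE_ROBOTS, "ExampleCrawler", "https://news.qq.com/ch/world")
allowed = is_allowed(SAMPLE_ROBOTS, "ExampleCrawler", "https://new.qq.com/rain/a/20240814A086V800")
print(blocked, allowed)
```

Against the live site, calling `set_url()` with the robots.txt address and then `read()` on the parser fetches the real file instead of a sample string. A permissive robots.txt does not rule out other forms of blocking, such as bot detection on the server.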