mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
https://firecrawl.dev

[BUG] The pageStatusCode is 200, but no content is returned #385

Closed dgedanke closed 2 months ago

dgedanke commented 3 months ago

Describe the Bug

I'm testing an internal website that probably uses AJAX to load its content. The pageStatusCode is 200, but no content is returned.

To Reproduce

This is body:

{
      "url": "http://127.0.0.1:16066/#/category/index",
      "pageOptions": {
        "onlyMainContent": true,
        "includeHtml": true,
        "includeRawHtml": true,
        "screenshot": true,
        "waitFor": 5000
      },
      "crawlerOptions": {
        "includes": ["/blog/*", "/products/*"],
        "maxDepth": 2,
        "mode": "fast"
      }
}
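
For reference, this body is POSTed to the self-hosted scrape endpoint roughly like this (a sketch assuming the default self-hosted port 3002 and the v0 API; adjust host, port, and auth for your deployment):

$ curl -X POST http://localhost:3002/v0/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <api-key>' \
    -d @body.json   # body.json contains the JSON shown above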

Expected Behavior

I expected the scraped content to be returned. Instead, here is the actual result:

{
    "success": true,
    "error": "No page found",
    "returnCode": 200,
    "data": {
        "content": "",
        "markdown": "",
        "html": "",
        "metadata": {
            "sourceURL": "http://127.0.0.1:16066/#/category/index",
            "pageStatusCode": 200
        }
    }
}

Environment:

I tested many sites and got results on most of them. However, some websites that likely use AJAX behave as shown above: the status code is 200, but there is no content, the error is "No page found", and not even the raw HTML is returned.
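
To illustrate what I think is happening: a plain HTTP fetch of an AJAX-rendered page returns only the empty app shell, so there is nothing for the HTML-to-markdown step to extract. A minimal standalone sketch (Node 18+ with built-in fetch; a hypothetical check script, not Firecrawl code):

// check-shell.mjs — fetch the raw HTML the way a non-browser scraper would
const res = await fetch('http://127.0.0.1:16066/'); // the '#/...' fragment is never sent to the server anyway
const html = await res.text();

// For an SPA the body is essentially just <div id=app></div>;
// the visible content is injected later by JavaScript, which fetch never runs.
console.log('app shell only:', html.includes('id=app'));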

rafaelsideguide commented 3 months ago

@dgedanke are you using Docker? If so, I think the problem might be that there's no Docker network connecting your webpage service to the Firecrawl containers.
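
Something like this in your compose file (service and network names here are just placeholders):

services:
  api:
    # ... Firecrawl's api service config ...
    networks:
      - scrape-net
  my-webpage:
    # the internal site you're trying to scrape
    networks:
      - scrape-net

networks:
  scrape-net:

With a shared network the containers can reach each other by service name (e.g. http://my-webpage:16066) instead of 127.0.0.1, which inside a container points at the container itself.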

dgedanke commented 3 months ago

Thank you for your response; I will provide more details!

I'm running Firecrawl's Docker containers.

Here is the API container's log:

$ docker logs firecrawl-api-1

Output:

// ...
Error logging proxy:
 Error: Supabase client is not configured.
    at Proxy.<anonymous> (/app/dist/src/services/supabase.js:38:23)
    at logScrape (/app/dist/src/services/logging/scrape_log.js:18:67)
    at scrapWithFetch (/app/dist/src/scraper/WebScraper/scrapers/fetch.js:76:42)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async attemptScraping (/app/dist/src/scraper/WebScraper/single_url.js:168:34)
    at async scrapSingleUrl (/app/dist/src/scraper/WebScraper/single_url.js:242:29)
    at async /app/dist/src/scraper/WebScraper/index.js:65:32
    at async Promise.all (index 0)
    at async WebScraperDataProvider.convertUrlsToDocuments (/app/dist/src/scraper/WebScraper/index.js:63:13)
    at async WebScraperDataProvider.processLinks (/app/dist/src/scraper/WebScraper/index.js:200:25)
    at async WebScraperDataProvider.handleSingleUrlsMode (/app/dist/src/scraper/WebScraper/index.js:168:25)
    at async scrapeHelper (/app/dist/src/controllers/scrape.js:34:16)
    at async scrapeController (/app/dist/src/controllers/scrape.js:96:24)

Error: Error: All scraping methods failed for URL: http://10.10.10.82:16066/#/category/index - Failed to fetch URL: http://10.10.10.82:16066/#/category/index

Now, enter the container:

$ docker exec -it firecrawl-api-1 /usr/bin/bash
root@1a512497dad6:/app# curl -X GET http://10.10.10.82:16066/#/index

Output:

<!DOCTYPE html>
<html>

<head>
    <meta charset=utf-8>
    <meta http-equiv=Cache-Control content="no-cache, no-store, must-revalidate">
    <meta http-equiv=Pragma content=no-cache>
    <meta http-equiv=Expires content=0>
    <meta http-equiv=X-UA-Compatible content="IE=edge,chrome=1">
    <meta name=viewport content="width=device-width,initial-scale=1,maximum-scale=1,user-scalable=no">
    <link rel=icon href=favicon.ico>
    <title>title</title>
    <link rel=stylesheet href=font/iconfont.css>
    <link href=static/css/chunk-elementUI.bad1c2fe.css rel=stylesheet>
    <link href=static/css/chunk-libs.faed0c3c.css rel=stylesheet>
    <link href=static/css/app.09f82ccb.css rel=stylesheet>
</head>

<body><noscript><strong>We're sorry but title doesn't work properly without JavaScript enabled. Please enable it to
            continue.</strong></noscript>
    <div id=app></div>
    <script src=static/js/chunk-elementUI.ea028b12.js></script>
    <script src=static/js/chunk-libs.d4413573.js></script>
    <script>
        // js code
    </script>
    <script src=static/js/app.0f34af00.js></script>
</body>

</html>

This is an internal site, but from inside firecrawl-api-1 I was able to reach it with curl, even though the response contains no meaningful content (just the empty SPA shell).

If I scrape a public website such as https://www.tripadvisor.com/TravelersChoice via Postman, this is the response:

{
    "success": true,
    "error": "No page found",
    "returnCode": 200,
    "data": {
        "content": "",
        "markdown": "",
        "html": "",
        "metadata": {
            "sourceURL": "https://www.tripadvisor.com/TravelersChoice",
            "pageStatusCode": 401,
            "pageError": "UNAUTHORIZED"
        }
    }
}

Also, inside firecrawl-api-1:

root@1a512497dad6:/app# curl -X GET https://www.tripadvisor.com/TravelersChoice

Output:

<html>

<head>
    <title>tripadvisor.com</title>
    <style>
        #cmsg {
            animation: A 1.5s;
        }

        @keyframes A {
            0% {
                opacity: 0;
            }

            99% {
                opacity: 0;
            }

            100% {
                opacity: 1;
            }
        }
    </style>
</head>

<body style="margin:0">
    <p id="cmsg">Please enable JS and disable any ad blocker</p>
    <script
        data-cfasync="false">var dd = { 'rt': 'c', 'cid': 'AHrlqAAAAAMAAWals3QFyJoA2kt4zg==', 'hsh': '2F05D671381DB06BEE4CC52C7A6FD3', 't': 'fe', 's': 46694, 'e': '39dd036d7d021d22e968704232f14435f3733df428f456a0e2c2272a20cafc33', 'host': 'geo.captcha-delivery.com' }</script>
    <script data-cfasync="false" src="https://ct.captcha-delivery.com/c.js"></script>
</body>

</html>

Its logs:

$ docker logs firecrawl-api-1
// ...
Error logging proxy:
 Error: Supabase client is not configured.
    at Proxy.<anonymous> (/app/dist/src/services/supabase.js:38:23)
    at logScrape (/app/dist/src/services/logging/scrape_log.js:18:67)
    at scrapWithFetch (/app/dist/src/scraper/WebScraper/scrapers/fetch.js:76:42)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async attemptScraping (/app/dist/src/scraper/WebScraper/single_url.js:168:34)
    at async scrapSingleUrl (/app/dist/src/scraper/WebScraper/single_url.js:242:29)
    at async /app/dist/src/scraper/WebScraper/index.js:65:32
    at async Promise.all (index 0)
    at async WebScraperDataProvider.convertUrlsToDocuments (/app/dist/src/scraper/WebScraper/index.js:63:13)
    at async WebScraperDataProvider.processLinks (/app/dist/src/scraper/WebScraper/index.js:200:25)
    at async WebScraperDataProvider.handleSingleUrlsMode (/app/dist/src/scraper/WebScraper/index.js:168:25)
    at async scrapeHelper (/app/dist/src/controllers/scrape.js:34:16)
    at async scrapeController (/app/dist/src/controllers/scrape.js:96:24)
Error: Error: All scraping methods failed for URL: https://www.tripadvisor.com/TravelersChoice - Failed to fetch URL: https://www.tripadvisor.com/TravelersChoice

They fail with the same error. In fact, if I use Playwright (which Firecrawl also integrates), I can get the content of the web page, which is different from the result obtained by curl.
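
For example, a minimal Playwright check like this (a standalone sketch, not Firecrawl's internal scraper code) does return the rendered content:

// playwright-check.mjs — render the page in a real browser, unlike curl/fetch
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();
// wait until the network is idle so the AJAX-loaded content has arrived
await page.goto('http://10.10.10.82:16066/#/category/index', { waitUntil: 'networkidle' });
const html = await page.content(); // the rendered DOM, not the empty app shell
console.log(html.length);
await browser.close();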

In general, your framework is very good and has helped me a lot! What puzzles me is that for these sites, even when their content is not easily accessible, Firecrawl should return at least some of the page source.

Wizmak9 commented 3 months ago

Firecrawl can scrape those sites when using the hosted web version, but it fails when self-hosted with Docker. Please provide a solution for this.

rafaelsideguide commented 2 months ago

@Wizmak9 I implemented a fix yesterday that should solve this bug.

Wizmak9 commented 2 months ago

@rafaelsideguide thanks a lot, buddy, let me try the fix. Can you share the fix PR so I can pull it accordingly?

rafaelsideguide commented 2 months ago

@Wizmak9 the fix is already in main.

rafaelsideguide commented 2 months ago

Closing this as it should be fixed now. @Wizmak9, please let me know if you're still experiencing this error.