Closed: dgedanke closed this issue 2 months ago
@dgedanke are you using docker? If so, I think the problem might be that there's no network wrapping your webpage service with firecrawl containers.
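A minimal docker-compose sketch of that suggestion, assuming a Firecrawl API service and a separate service hosting the internal site (the service names `api` and `webpage` and the network name `backend` are hypothetical, not Firecrawl's actual compose file):

```yaml
# Sketch only: attach both the Firecrawl API container and the target
# webpage service to one user-defined bridge network so the API
# container can reach the internal site by service name or IP.
services:
  api:
    networks:
      - backend
  webpage:
    networks:
      - backend
networks:
  backend:
    driver: bridge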
Thank you for your response and I will provide more details!
I used Firecrawl's Docker containers.
This is its log:
$ docker logs firecrawl-api-1
Output:
// ...
at Proxy.<anonymous> (/app/dist/src/services/supabase.js:38:23)
at logScrape (/app/dist/src/services/logging/scrape_log.js:18:67)
at scrapWithFetch (/app/dist/src/scraper/WebScraper/scrapers/fetch.js:76:42)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async attemptScraping (/app/dist/src/scraper/WebScraper/single_url.js:168:34)
at async scrapSingleUrl (/app/dist/src/scraper/WebScraper/single_url.js:242:29)
at async /app/dist/src/scraper/WebScraper/index.js:65:32
at async Promise.all (index 0)
at async WebScraperDataProvider.convertUrlsToDocuments (/app/dist/src/scraper/WebScraper/index.js:63:13)
at async WebScraperDataProvider.processLinks (/app/dist/src/scraper/WebScraper/index.js:200:25)
at async WebScraperDataProvider.handleSingleUrlsMode (/app/dist/src/scraper/WebScraper/index.js:168:25)
at async scrapeHelper (/app/dist/src/controllers/scrape.js:34:16)
at async scrapeController (/app/dist/src/controllers/scrape.js:96:24)
Error: Error: All scraping methods failed for URL: http://10.10.10.82:16066/#/category/index - Failed to fetch URL: http://10.10.10.82:16066/#/category/index
Now, enter the container:
$ docker exec -it firecrawl-api-1 /usr/bin/bash
root@1a512497dad6:/app# curl -X GET http://10.10.10.82:16066/#/index
Output:
<!DOCTYPE html>
<html>
<head>
<meta charset=utf-8>
<meta http-equiv=Cache-Control content="no-cache, no-store, must-revalidate">
<meta http-equiv=Pragma content=no-cache>
<meta http-equiv=Expires content=0>
<meta http-equiv=X-UA-Compatible content="IE=edge,chrome=1">
<meta name=viewport content="width=device-width,initial-scale=1,maximum-scale=1,user-scalable=no">
<link rel=icon href=favicon.ico>
<title>title</title>
<link rel=stylesheet href=font/iconfont.css>
<link href=static/css/chunk-elementUI.bad1c2fe.css rel=stylesheet>
<link href=static/css/chunk-libs.faed0c3c.css rel=stylesheet>
<link href=static/css/app.09f82ccb.css rel=stylesheet>
</head>
<body><noscript><strong>We're sorry but title doesn't work properly without JavaScript enabled. Please enable it to
continue.</strong></noscript>
<div id=app></div>
<script src=static/js/chunk-elementUI.ea028b12.js></script>
<script src=static/js/chunk-libs.d4413573.js></script>
<script>
// js code
</script>
<script src=static/js/app.0f34af00.js></script>
</body>
</html>
This is an internal site, but from inside firecrawl-api-1 I was able to access it with curl, even though it didn't return any valuable content.
If I visit a public website such as https://www.tripadvisor.com/TravelersChoice (I also used Postman), this is the response:
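One likely reason curl returns no valuable content here: the `#/category/index` part of the URL is a fragment, which browsers keep client-side and never send to the server. Any plain HTTP fetch therefore only requests `/` and gets back the empty SPA shell, leaving the in-page JS router to handle the fragment. A small illustration:

```python
from urllib.parse import urlsplit

url = "http://10.10.10.82:16066/#/category/index"
parts = urlsplit(url)
print(parts.path or "/")   # path actually sent to the server -> "/"
print(parts.fragment)      # handled only by client-side JS -> "/category/index"
```

This is why a non-JS scraper sees the same `<div id=app></div>` shell for every route on such a site.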
{
"success": true,
"error": "No page found",
"returnCode": 200,
"data": {
"content": "",
"markdown": "",
"html": "",
"metadata": {
"sourceURL": "https://www.tripadvisor.com/TravelersChoice",
"pageStatusCode": 401,
"pageError": "UNAUTHORIZED"
}
}
}
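Note that in this response the wrapper reports `"success": true` and `returnCode: 200` while the actual page fetch failed with 401, so a caller has to inspect `data.metadata.pageStatusCode` to see the real outcome. A small sketch (field names taken from the response above; the handling logic is an assumption, not Firecrawl's API contract):

```python
import json

# Parse the response shown above and separate the wrapper status from
# the underlying page-fetch status.
response = json.loads("""{
  "success": true,
  "returnCode": 200,
  "data": {"content": "", "metadata": {"pageStatusCode": 401, "pageError": "UNAUTHORIZED"}}
}""")

page_status = response["data"]["metadata"]["pageStatusCode"]
if response["success"] and page_status >= 400:
    print(f"scrape wrapper succeeded but page fetch failed: {page_status}")
```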
Also, inside firecrawl-api-1:
root@1a512497dad6:/app# curl -X GET https://www.tripadvisor.com/TravelersChoice
Output:
<html>
<head>
<title>tripadvisor.com</title>
<style>
#cmsg {
animation: A 1.5s;
}
@keyframes A {
0% {
opacity: 0;
}
99% {
opacity: 0;
}
100% {
opacity: 1;
}
}
</style>
</head>
<body style="margin:0">
<p id="cmsg">Please enable JS and disable any ad blocker</p>
<script
data-cfasync="false">var dd = { 'rt': 'c', 'cid': 'AHrlqAAAAAMAAWals3QFyJoA2kt4zg==', 'hsh': '2F05D671381DB06BEE4CC52C7A6FD3', 't': 'fe', 's': 46694, 'e': '39dd036d7d021d22e968704232f14435f3733df428f456a0e2c2272a20cafc33', 'host': 'geo.captcha-delivery.com' }</script>
<script data-cfasync="false" src="https://ct.captcha-delivery.com/c.js"></script>
</body>
</html>
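The HTML above is not the real TripAdvisor page but a DataDome anti-bot interstitial (note the `captcha-delivery.com` script), which is why curl gets a challenge page rather than content. A simple heuristic a caller could apply to distinguish "blocked" from "empty" responses (the marker strings are taken from the output above; the function itself is an illustrative assumption, not part of Firecrawl):

```python
# Treat a response that loads the DataDome challenge script as
# "blocked by anti-bot protection" rather than "page has no content".
ANTI_BOT_MARKERS = (
    "captcha-delivery.com",
    "Please enable JS and disable any ad blocker",
)

def looks_blocked(html: str) -> bool:
    return any(marker in html for marker in ANTI_BOT_MARKERS)

html = '<script src="https://ct.captcha-delivery.com/c.js"></script>'
print(looks_blocked(html))  # True
```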
Its logs:
$ docker logs firecrawl-api-1
// ...
Error logging proxy:
Error: Supabase client is not configured.
at Proxy.<anonymous> (/app/dist/src/services/supabase.js:38:23)
at logScrape (/app/dist/src/services/logging/scrape_log.js:18:67)
at scrapWithFetch (/app/dist/src/scraper/WebScraper/scrapers/fetch.js:76:42)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async attemptScraping (/app/dist/src/scraper/WebScraper/single_url.js:168:34)
at async scrapSingleUrl (/app/dist/src/scraper/WebScraper/single_url.js:242:29)
at async /app/dist/src/scraper/WebScraper/index.js:65:32
at async Promise.all (index 0)
at async WebScraperDataProvider.convertUrlsToDocuments (/app/dist/src/scraper/WebScraper/index.js:63:13)
at async WebScraperDataProvider.processLinks (/app/dist/src/scraper/WebScraper/index.js:200:25)
at async WebScraperDataProvider.handleSingleUrlsMode (/app/dist/src/scraper/WebScraper/index.js:168:25)
at async scrapeHelper (/app/dist/src/controllers/scrape.js:34:16)
at async scrapeController (/app/dist/src/controllers/scrape.js:96:24)
Error: Error: All scraping methods failed for URL: https://www.tripadvisor.com/TravelersChoice - Failed to fetch URL: https://www.tripadvisor.com/TravelersChoice
They produce the same error. In fact, the error also appears in Firecrawl when I use Playwright; with Playwright, though, I can get the content of the web page, which is different from the result obtained with curl.
In general, your framework is very good and it has helped me a lot! What puzzles me is that when you scrape these sites, even if their content is not easily accessible, Firecrawl should at least return some of the page's source code.
It can scrape those sites when using the Firecrawl web app but fails when using Docker. Please provide a solution for this.
@Wizmak9 I implemented a fix yesterday that should solve this bug.
@rafaelsideguide thanks a lot, buddy, let me try the fix. Can you share the fix PR with me so I can pull accordingly?
@Wizmak9 the fix is already in main.
Closing this as it should be fixed now. @Wizmak9, please let me know if you're still experiencing this error.
Describe the Bug
I'm testing an internal website that probably uses AJAX to load the page. The pageStatusCode is 200, but no content is returned.
To Reproduce
This is the body:
Expected Behavior
Here is the result:
Environment:
I tested a lot of sites and got results on most of them. However, some websites that might use AJAX show the same situation as above: the status code is 200, but there is no content, just "no page found", and not even HTML is returned.
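The "status 200 but no content" case can be recognized up front: the server-rendered body of such an AJAX/SPA site is just an empty mount node plus scripts. A hypothetical check (a regex-based heuristic, not something Firecrawl provides) that flags this kind of shell page so a scraper knows it needs a JS-rendering fallback:

```python
import re

def is_empty_spa_shell(html: str) -> bool:
    """Return True when the <body> contains no visible text content,
    i.e. only scripts, noscript warnings, and empty mount nodes."""
    body = re.search(r"<body[^>]*>(.*?)</body>", html, re.S)
    if not body:
        return True
    # Strip script/noscript blocks and remaining tags; whatever is
    # left is real, server-rendered text content.
    text = re.sub(r"<(script|noscript)[^>]*>.*?</\1>", "", body.group(1), flags=re.S)
    text = re.sub(r"<[^>]+>", "", text)
    return not text.strip()

shell = ("<body><noscript><strong>enable JS</strong></noscript>"
         "<div id=app></div><script>/* js */</script></body>")
print(is_empty_spa_shell(shell))  # True
```

When this returns True, falling back to a headless-browser fetch (as the Playwright comparison above suggests) is the only way to get the page's real content.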