Closed · SalamanderXing closed this 4 months ago
Hey @SalamanderXing, thanks for opening the issue. We're investigating!
Hey @SalamanderXing It looks like the playground works well because it uses multiple services and scraping methods with our API. On your setup, you might be using just one method, which is why it’s not pulling up the page.
You can use the Firecrawl API key on your self-hosted setup too. We have a free plan that gives you 500 credits. This should help you get the same results as the playground.
@SalamanderXing have you solved the issue? I have the same problem with my self-hosted setup: some jobs get stuck in the "Active" queue forever, and I have no idea how to delete them.
@beydogan I did not. @rafaelsideguide Can you explain your suggested solution in more detail? I was using the API key connected to my local Firecrawl. But why would I need credits for a local setup?
@SalamanderXing The playground works well because it uses a mix of services and scraping methods, not just one. This setup helps bypass blockers on websites effectively. For your self-hosted setup, using the Firecrawl API key might help, as it includes access to these multiple services.
You can check out the `getScrapingFallbackOrder` function in `single_url.ts` under `apps/api/src/scraper/WebScraper` for more details on how this works.
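To illustrate the idea, here is a minimal sketch of a scraping-fallback chain, loosely modeled on the behavior described above. The provider names, env-var checks, and the function shape are assumptions for illustration, not the actual Firecrawl source:

```typescript
// Hypothetical sketch of a scraping-fallback order: paid or external services
// are only included when their configuration is present, and plain `fetch`
// always remains as the last resort. Names are illustrative assumptions.
type Scraper = "scrapingBee" | "playwright" | "fetch";

function getFallbackOrder(env: Record<string, string | undefined>): Scraper[] {
  const order: Scraper[] = [];
  // A paid scraping service is only tried when its API key is configured.
  if (env.SCRAPING_BEE_API_KEY) order.push("scrapingBee");
  // A headless-browser microservice, if its URL is set.
  if (env.PLAYWRIGHT_MICROSERVICE_URL) order.push("playwright");
  // Plain HTTP fetch is always available as the final fallback.
  order.push("fetch");
  return order;
}

// On a bare self-hosted setup with no keys configured, only `fetch` remains,
// which would explain why hard-to-scrape pages never succeed locally:
console.log(getFallbackOrder({}));
```

Under this model, the playground (with all keys configured) walks through several scrapers before giving up, while a key-less self-hosted instance has only one method to try.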
Rafael, please don't get me wrong: I appreciate the open-source, self-hosted version of the project, it's super useful.
Why would someone need to self-host if they are going to pay for the API anyway?
I see, thank you for your reply @rafaelsideguide. Correct me if I'm wrong: the only service among these that is not self-hostable is ScrapingBee, right? Is that the only service unavailable when I run Firecrawl locally?
@beydogan That might also answer your question: someone needs to pay ScrapingBee.
I was trying to scrape some websites; Firecrawl's online playground succeeds, but locally the scraping job stays stuck and never finishes.
To Reproduce
Steps to reproduce the issue:
- `.env` as shown in that page (no authentication)
- Scrapers in order: `fetch`
The job stays stuck forever. Redis log:

```
84536:M 03 Jun 2024 12:26:53.097 # WARNING: The TCP backlog setting of 511 cannot be enforced because kern.ipc.somaxconn is set to the lower value of 128.
84536:M 03 Jun 2024 12:26:53.098 * Server initialized
84536:M 03 Jun 2024 12:26:53.098 * Loading RDB produced by version 7.2.5
84536:M 03 Jun 2024 12:26:53.098 * RDB age 3445 seconds
84536:M 03 Jun 2024 12:26:53.098 * RDB memory usage when created 79.33 Mb
84536:M 03 Jun 2024 12:26:53.182 * Done loading RDB, keys loaded: 126, keys expired: 0.
84536:M 03 Jun 2024 12:26:53.182 * DB loaded from disk: 0.084 seconds
84536:M 03 Jun 2024 12:26:53.182 * Ready to accept connections tcp
```
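As an aside, the Redis TCP-backlog warning in that log is a kernel limit unrelated to the stuck jobs, but it can be cleared. A sketch for macOS (where `kern.ipc.somaxconn` applies), assuming admin rights; the chosen value is illustrative:

```shell
# Redis warns its tcp-backlog (511) exceeds the kernel's connection-queue
# limit (128). Raise the limit above Redis's backlog; this change is
# temporary and resets on reboot.
sudo sysctl -w kern.ipc.somaxconn=1024

# Verify the new value:
sysctl kern.ipc.somaxconn
```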
API and worker logs (two nodemon processes):

```
[nodemon] 2.0.22
[nodemon] to restart at any time, enter `rs`
[nodemon] watching path(s): .
[nodemon] watching extensions: ts,json
[nodemon] starting `ts-node src/services/queue-worker.ts`
[nodemon] starting `ts-node src/index.ts`
```
```
LOGTAIL_KEY is not provided - your events will not be logged. Using MockLogtail as a fallback. see logtail.ts for more.
Authentication is disabled. Supabase client will not be initialized.
POSTHOG_API_KEY is not provided - your events will not be logged. Using MockPostHog as a fallback. See posthog.ts for more.
Authentication is disabled. Supabase client will not be initialized.
Web scraper queue created
(node:84588) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
POSTHOG_API_KEY is not provided - your events will not be logged. Using MockPostHog as a fallback. See posthog.ts for more.
Web scraper queue created
(node:84587) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
Server listening on port 3002
For the UI, open http://0.0.0.0:3002/admin//queues
```

Expected Behavior
Should be able to crawl like on the playground.
Environment (please complete the following information):