mendableai / firecrawl

đŸ”¥ Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
https://firecrawl.dev
GNU Affero General Public License v3.0

[BUG] Firecrawl locally fails to scrape some websites #226

Closed SalamanderXing closed 4 months ago

SalamanderXing commented 4 months ago

I was trying to scrape some websites. While Firecrawl's online playground succeeds on them, my local instance gets stuck and the scraping job never finishes.

To Reproduce

Steps to reproduce the issue:

  1. follow https://github.com/mendableai/firecrawl/blob/main/CONTRIBUTING.md
  2. use the default .env shown in that page (no authentication)
  3. These are the parameters I run it with. This website (https://www.campigliodolomiti.it) seems to be problematic, at least for the local version of Firecrawl:

         requests.post(
             "http://localhost:3002/v0/crawl",
             headers={
                 "Content-Type": "application/json",
             },
             json={
                 "url": "https://www.campigliodolomiti.it",
                 "pageOptions": {
                     "onlyMainContent": True,
                     "limit": 10,
                     "includes": None,
                     "excludes": None,
                 },
             },
         )
  4. Soon after starting, Firecrawl keeps printing `Scrapers in order: fetch` forever
  5. Log output/error message
    
    All services started. Press Ctrl+C to stop.
    84536:C 03 Jun 2024 12:26:53.096 * oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
    84536:C 03 Jun 2024 12:26:53.096 * Redis version=7.2.5, bits=64, commit=00000000, modified=0, pid=84536, just started
    84536:C 03 Jun 2024 12:26:53.096 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
    84536:M 03 Jun 2024 12:26:53.096 * monotonic clock: POSIX clock_gettime
                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 7.2.5 (00000000/0) 64 bit
    .-`` .-```.  ```\/    _.,_ ''-._                                  
    (    '      ,       .-`  | `,    )     Running in standalone mode
    |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
    |    `-._   `._    /     _.-'    |     PID: 84536
    `-._    `-._  `-./  _.-'    _.-'                                   
    |`-._`-._    `-.__.-'    _.-'_.-'|                                  
    |    `-._`-._        _.-'_.-'    |           https://redis.io       
    `-._    `-._`-.__.-'_.-'    _.-'                                   
    |`-._`-._    `-.__.-'    _.-'_.-'|                                  
    |    `-._`-._        _.-'_.-'    |                                  
    `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               

    84536:M 03 Jun 2024 12:26:53.097 # WARNING: The TCP backlog setting of 511 cannot be enforced because kern.ipc.somaxconn is set to the lower value of 128.
    84536:M 03 Jun 2024 12:26:53.098 * Server initialized
    84536:M 03 Jun 2024 12:26:53.098 * Loading RDB produced by version 7.2.5
    84536:M 03 Jun 2024 12:26:53.098 * RDB age 3445 seconds
    84536:M 03 Jun 2024 12:26:53.098 * RDB memory usage when created 79.33 Mb
    84536:M 03 Jun 2024 12:26:53.182 * Done loading RDB, keys loaded: 126, keys expired: 0.
    84536:M 03 Jun 2024 12:26:53.182 * DB loaded from disk: 0.084 seconds
    84536:M 03 Jun 2024 12:26:53.182 * Ready to accept connections tcp

    > firecrawl-scraper-js@1.0.0 workers /Users/salamanderxing/Documents/firecrawl/apps/api
    > nodemon --exec ts-node src/services/queue-worker.ts

    > firecrawl-scraper-js@1.0.0 start:dev /Users/salamanderxing/Documents/firecrawl/apps/api
    > nodemon --exec ts-node src/index.ts

    [nodemon] 2.0.22
    [nodemon] 2.0.22
    [nodemon] to restart at any time, enter `rs`
    [nodemon] to restart at any time, enter `rs`
    [nodemon] watching path(s): .
    [nodemon] watching path(s): .
    [nodemon] watching extensions: ts,json
    [nodemon] watching extensions: ts,json
    [nodemon] starting `ts-node src/services/queue-worker.ts`
    [nodemon] starting `ts-node src/index.ts`
    LOGTAIL_KEY is not provided - your events will not be logged. Using MockLogtail as a fallback. see logtail.ts for more.
    Authentication is disabled. Supabase client will not be initialized.
    POSTHOG_API_KEY is not provided - your events will not be logged. Using MockPostHog as a fallback. See posthog.ts for more.
    Authentication is disabled. Supabase client will not be initialized.
    Web scraper queue created
    (node:84588) [DEP0040] DeprecationWarning: The punycode module is deprecated. Please use a userland alternative instead.
    (Use node --trace-deprecation ... to show where the warning was created)
    POSTHOG_API_KEY is not provided - your events will not be logged. Using MockPostHog as a fallback. See posthog.ts for more.
    Web scraper queue created
    (node:84587) [DEP0040] DeprecationWarning: The punycode module is deprecated. Please use a userland alternative instead.
    (Use node --trace-deprecation ... to show where the warning was created)
    Server listening on port 3002
    For the UI, open http://0.0.0.0:3002/admin//queues

  1. Make sure Redis is running (on port 6379 by default)
  2. If you want to run Nango, make sure you do port forwarding on 3002 using `ngrok http 3002`

    limit=10
    WARNING - You're bypassing authentication
    WARNING - You're bypassing authentication
    Attempted to access Supabase client when it's not configured.
    Error logging crawl job: Error: Supabase client is not configured.
        at Proxy.<anonymous> (/Users/salamanderxing/Documents/firecrawl/apps/api/src/services/supabase.ts:45:17)
        at logCrawl (/Users/salamanderxing/Documents/firecrawl/apps/api/src/services/logging/crawl_log.ts:7:8)
        at crawlController (/Users/salamanderxing/Documents/firecrawl/apps/api/src/controllers/crawl.ts:100:19)
        at processTicksAndRejections (node:internal/process/task_queues:95:5) {'jobId': 'ed61448e-e915-4d98-95af-9b68f95ca15e'}
    Started scrape job for https://www.campigliodolomiti.it/ with job_id ed61448e-e915-4d98-95af-9b68f95ca15e
    WARNING - You're bypassing authentication
    (node:84588) [DEP0174] DeprecationWarning: Calling promisify on a function that returns a Promise is likely a mistake.
    WARNING - You're bypassing authentication
    WARNING - You're bypassing authentication
    Scrapers in order: fetch
    Scrapers in order: fetch
    [... "Scrapers in order: fetch" repeats ~40 more times ...]
    84536:M 03 Jun 2024 12:31:54.015 * 100 changes in 300 seconds. Saving...
    84536:M 03 Jun 2024 12:31:54.023 * Background saving started by pid 86499
    86499:C 03 Jun 2024 12:31:54.260 * DB saved on disk
    86499:C 03 Jun 2024 12:31:54.260 * Fork CoW for RDB: current 0 MB, peak 0 MB, average 0 MB
    84536:M 03 Jun 2024 12:31:54.327 * Background saving terminated with success
    Scrapers in order: fetch
    [... repeats indefinitely ...]

Expected Behavior

Should be able to crawl like on the playground.

Environment (please complete the following information):
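The reproduction above can be sketched as a small self-contained script. Two caveats, both my assumptions rather than anything confirmed in this thread: the status endpoint path `/v0/crawl/status/{jobId}` is taken from the v0 API docs of that era (verify against your checkout), and `limit`/`includes`/`excludes` belong under `crawlerOptions` rather than `pageOptions` as I read the v0 API, which is itself worth checking as a possible cause here:

```python
import json
import time
import urllib.request

BASE = "http://localhost:3002"  # default port from CONTRIBUTING.md


def build_crawl_payload(url: str, limit: int = 10) -> dict:
    # Mirrors the request body from the report, but with limit moved under
    # crawlerOptions (my assumption about the v0 schema -- verify locally).
    return {
        "url": url,
        "pageOptions": {"onlyMainContent": True},
        "crawlerOptions": {"limit": limit},
    }


def start_crawl(url: str) -> str:
    req = urllib.request.Request(
        f"{BASE}/v0/crawl",
        data=json.dumps(build_crawl_payload(url)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The log above shows the server returns {'jobId': ...}
        return json.load(resp)["jobId"]


def poll_status(job_id: str, interval: float = 5.0) -> dict:
    # Assumed endpoint: GET /v0/crawl/status/:jobId
    while True:
        with urllib.request.urlopen(f"{BASE}/v0/crawl/status/{job_id}") as resp:
            status = json.load(resp)
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(interval)


# Usage, against a running local instance:
#   job_id = start_crawl("https://www.campigliodolomiti.it")
#   print(poll_status(job_id)["status"])
```

On the setup described in this issue the poll loop never terminates, which matches the endless `Scrapers in order: fetch` output.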

nickscamara commented 4 months ago

Hey @SalamanderXing, thanks for opening the issue. We're investigating!

rafaelsideguide commented 4 months ago

Hey @SalamanderXing, it looks like the playground works well because our API uses multiple services and scraping methods. In your setup you might be using just one method, which is why it's not pulling up the page.

You can use the Firecrawl API key on your self-hosted setup too. We have a free plan that gives you 500 credits. This should help you get the same results as the playground.
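For anyone landing here later: the hosted fallbacks are wired into a self-hosted instance via environment variables in `apps/api`. A minimal sketch, assuming the variable names from the repo's self-hosting docs (verify them against your `.env.example`, they may have changed):

```
# .env fragment -- variable names assumed from .env.example, verify locally
SCRAPING_BEE_API_KEY=your-scrapingbee-key        # enables the ScrapingBee fallback
PLAYWRIGHT_MICROSERVICE_URL=http://localhost:3000/html   # self-hostable browser fallback
```

With neither set, only the plain `fetch` scraper is available, which matches the log output in this issue.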

beydogan commented 3 months ago

@SalamanderXing have you solved the issue? I have the same problem with my self-hosted instance. Some jobs basically get stuck in the "Active" queue forever, and I have no idea how to delete them.

SalamanderXing commented 3 months ago

@beydogan I did not. @rafaelsideguide, can you explain your suggested solution in more detail? I was using the API key connected to my local Firecrawl. But why do I need credits for a local setup?

rafaelsideguide commented 3 months ago

@SalamanderXing The playground works well because it uses a mix of services and scraping methods, not just one. This setup helps bypass blockers on websites effectively. For your self-hosted setup, using the Firecrawl API key might help, as it includes access to these multiple services.

You can check out the `getScrapingFallbackOrder` function in `single_url.ts` under `apps/api/src/scraper/WebScraper` for more details on how this works.
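To make the fallback idea concrete, here is a minimal illustrative sketch of the pattern that `getScrapingFallbackOrder` feeds into (this is not Firecrawl's actual code, and the scraper names are examples): try each available scraper in priority order and return the first usable result.

```python
from typing import Callable, Optional

# A scraper takes a URL and returns page content, or None on failure.
Scraper = Callable[[str], Optional[str]]


def scrape_with_fallback(url: str, scrapers: dict[str, Scraper],
                         order: list[str]) -> Optional[str]:
    # Walk the priority order; a scraper that returns None (blocked,
    # timeout, empty page) triggers the next one in line.
    for name in order:
        scraper = scrapers.get(name)
        if scraper is None:
            continue  # e.g. ScrapingBee not configured on a local setup
        result = scraper(url)
        if result:
            return result
    return None  # every configured scraper failed


# With only `fetch` configured (the situation in this issue), a site that
# blocks plain HTTP fetches leaves nothing to fall back to:
scrapers = {"fetch": lambda url: None}
print(scrape_with_fallback("https://www.campigliodolomiti.it",
                           scrapers, ["fetch", "scrapingBee", "playwright"]))
# prints None
```

This is why the hosted API succeeds where a bare local setup loops: the local order effectively contains a single entry.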

beydogan commented 3 months ago

Rafael, please don't get me wrong, I appreciate the open-source, self-hosted version of the project; it's super useful.

Why would someone need to self-host if they are going to pay for the API anyways?

SalamanderXing commented 3 months ago

I see, thank you for your reply @rafaelsideguide. Correct me if I'm wrong: the only one of these services that is not self-hostable is ScrapingBee, right? Is that the only service unavailable when I run Firecrawl locally?

@beydogan That might also answer your question: someone needs to pay for ScrapingBee.