mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
https://firecrawl.dev
GNU Affero General Public License v3.0

[BUG] Self Host exit with code 137 ELIFECYCLE #281

Open hugomoran159 opened 3 weeks ago

hugomoran159 commented 3 weeks ago

Describe the Bug
When self-hosting, worker-1 exits with error code 137 (ELIFECYCLE) while crawling https://www.thequays.co.uk. The issue does not occur for other URLs. firecrawl-py is used to make the API call to localhost, with the firecrawl.py file modified to remove the api_key requirement.

To Reproduce
Steps to reproduce the issue:

  1. Run Firecrawl locally with the following `.env`:

```bash
# ===== Required ENVS ======
NUM_WORKERS_PER_QUEUE=8
PORT=3002
HOST=0.0.0.0
REDIS_URL=redis://redis:6379
PLAYWRIGHT_MICROSERVICE_URL=http://[::]:3000/html

# To turn on DB authentication, you need to set up Supabase.
USE_DB_AUTHENTICATION=false

# ===== Optional ENVS ======

# Supabase Setup (used to support DB authentication, advanced logging, etc.)
SUPABASE_ANON_TOKEN=
SUPABASE_URL=
SUPABASE_SERVICE_TOKEN=

# Other Optionals
TEST_API_KEY= # use if you've set up authentication and want to test with a real API key
RATE_LIMIT_TEST_API_KEY_SCRAPE= # set if you'd like to test the scraping rate limit
RATE_LIMIT_TEST_API_KEY_CRAWL= # set if you'd like to test the crawling rate limit
SCRAPING_BEE_API_KEY= # set if you'd like to use ScrapingBee to handle JS blocking
OPENAI_API_KEY= # add for LLM-dependent features (image alt generation, etc.)
BULL_AUTH_KEY=
LOGTAIL_KEY= # use if you're configuring basic logging with Logtail
LLAMAPARSE_API_KEY= # set if you have a LlamaParse key you'd like to use to parse PDFs
SERPER_API_KEY= # set if you have a Serper key you'd like to use as a search API
SLACK_WEBHOOK_URL= # set if you'd like to send Slack server health status messages
POSTHOG_API_KEY= # set if you'd like to send PostHog events like job logs
POSTHOG_HOST= # set if you'd like to send PostHog events like job logs

STRIPE_PRICE_ID_STANDARD=
STRIPE_PRICE_ID_SCALE=
STRIPE_PRICE_ID_STARTER=
STRIPE_PRICE_ID_HOBBY=
STRIPE_PRICE_ID_HOBBY_YEARLY=
STRIPE_PRICE_ID_STANDARD_NEW=
STRIPE_PRICE_ID_STANDARD_NEW_YEARLY=
STRIPE_PRICE_ID_GROWTH=
STRIPE_PRICE_ID_GROWTH_YEARLY=

HYPERDX_API_KEY=
HDX_NODE_BETA_MODE=1

FIRE_ENGINE_BETA_URL= # set if you'd like to use the fire engine closed beta

# Proxy Settings for Playwright (alternatively you can use a proxy service like Oxylabs, which rotates IPs for you on every request)
PROXY_SERVER=
PROXY_USERNAME=
PROXY_PASSWORD=

# set if you'd like to block media requests to save proxy bandwidth
BLOCK_MEDIA=

# Set this to the URL of your webhook when using the self-hosted version of Firecrawl
SELF_HOSTED_WEBHOOK_URL=

# Resend API key for transactional emails
RESEND_API_KEY=
```


  2. Run the command:
     `firecrawl.crawl_url(url='www.thequays.co.uk', params={"pageOptions": {"onlyMainContent": True}})`

  3. Log output/error message:
```bash
worker-1              | [Playwright] Error fetching url: https://thequays.co.uk/media/uploads/lush4.jpg -> Error: getaddrinfo ENOTFOUND [::]
worker-1              | Falling back to scrapingBeeLoad
worker-1              | [ScrapingBee][c] Error fetching url: https://thequays.co.uk/media/uploads/lush3.jpg -> AxiosError: Request failed with status code 401
worker-1              | Falling back to fetch
worker-1              | [ScrapingBee][c] Error fetching url: https://thequays.co.uk/media/uploads/lush4.jpg -> AxiosError: Request failed with status code 401
worker-1              | Falling back to fetch
worker-1              | [ScrapingBee][c] Error fetching url: https://thequays.co.uk/media/uploads/lush1.jpg -> AxiosError: Request failed with status code 401
worker-1              | Falling back to fetch
worker-1              | Killed
worker-1              |  ELIFECYCLE  Command failed with exit code 137.
worker-1 exited with code 137
redis-1               | 1:M 14 Jun 2024 14:15:01.044 * 100 changes in 300 seconds. Saving...
redis-1               | 1:M 14 Jun 2024 14:15:01.055 * Background saving started by pid 44
redis-1               | 44:C 14 Jun 2024 14:15:07.079 * DB saved on disk
redis-1               | 44:C 14 Jun 2024 14:15:07.080 * Fork CoW for RDB: current 3 MB, peak 3 MB, average 2 MB
redis-1               | 1:M 14 Jun 2024 14:15:07.099 * Background saving terminated with success
```
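For anyone reproducing this against a self-hosted instance without the modified firecrawl-py, the same request can be issued directly over HTTP. This is a minimal sketch, assuming the default self-hosted endpoint `http://localhost:3002` and the `/v0/crawl` route; `build_crawl_request` is a hypothetical helper, not part of firecrawl-py:

```python
import json

def build_crawl_request(url: str, only_main_content: bool = True) -> dict:
    """Build the same JSON body that firecrawl.crawl_url(...) sends."""
    return {
        "url": url,
        "pageOptions": {"onlyMainContent": only_main_content},
    }

payload = build_crawl_request("https://www.thequays.co.uk")
print(json.dumps(payload))
# Send it with any HTTP client against the self-hosted API, e.g.:
#   curl -X POST http://localhost:3002/v0/crawl \
#        -H 'Content-Type: application/json' \
#        -d '{"url": "https://www.thequays.co.uk", "pageOptions": {"onlyMainContent": true}}'
```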

Expected Behavior
The crawler finishes and outputs the content JSON, worker-1 does not exit, and the next API request can be processed.
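Worth noting: exit code 137 is 128 + 9, i.e. the process was killed with SIGKILL. In a Docker setup that almost always means the kernel OOM killer or a container memory limit terminated the worker, which fits the large-payload behaviour below. A quick sanity check of the arithmetic:

```python
import signal

exit_code = 137
# Shells and Docker report death-by-signal as 128 + signal number.
sig = signal.Signals(exit_code - 128)
print(sig.name)  # SIGKILL
```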

Environment (please complete the following information):

Also note: for self-hosting, the Playwright path in .env.example needs to be changed to http://[::]:3000/html
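One observation on the `getaddrinfo ENOTFOUND [::]` error above: `[::]` is the IPv6 unspecified (bind-all) address, and the worker appears to be passing it to DNS lookup as a literal hostname, which cannot resolve. Pointing `PLAYWRIGHT_MICROSERVICE_URL` at a resolvable name instead (e.g. the docker-compose service name, assuming it is `playwright-service`, or `localhost` outside compose) may avoid the Playwright fallback entirely. A small sketch of how that URL parses:

```python
from urllib.parse import urlparse

# The bracketed IPv6 host survives parsing as "::" -- a bind
# address, not something a client can usefully connect to.
u = urlparse("http://[::]:3000/html")
print(u.hostname, u.port)  # :: 3000
```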

mattjoyce commented 3 weeks ago

Scrape returns a metadata result for me:

```python
{'title': 'Home | The Quays Shopping Centre, Newry - Northern Ireland', 'description': 'Have It All - The Quays Shopping Centre, Newry. Gift Cards | Shops | Food | Cinema', 'keywords': 'The Quays Shopping Centre Newry Northern Ireland 2022', 'robots': 'ALL', 'ogTitle': 'Home | The Quays Shopping Centre, Newry - Northern Ireland', 'ogDescription': 'Have It All - The Quays Shopping Centre, Newry. Gift Cards | Shops | Food | Cinema', 'ogUrl': 'thequays.co.uk', 'ogImage': '~/images/general/logo.png', 'ogLocaleAlternate': [], 'sourceURL': 'https://www.thequays.co.uk/'}
```

The worker did not crash; I'm using a fresh main.

mattjoyce commented 3 weeks ago

`/crawl`, on the other hand, does not work. Playwright times out or crashes, scrape falls back to fetch, and after a long time it returns a very large payload.

mattjoyce commented 3 weeks ago
```bash
playwright-service-1  | [2024-06-14 22:22:00 +0000] [9] [ERROR] Error in ASGI Framework
...
playwright-service-1  | playwright._impl._errors.TimeoutError: Timeout 15000ms exceeded.
```