mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
https://firecrawl.dev
GNU Affero General Public License v3.0

[BUG] Self Host exit with code 137 ELIFECYCLE #281

Open hugomoran159 opened 3 weeks ago

hugomoran159 commented 3 weeks ago

Describe the Bug
When self-hosting, worker-1 exits with error code 137 (ELIFECYCLE) while crawling https://www.thequays.co.uk. The issue does not occur for other URLs. firecrawl-py is used to make the API call to localhost, with the firecrawl.py file modified to remove the api_key requirement.

To Reproduce
Steps to reproduce the issue:

  1. Run Firecrawl locally with the following `.env`:

```bash
# ===== Required ENVS ======
NUM_WORKERS_PER_QUEUE=8
PORT=3002
HOST=0.0.0.0
REDIS_URL=redis://redis:6379
PLAYWRIGHT_MICROSERVICE_URL=http://[::]:3000/html

# To turn on DB authentication, you need to set up Supabase.
USE_DB_AUTHENTICATION=false

# ===== Optional ENVS ======

# Supabase Setup (used to support DB authentication, advanced logging, etc.)
SUPABASE_ANON_TOKEN=
SUPABASE_URL=
SUPABASE_SERVICE_TOKEN=

# Other Optionals
TEST_API_KEY= # use if you've set up authentication and want to test with a real API key
RATE_LIMIT_TEST_API_KEY_SCRAPE= # set if you'd like to test the scraping rate limit
RATE_LIMIT_TEST_API_KEY_CRAWL= # set if you'd like to test the crawling rate limit
SCRAPING_BEE_API_KEY= # set if you'd like to use ScrapingBee to handle JS blocking
OPENAI_API_KEY= # add for LLM-dependent features (image alt generation, etc.)
BULL_AUTH_KEY=
LOGTAIL_KEY= # use if you're configuring basic logging with Logtail
LLAMAPARSE_API_KEY= # set if you have a LlamaParse key you'd like to use to parse PDFs
SERPER_API_KEY= # set if you have a Serper key you'd like to use as a search API
SLACK_WEBHOOK_URL= # set if you'd like to send Slack server health status messages
POSTHOG_API_KEY= # set if you'd like to send PostHog events like job logs
POSTHOG_HOST= # set if you'd like to send PostHog events like job logs

STRIPE_PRICE_ID_STANDARD=
STRIPE_PRICE_ID_SCALE=
STRIPE_PRICE_ID_STARTER=
STRIPE_PRICE_ID_HOBBY=
STRIPE_PRICE_ID_HOBBY_YEARLY=
STRIPE_PRICE_ID_STANDARD_NEW=
STRIPE_PRICE_ID_STANDARD_NEW_YEARLY=
STRIPE_PRICE_ID_GROWTH=
STRIPE_PRICE_ID_GROWTH_YEARLY=

HYPERDX_API_KEY=
HDX_NODE_BETA_MODE=1

FIRE_ENGINE_BETA_URL= # set if you'd like to use the fire engine closed beta

# Proxy Settings for Playwright (alternatively you can use a proxy service like Oxylabs, which rotates IPs for you on every request)
PROXY_SERVER=
PROXY_USERNAME=
PROXY_PASSWORD=

# set if you'd like to block media requests to save proxy bandwidth
BLOCK_MEDIA=

# Set this to the URL of your webhook when using the self-hosted version of Firecrawl
SELF_HOSTED_WEBHOOK_URL=

# Resend API key for transactional emails
RESEND_API_KEY=
```


  2. Run the command:
     `firecrawl.crawl_url(url='www.thequays.co.uk', params={"pageOptions": {"onlyMainContent": True}})`

  3. Log output/error message:
```bash
worker-1              | [Playwright] Error fetching url: https://thequays.co.uk/media/uploads/lush4.jpg -> Error: getaddrinfo ENOTFOUND [::]
worker-1              | Falling back to scrapingBeeLoad
worker-1              | [ScrapingBee][c] Error fetching url: https://thequays.co.uk/media/uploads/lush3.jpg -> AxiosError: Request failed with status code 401
worker-1              | Falling back to fetch
worker-1              | [ScrapingBee][c] Error fetching url: https://thequays.co.uk/media/uploads/lush4.jpg -> AxiosError: Request failed with status code 401
worker-1              | Falling back to fetch
worker-1              | [ScrapingBee][c] Error fetching url: https://thequays.co.uk/media/uploads/lush1.jpg -> AxiosError: Request failed with status code 401
worker-1              | Falling back to fetch
worker-1              | Killed
worker-1              |  ELIFECYCLE  Command failed with exit code 137.
worker-1 exited with code 137
redis-1               | 1:M 14 Jun 2024 14:15:01.044 * 100 changes in 300 seconds. Saving...
redis-1               | 1:M 14 Jun 2024 14:15:01.055 * Background saving started by pid 44
redis-1               | 44:C 14 Jun 2024 14:15:07.079 * DB saved on disk
redis-1               | 44:C 14 Jun 2024 14:15:07.080 * Fork CoW for RDB: current 3 MB, peak 3 MB, average 2 MB
redis-1               | 1:M 14 Jun 2024 14:15:07.099 * Background saving terminated with success
```
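For anyone reproducing this against a self-hosted instance without the modified firecrawl-py, the same request can be issued directly over HTTP. This is a minimal sketch, assuming the default self-hosted endpoint `http://localhost:3002` and the `/v0/crawl` route; `build_crawl_request` is a hypothetical helper, not part of firecrawl-py:

```python
import json

def build_crawl_request(url: str, only_main_content: bool = True) -> dict:
    """Build the same JSON body that firecrawl.crawl_url(...) sends."""
    return {
        "url": url,
        "pageOptions": {"onlyMainContent": only_main_content},
    }

payload = build_crawl_request("https://www.thequays.co.uk")
print(json.dumps(payload))
# Send it with any HTTP client against the self-hosted API, e.g.:
#   curl -X POST http://localhost:3002/v0/crawl \
#        -H 'Content-Type: application/json' \
#        -d '{"url": "https://www.thequays.co.uk", "pageOptions": {"onlyMainContent": true}}'
```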

Expected Behavior
The crawler finishes and outputs the content JSON, worker-1 does not exit, and the next API request can be processed.
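Worth noting: exit code 137 is 128 + 9, i.e. the process was killed with SIGKILL. In a Docker setup that almost always means the kernel OOM killer or a container memory limit terminated the worker, which fits the large-payload behaviour below. A quick sanity check of the arithmetic:

```python
import signal

exit_code = 137
# Shells and Docker report death-by-signal as 128 + signal number.
sig = signal.Signals(exit_code - 128)
print(sig.name)  # SIGKILL
```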

Environment (please complete the following information):

Also note: for self-hosting, the Playwright path in .env.example needs to be changed to http://[::]:3000/html
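One observation on the `getaddrinfo ENOTFOUND [::]` error above: `[::]` is the IPv6 unspecified (bind-all) address, and the worker appears to be passing it to DNS lookup as a literal hostname, which cannot resolve. Pointing `PLAYWRIGHT_MICROSERVICE_URL` at a resolvable name instead (e.g. the docker-compose service name, assuming it is `playwright-service`, or `localhost` outside compose) may avoid the Playwright fallback entirely. A small sketch of how that URL parses:

```python
from urllib.parse import urlparse

# The bracketed IPv6 host survives parsing as "::" -- a bind
# address, not something a client can usefully connect to.
u = urlparse("http://[::]:3000/html")
print(u.hostname, u.port)  # :: 3000
```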

mattjoyce commented 3 weeks ago

Scrape returns a metadata result for me:

```python
{'title': 'Home | The Quays Shopping Centre, Newry - Northern Ireland', 'description': 'Have It All - The Quays Shopping Centre, Newry. Gift Cards | Shops | Food | Cinema', 'keywords': 'The Quays Shopping Centre Newry Northern Ireland 2022', 'robots': 'ALL', 'ogTitle': 'Home | The Quays Shopping Centre, Newry - Northern Ireland', 'ogDescription': 'Have It All - The Quays Shopping Centre, Newry. Gift Cards | Shops | Food | Cinema', 'ogUrl': 'thequays.co.uk', 'ogImage': '~/images/general/logo.png', 'ogLocaleAlternate': [], 'sourceURL': 'https://www.thequays.co.uk/'}
```

The worker did not crash; I'm using a fresh main.

mattjoyce commented 3 weeks ago

`/crawl`, on the other hand, does not work. Playwright times out or crashes, scrape falls back to fetch, and after a long time it returns a very large payload.

mattjoyce commented 3 weeks ago
```bash
playwright-service-1  | [2024-06-14 22:22:00 +0000] [9] [ERROR] Error in ASGI Framework
...
playwright-service-1  | playwright._impl._errors.TimeoutError: Timeout 15000ms exceeded.
```