mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
https://firecrawl.dev
GNU Affero General Public License v3.0

scraping fails on self hosted instance #467

Closed rostwal95 closed 1 month ago

rostwal95 commented 1 month ago

Describe the Bug When running Firecrawl locally and sending a crawl request, the request fails.

Expected Behavior Crawling should work without any issues.

Logs

api-1 | [2024-07-27T15:32:57.358Z]INFO - For the Queue UI, open: http://0.0.0.0:3002/admin/@/queues
api-1 | [2024-07-27T15:32:57.359Z]INFO - Connected to Redis Session Rate Limit Store!
api-1 | [2024-07-27T15:36:35.628Z]WARN - You're bypassing authentication
api-1 | [2024-07-27T15:36:35.629Z]WARN - You're bypassing authentication
api-1 | [2024-07-27T15:36:35.639Z]ERROR - Attempted to access Supabase client when it's not configured.
api-1 | [2024-07-27T15:36:35.640Z]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
worker-1 | [2024-07-27T15:36:35.646Z]ERROR - Attempted to access Supabase client when it's not configured.
worker-1 | [2024-07-27T15:36:35.647Z]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
worker-1 | [2024-07-27T15:36:37.074Z]INFO - ⛏️ Fire-Engine (tlsclient): Scraping https://timesofindia.indiatimes.com//sitemap.xml | params: { wait: 0, screenshot: false, method: null }
worker-1 | [2024-07-27T15:36:37.481Z]INFO - ⛏️ Fire-Engine (tlsclient): Scraping https://timesofindia.indiatimes.com/sitemap.xml | params: { wait: 0, screenshot: false, method: null }
worker-1 | [2024-07-27T15:36:37.492Z]ERROR - Attempted to access Supabase client when it's not configured.
worker-1 | [2024-07-27T15:36:37.492Z]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
worker-1 | [2024-07-27T15:36:38.574Z]ERROR - Attempted to access Supabase client when it's not configured.
worker-1 | [2024-07-27T15:36:38.575Z]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
worker-1 | [2024-07-27T15:36:38.772Z]ERROR - Attempted to access Supabase client when it's not configured.
worker-1 | [2024-07-27T15:36:38.772Z]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
worker-1 | [2024-07-27T15:36:43.264Z]ERROR - Attempted to access Supabase client when it's not configured.
worker-1 | [2024-07-27T15:36:43.265Z]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
worker-1 | [2024-07-27T15:36:43.265Z]ERROR - Attempted to access Supabase client when it's not configured.
worker-1 | [2024-07-27T15:36:43.265Z]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
worker-1 | [2024-07-27T15:36:43.265Z]ERROR - Attempted to access Supabase client when it's not configured.


rafaelsideguide commented 1 month ago

@rostwal95 Have you configured the .env file with USE_DB_AUTHENTICATION=false?

rostwal95 commented 1 month ago

Yes, I did, but I still see the issue.

Below is my .env:

# ===== Required ENVS ======
NUM_WORKERS_PER_QUEUE=8
PORT=3002
HOST=0.0.0.0
REDIS_URL=redis://redis:6379
REDIS_RATE_LIMIT_URL=redis://localhost:6379
PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/html

## To turn on DB authentication, you need to set up supabase.
USE_DB_AUTHENTICATION=false

docker-compose -

x-common-service: &common-service
  build: apps/api
  networks:

rafaelsideguide commented 1 month ago

Could you try replacing your .env with the following?

# ===== Required ENVS ======
NUM_WORKERS_PER_QUEUE=8
PORT=3002
HOST=0.0.0.0
REDIS_URL=redis://redis:6379
REDIS_RATE_LIMIT_URL=redis://redis:6379

## To turn on DB authentication, you need to set up supabase.
USE_DB_AUTHENTICATION=false

# ===== Optional ENVS ======

# Supabase Setup (used to support DB authentication, advanced logging, etc.)
SUPABASE_ANON_TOKEN=
SUPABASE_URL=
SUPABASE_SERVICE_TOKEN=

# Other Optionals
TEST_API_KEY= # use if you've set up authentication and want to test with a real API key
SCRAPING_BEE_API_KEY= # Set if you'd like to use ScrapingBee to handle JS blocking
OPENAI_API_KEY= # Add for LLM-dependent features (image alt generation, etc.)
BULL_AUTH_KEY= @
LOGTAIL_KEY= # Use if you're configuring basic logging with logtail
PLAYWRIGHT_MICROSERVICE_URL=  # set if you'd like to run a playwright fallback
LLAMAPARSE_API_KEY= #Set if you have a llamaparse key you'd like to use to parse pdfs
SERPER_API_KEY= #Set if you have a serper key you'd like to use as a search api
SLACK_WEBHOOK_URL= # set if you'd like to send slack server health status messages
POSTHOG_API_KEY= # set if you'd like to send posthog events like job logs
POSTHOG_HOST= # set if you'd like to send posthog events like job logs

Then rebuild the containers and retry, making sure you're not using cached containers.
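Before rebuilding, it can help to sanity-check the file for repeated keys or a DB-auth flag that is not `false`. The following is a minimal sketch, not part of Firecrawl: `parse_env` and `check_env` are hypothetical helper names, and the parser only handles simple `KEY=value` lines with optional `#` comments.

```python
from collections import Counter


def parse_env(text):
    """Parse simple KEY=value lines; return (values, duplicate_keys)."""
    values = {}
    seen = Counter()
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, full-line comments, and anything without an "=".
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Drop a trailing inline comment such as "KEY= # note".
        # Limitation: a value that itself contains "#" would be cut short.
        value = value.split("#", 1)[0].strip()
        seen[key] += 1
        values[key] = value  # the last occurrence wins
    duplicates = sorted(k for k, n in seen.items() if n > 1)
    return values, duplicates


def check_env(text):
    """Return a list of human-readable warnings for a self-hosted setup."""
    values, duplicates = parse_env(text)
    warnings = [f"duplicate key: {k}" for k in duplicates]
    if values.get("USE_DB_AUTHENTICATION", "").lower() != "false":
        warnings.append("USE_DB_AUTHENTICATION is not set to false")
    return warnings


if __name__ == "__main__":
    sample = "PORT=3002\nREDIS_URL=redis://redis:6379\nUSE_DB_AUTHENTICATION=false\n"
    print(check_env(sample))
```

An empty list means the sketch found nothing suspicious; it does not validate the values themselves.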

rostwal95 commented 1 month ago

Thank you Rafael for your reply and looking into this -

I used these env variables, rebuilt the containers, and tried running it again, but I still see the same issue. Below are the container logs after rebuilding:

2024-08-01 00:34:20 ! Corepack is about to download https://registry.npmjs.org/pnpm/-/pnpm-9.6.0.tgz
2024-08-01 00:34:24 [2024-07-31T19:04:24.713Z]WARN - Authentication is disabled. Supabase client will not be initialized.
2024-08-01 00:35:05 [2024-07-31T19:05:05.149Z]ERROR - Attempted to access Supabase client when it's not configured.
2024-08-01 00:35:05 [2024-07-31T19:05:05.150Z]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
2024-08-01 00:34:22
2024-08-01 00:34:22 > firecrawl-scraper-js@1.0.0 workers /app
2024-08-01 00:34:22 > nodemon --exec ts-node src/services/queue-worker.ts
2024-08-01 00:34:22
2024-08-01 00:34:23 [nodemon] 2.0.22
2024-08-01 00:34:23 [nodemon] to restart at any time, enter rs
2024-08-01 00:34:23 [nodemon] watching path(s): .
2024-08-01 00:34:23 [nodemon] watching extensions: ts,json
2024-08-01 00:34:23 [nodemon] starting ts-node src/services/queue-worker.ts
2024-08-01 00:34:25 [2024-07-31T19:04:25.453Z]INFO - Web scraper queue created
2024-08-01 00:34:25 [2024-07-31T19:04:25.464Z]INFO - Connected to Redis Session Rate Limit Store!
2024-08-01 00:35:06 [2024-07-31T19:05:06.448Z]INFO - ⛏️ Fire-Engine (tlsclient): Scraping https://timesofindia.indiatimes.com//sitemap.xml | params: { wait: 0, screenshot: false, method: null }
2024-08-01 00:35:06 [2024-07-31T19:05:06.933Z]INFO - ⛏️ Fire-Engine (tlsclient): Scraping https://timesofindia.indiatimes.com/sitemap.xml | params: { wait: 0, screenshot: false, method: null }
2024-08-01 00:35:06 [2024-07-31T19:05:06.941Z]ERROR - Attempted to access Supabase client when it's not configured.
2024-08-01 00:35:06 [2024-07-31T19:05:06.941Z]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
2024-08-01 00:35:08 [2024-07-31T19:05:08.380Z]ERROR - Attempted to access Supabase client when it's not configured.
2024-08-01 00:35:08 [2024-07-31T19:05:08.381Z]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
2024-08-01 00:35:08 [2024-07-31T19:05:08.596Z]ERROR - Attempted to access Supabase client when it's not configured.
2024-08-01 00:35:08 [2024-07-31T19:05:08.596Z]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
2024-08-01 00:35:11 [2024-07-31T19:05:11.135Z]ERROR - Attempted to access Supabase client when it's not configured.
2024-08-01 00:35:11 [2024-07-31T19:05:11.135Z]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
2024-08-01 00:35:11 [2024-07-31T19:05:11.135Z]ERROR - Attempted to access Supabase client when it's not configured.
2024-08-01 00:35:11 [2024-07-31T19:05:11.135Z]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
2024-08-01 00:35:11 [2024-07-31T19:05:11.135Z]ERROR - Attempted to access Supabase client when it's not configured.
2024-08-01 00:35:11 [2024-07-31T19:05:11.135Z]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
2024-08-01 00:35:11 [2024-07-31T19:05:11.135Z]ERROR - Attempted to access Supabase client when it's not configured.
2024-08-01 00:35:11 [2024-07-31T19:05:11.135Z]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
2024-08-01 00:35:11 [2024-07-31T19:05:11.135Z]ERROR - Attempted to access Supabase client when it's not configured.
2024-08-01 00:35:11 [2024-07-31T19:05:11.135Z]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
2024-08-01 00:35:11 [2024-07-31T19:05:11.135Z]ERROR - Attempted to access Supabase client when it's not configured.
2024-08-01 00:35:11 [2024-07-31T19:05:11.135Z]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
2024-08-01 00:35:11 [2024-07-31T19:05:11.135Z]ERROR - Attempted to access Supabase client when it's not configured.
2024-08-01 00:35:11 [2024-07-31T19:05:11.135Z]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
2024-08-01 00:35:11 [2024-07-31T19:05:11.136Z]ERROR - Attempted to access Supabase client when it's not configured.
2024-08-01 00:35:11 [2024-07-31T19:05:11.136Z]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
2024-08-01 00:35:11 [2024-07-31T19:05:11.136Z]ERROR - Attempted to access Supabase client when it's not configured.
2024-08-01 00:35:11 [2024-07-31T19:05:11.136Z]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
2024-08-01 00:35:11 [2024-07-31T19:05:11.136Z]ERROR - Attempted to access Supabase client when it's not configured.
2024-08-01 00:35:11 [2024-07-31T19:05:11.136Z]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
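With this many repeated entries, it is easier to judge the logs after grouping them by level and message. The sketch below is one way to do that, assuming the log has been split to one entry per line and that entries follow the `[timestamp]LEVEL - message` shape seen above; `summarize` and `other_errors` are hypothetical helper names.

```python
import re
from collections import Counter

# Matches entries such as "[2024-07-31T19:05:05.149Z]ERROR - message".
ENTRY = re.compile(r"\[(?P<ts>[^\]]+)\](?P<level>[A-Z]+) - (?P<msg>.*)")


def summarize(lines):
    """Count (level, message) pairs; lines that do not match are ignored."""
    counts = Counter()
    for line in lines:
        match = ENTRY.search(line)
        if match:
            counts[(match["level"], match["msg"].strip())] += 1
    return counts


def other_errors(counts, expected_marker="Supabase"):
    """Return ERROR messages that do not mention the expected marker."""
    return sorted(
        msg for (level, msg) in counts
        if level == "ERROR" and expected_marker not in msg
    )


if __name__ == "__main__":
    import sys

    # Usage: pipe container logs in, e.g. `python triage.py < worker.log`.
    summary = summarize(sys.stdin)
    for (level, msg), n in summary.most_common():
        print(f"{n:4d}  {level:5s}  {msg}")
    print("errors not mentioning Supabase:", other_errors(summary))
```

If `other_errors` comes back empty, every ERROR entry is one of the Supabase messages, and the `Scraping ...` INFO lines show whether the worker still picked up the job.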

rafaelsideguide commented 1 month ago

@rostwal95, it seems to be working as expected. Are you able to retrieve the content when you post a scrape request?
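One way to run that check against a local instance is sketched below. It assumes the API listens on port 3002 (as in the .env above) and exposes a `/v0/scrape` route that accepts a JSON body with a `url` field; the route, the response shape, and the helper names are assumptions to verify against the Firecrawl docs for your version.

```python
import json
import urllib.request


def build_scrape_request(target_url, base_url="http://localhost:3002"):
    """Build (but do not send) a POST request for the assumed /v0/scrape route."""
    body = json.dumps({"url": target_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v0/scrape",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scrape(target_url, base_url="http://localhost:3002", timeout=60):
    """Send the request and return the decoded JSON response."""
    request = build_scrape_request(target_url, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Requires a running self-hosted instance; prints whatever JSON comes back.
    print(scrape("https://example.com"))
```

If content comes back, the Supabase errors in the logs are not blocking the scrape itself.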

rostwal95 commented 1 month ago

Thanks, it works :) I am looking to host this on a Kubernetes cluster, but the link mentioned in the documentation is broken. Could you please point me to a guide?

https://github.com/mendableai/firecrawl/blob/main/examples/kubernetes-cluster-install/README.md

Also, do we support MFA? Any pointers on this would help.

rafaelsideguide commented 1 month ago

You can check the Kubernetes example at: https://github.com/mendableai/firecrawl/tree/main/examples/kubernetes/cluster-install

Regarding MFA: we do not support it yet. For scraping and crawling, you currently need to use authentication headers, as detailed in the scrape and crawl documentation.

As the issue is solved, I'm closing this for now.
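For the authentication-header approach mentioned above, a request body might look like the sketch below. The field that carries custom headers is an assumption here (`pageOptions.headers`, passed in as a parameter so it is easy to change), and `scrape_payload` is a hypothetical helper; check the scrape and crawl documentation for the exact name in your version.

```python
import json


def scrape_payload(target_url, auth_headers, headers_field="pageOptions"):
    """Return a JSON body that nests custom headers under the assumed field."""
    return json.dumps(
        {"url": target_url, headers_field: {"headers": dict(auth_headers)}}
    )


if __name__ == "__main__":
    # The token is a placeholder, not a real credential.
    print(scrape_payload("https://example.com/private", {"Authorization": "Bearer <token>"}))
```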