mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
https://firecrawl.dev
GNU Affero General Public License v3.0
15.28k stars 1.11k forks

After I deployed the project locally, it couldn't crawl the webpage and kept waiting #340

Closed · ahasasjeb closed 2 months ago

ahasasjeb commented 3 months ago

I deployed the project with Docker and submitted two URLs through the Python SDK, but the queue UI kept showing the jobs as waiting.
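For context, here is a minimal sketch of the kind of client call involved. The port, endpoint path (`/v0/crawl`), and payload keys are assumptions for illustration against a self-hosted instance, not copied from the SDK or this thread:

```python
import json
from urllib.request import Request

# Base URL of a locally deployed Firecrawl API (port 3002 per the logs below).
BASE_URL = "http://localhost:3002"

def build_crawl_request(url: str, limit: int = 10) -> Request:
    """Construct (but do not send) a POST request that enqueues a crawl job.

    The payload shape here is an assumption based on the self-hosted v0 API.
    """
    payload = {"url": url, "crawlerOptions": {"limit": limit}}
    return Request(
        f"{BASE_URL}/v0/crawl",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_crawl_request("https://mendable.ai")
```

Sending such a request returns a job ID; the job then sits in the Redis-backed queue until a worker picks it up, which is where the "waiting" state below comes from.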

! Corepack is about to download https://registry.npmjs.org/pnpm/-/pnpm-9.4.0.tgz

> firecrawl-scraper-js@1.0.0 start:production /app
> tsc && node dist/src/index.js

Authentication is disabled. Supabase client will not be initialized.
POSTHOG_API_KEY is not provided - your events will not be logged. Using MockPostHog as a fallback. See posthog.ts for more.
Number of CPUs: 2 available
Master 32 is running
Connected to Redis Session Store!
Authentication is disabled. Supabase client will not be initialized.
Authentication is disabled. Supabase client will not be initialized.
POSTHOG_API_KEY is not provided - your events will not be logged. Using MockPostHog as a fallback. See posthog.ts for more.
Number of CPUs: 2 available
Web scraper queue created
Worker 39 started
Worker 39 listening on port 3002
For the UI, open http://0.0.0.0:3002/admin//queues

1. Make sure Redis is running on port 6379 by default
2. If you want to run nango, make sure you do port forwarding in 3002 using ngrok http 3002 
Connected to Redis Session Store!
POSTHOG_API_KEY is not provided - your events will not be logged. Using MockPostHog as a fallback. See posthog.ts for more.
Number of CPUs: 2 available
Web scraper queue created
Worker 41 started
Worker 41 listening on port 3002
For the UI, open http://0.0.0.0:3002/admin//queues

1. Make sure Redis is running on port 6379 by default
2. If you want to run nango, make sure you do port forwarding in 3002 using ngrok http 3002 
Connected to Redis Session Store!
WARNING - You're bypassing authentication
WARNING - You're bypassing authentication
[Playwright] Error fetching url: https://mendable.ai -> AxiosError: Request failed with status code 404
Falling back to fetch
WARNING - You're bypassing authentication
WARNING - You're bypassing authentication
WARNING - You're bypassing authentication
Attempted to access Supabase client when it's not configured.
Error logging crawl job:
 Error: Supabase client is not configured.
    at Proxy.<anonymous> (/app/dist/src/services/supabase.js:38:23)
    at logCrawl (/app/dist/src/services/logging/crawl_log.js:9:14)
    at crawlController (/app/dist/src/controllers/crawl.js:86:40)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
WARNING - You're bypassing authentication
WARNING - You're bypassing authentication
WARNING - You're bypassing authentication
WARNING - You're bypassing authentication
WARNING - You're bypassing authentication
[Playwright] Error fetching url: https://mendable.ai -> AxiosError: Request failed with status code 404
Falling back to fetch
[Playwright] Error fetching url: https://mendable.ai -> AxiosError: Request failed with status code 404
Falling back to fetch
Attempted to access Supabase client when it's not configured.
Error logging crawl job:
 Error: Supabase client is not configured.
    at Proxy.<anonymous> (/app/dist/src/services/supabase.js:38:23)
    at logCrawl (/app/dist/src/services/logging/crawl_log.js:9:14)
    at crawlController (/app/dist/src/controllers/crawl.js:86:40)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)

(screenshot of the admin queue UI attached, showing the jobs stuck in waiting)

leo8198 commented 3 months ago

Setting the env variable PLAYWRIGHT_MICROSERVICE_URL to http://playwright-service:3000/html should solve the issue.
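In a docker-compose deployment this would go in the API (and worker) service environment. A sketch, assuming the Playwright container is named `playwright-service` in your compose file:

```yaml
  api:
    environment:
      - PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/html
```

The hostname must match the compose service name, since containers resolve each other by service name on the compose network.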

rafaelsideguide commented 3 months ago

@ahasasjeb did this solve the issue?

ahasasjeb commented 3 months ago

I haven't tested yet; I haven't had much time lately.

notV3NOM commented 3 months ago

Facing the same issue

> Set the env variable PLAYWRIGHT_MICROSERVICE_URL to PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/html should solve the issue

This is already set

gubinjie commented 2 months ago

set USE_DB_AUTHENTICATION=false
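For reference, a self-hosted `.env` sketch combining the settings mentioned in this thread (the Redis URL is an assumption based on the default port shown in the logs, not confirmed here):

```
USE_DB_AUTHENTICATION=false
PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/html
REDIS_URL=redis://redis:6379
```

With USE_DB_AUTHENTICATION=false, the "Authentication is disabled" and "bypassing authentication" log lines above are expected and harmless.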

notV3NOM commented 2 months ago

> set USE_DB_AUTHENTICATION=false

Already set to false

kun432 commented 2 months ago

Facing the same issue. Set all the variables as above; tried both on my Mac as localhost and on an Ubuntu server in my LAN, and the same issue happened.

BTW, /scrape works, but /crawl does not.

kun432 commented 2 months ago

Maybe this is needed in the compose file?

  worker:
    (snip)
    command: [ "pnpm", "run", "worker:production" ]
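Expanded into a fuller compose fragment (a sketch; the build context and Redis wiring are assumptions based on a typical setup for this repo, not confirmed in this thread):

```yaml
  worker:
    build: apps/api                 # same image as the api service (assumption)
    environment:
      - REDIS_URL=redis://redis:6379
    depends_on:
      - redis
    command: [ "pnpm", "run", "worker:production" ]
```

Without a worker container, /crawl jobs are enqueued in Redis but never picked up, which matches the permanent "waiting" state reported above (and explains why the synchronous /scrape endpoint still works).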
rafaelsideguide commented 2 months ago

@kun432 this makes a lot of sense! It looks like there's no worker for dealing with crawling jobs. Have you tested this solution?

kun432 commented 2 months ago

@rafaelsideguide Yep. Before adding the line above, the worker started and soon exited with code 0. Now the jobs run and complete.

rafaelsideguide commented 2 months ago

Just opened a PR (#396) as a fix for that. Awesome, @kun432! Thank you so much for noticing it.

nickscamara commented 2 months ago

Merged! Closing this. Thanks @kun432