mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
https://firecrawl.dev
GNU Affero General Public License v3.0

[Bug] Self-Hosted: /scrape and /crawl endpoints don't respond #713

Open twilwa opened 4 hours ago

twilwa commented 4 hours ago

Describe the Bug

Note: I'll mention that I've deployed via Coolify & docker-compose, so my setup might be a little wonky. That said, if there's anything to check, I'd love some direction.

When calling /scrape:

[2024-09-28T20:57:32.113Z]DEBUG - Fetching sitemap links from https://mendable.ai
[2024-09-28T21:01:07.403Z]WARN - You're bypassing authentication
[2024-09-28T21:01:07.403Z]WARN - You're bypassing authentication
[2024-09-28T21:01:07.525Z]DEBUG - [Crawl] Failed to get robots.txt (this is probably fine!): {"message":"Request failed with status code 404","name":"AxiosError","stack":"AxiosError: Request failed with status code 404\n at settle (/app/node_modules/.pnpm/axios@1.7.2/node_modules/axios/dist/node/axios.cjs:1983:12)\n at BrotliDecompress.handleStreamEnd (/app/node_modules/.pnpm/axios@1.7.2/node_modules/axios/dist/node/axios.cjs:3085:11)\n at BrotliDecompress.emit (node:events:531:35)\n at endReadableNT (node:internal/streams/readable:1696:12)\n at process.processTicksAndRejections (node:internal/process/task_queues:82:21)\n at Axios.request (/app/node_modules/.pnpm/axios@1.7.2/node_modules/axios/dist/node/axios.cjs:4224:41)\n at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n at async WebCrawler.getRobotsTxt (/app/dist/src/scraper/WebScraper/crawler.js:120:26)\n at async crawlController (/app/dist/src/controllers/v1/crawl.js:52:21)","config":{"transitional":{"silentJSONParsing":true,"forcedJSONParsing":true,"clarifyTimeoutError":false},"adapter":["xhr","http","fetch"],"transformRequest":[null],"transformResponse":[null],"timeout":3000,"xsrfCookieName":"XSRF-TOKEN","xsrfHeaderName":"X-XSRF-TOKEN","maxContentLength":-1,"maxBodyLength":-1,"env":{},"headers":{"Accept":"application/json, text/plain, */*","User-Agent":"axios/1.7.2","Accept-Encoding":"gzip, compress, deflate, br"},"method":"get","url":"https://mendable.ai/robots.txt","axios-retry":{"retries":3,"shouldResetTimeout":false,"validateResponse":null,"retryCount":0,"lastRequestTime":1727557267405}},"code":"ERR_BAD_REQUEST","status":404}

As far as I can tell it just hangs forever. That said, the requests that are getting returned seem to be succeeding:

anon@pop-os:~$ curl -X POST http://api-firecrawl.x-ware.online:3002/v1/crawl -H 'Content-Type: application/json' -d '{ "url": "https://mendable.ai" }'
{"success":true,"id":"35d7987d-e160-4a07-836f-0c776c3736ae","url":"https://api-firecrawl.x-ware.online:3002/v1/crawl/35d7987d-e160-4a07-836f-0c776c3736ae"}

And I can visit the corresponding job page:

{"success":true,"status":"scraping","completed":0,"total":1,"creditsUsed":1,"expiresAt":"2024-09-29T21:01:07.000Z","next":"https://api-firecrawl.x-ware.online:3002/v1/crawl/9f34da99-1022-490b-988b-65c4f2d9c8d2?skip=0","data":[]}

(This is a different job; I just had the tab open. All the mendable.ai attempts return like that, and I haven't tested much else.)
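For anyone trying to confirm the same symptom, here is a minimal sketch that polls a crawl job's status URL and flags a job that stays in "scraping" with nothing completed. It relies only on the fields visible in the status response above (`status`, `completed`, `total`); the helper names `fetch_json`, `looks_stalled`, and `watch_job`, as well as the poll count and delay, are hypothetical choices and not part of Firecrawl.

```python
import json
import time
import urllib.request


def fetch_json(url: str, timeout: float = 10.0) -> dict:
    """GET a URL and decode the JSON body; raises on HTTP errors or timeouts."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def looks_stalled(status: dict) -> bool:
    """True when the job reports 'scraping' but has completed no pages."""
    return status.get("status") == "scraping" and status.get("completed", 0) == 0


def watch_job(job_url: str, polls: int = 5, delay: float = 5.0) -> bool:
    """Poll the job a few times; return True if every poll looked stalled."""
    for attempt in range(polls):
        status = fetch_json(job_url)
        print(f"poll {attempt + 1}: status={status.get('status')} "
              f"completed={status.get('completed')}/{status.get('total')}")
        if not looks_stalled(status):
            return False
        time.sleep(delay)
    return True


if __name__ == "__main__":
    # Job URL copied from the curl response in this report.
    job = ("https://api-firecrawl.x-ware.online:3002/v1/crawl/"
           "35d7987d-e160-4a07-836f-0c776c3736ae")
    print("stalled" if watch_job(job) else "making progress")
```

If the job never leaves "scraping" across several polls, that points at the workers or the scraping backend rather than at the API container, which is clearly accepting requests.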

When calling /scrape, I get a timeout. When I try to visit api-firecrawl.x-ware.online (the domain I'm directing API traffic to) on port 3000, I do see the following simple HTML page: SCRAPERS-JS: Hello, world! Fly.io

To Reproduce

Steps to reproduce the issue:

  1. Deploy via Coolify through the 'repo' option with 'docker-compose' as the build utility.
  2. Set the following env vars:

     BLOCK_MEDIA=
     BULL_AUTH_KEY=
     HOST=0.0.0.0
     LLAMAPARSE_API_KEY=
     LOGGING_LEVEL=
     LOGTAIL_KEY=
     MODEL_NAME=gpt-4o
     NUM_WORKERS_PER_QUEUE=
     OPENAI_API_KEY=
     OPENAI_BASE_URL=
     PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000
     PORT=3002
     POSTHOG_API_KEY=
     POSTHOG_HOST=
     PROXY_PASSWORD=
     PROXY_SERVER=
     PROXY_USERNAME=
     REDIS_URL=redis://redis:6379
     SCRAPING_BEE_API_KEY=
     SELF_HOSTED_WEBHOOK_URL=
     SLACK_WEBHOOK_URL=
     SUPABASE_ANON_TOKEN=[redacted]
     SUPABASE_SERVICE_TOKEN=[redacted]
     SUPABASE_URL=https://supabasekong.x-ware.online/
     TEST_API_KEY=
     USE_DB_AUTHENTICATION=false

  3. Run the API calls described above and observe the error messages and container logs.
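The scrape call in the last step can be reproduced with an explicit client-side timeout, so a hang shows up as a clear error instead of an open-ended wait. This is a sketch only: it assumes `/v1/scrape` accepts the same `{"url": ...}` JSON body as the `/v1/crawl` call shown earlier, and the helper names `build_request` and `post_with_timeout` and the 60-second timeout are hypothetical.

```python
import json
import socket
import urllib.error
import urllib.request


def build_request(base_url: str, endpoint: str, target: str) -> urllib.request.Request:
    """Build a JSON POST request for the given endpoint and target URL."""
    body = json.dumps({"url": target}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_with_timeout(req: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request; return the response body or a short failure description."""
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except (socket.timeout, urllib.error.URLError) as exc:
        # Covers connect/read timeouts as well as HTTP error statuses.
        return f"request failed: {exc!r}"


if __name__ == "__main__":
    # Host and port taken from the curl command in this report.
    base = "http://api-firecrawl.x-ware.online:3002"
    for endpoint in ("/v1/scrape", "/v1/crawl"):
        print(endpoint, "->", post_with_timeout(build_request(base, endpoint, "https://mendable.ai")))
```

Running both endpoints side by side makes it easy to show that /crawl returns a job ID immediately while /scrape, which waits for the page to be fetched, is the call that times out.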

Expected Behavior Crawl and scrape function normally.

Logs Logs found above.

Additional Context Networking is handled by Traefik via Coolify.

nickscamara commented 4 hours ago

Thanks for the report @twilwa! That's quite odd. ccing @rafaelsideguide here to take a look