Open hugomoran159 opened 3 weeks ago
Here is the scrape metadata result I get:
{'title': 'Home | The Quays Shopping Centre, Newry - Northern Ireland', 'description': 'Have It All - The Quays Shopping Centre, Newry. Gift Cards | Shops | Food | Cinema', 'keywords': 'The Quays Shopping Centre Newry Northern Ireland 2022', 'robots': 'ALL', 'ogTitle': 'Home | The Quays Shopping Centre, Newry - Northern Ireland', 'ogDescription': 'Have It All - The Quays Shopping Centre, Newry. Gift Cards | Shops | Food | Cinema', 'ogUrl': 'thequays.co.uk', 'ogImage': '~/images/general/logo.png', 'ogLocaleAlternate': [], 'sourceURL': 'https://www.thequays.co.uk/'}
The worker did not crash. I am using a fresh checkout of main.
/crawl, on the other hand, does not work. Playwright times out or crashes, the scrape falls back to fetch, and after a long time it returns a very large payload.
playwright-service-1 | [2024-06-14 22:22:00 +0000] [9] [ERROR] Error in ASGI Framework
...
playwright-service-1 | playwright._impl._errors.TimeoutError: Timeout 15000ms exceeded.
Describe the Bug
When self-hosting, worker-1 exits with error code 137 (ELIFECYCLE) when crawling https://www.thequays.co.uk.
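For context on the exit code: by shell convention, a status above 128 means the process was terminated by a signal numbered `status - 128`, so 137 corresponds to signal 9 (SIGKILL). In container setups this is often the kernel's out-of-memory killer or a memory limit, although nothing in this report confirms that is the cause here. ELIFECYCLE is the generic error the Node package manager reports when a script exits with a non-zero status. The helper below is a small sketch for decoding such codes; `describe_exit_code` is a hypothetical name.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit status using the shell convention 128 + signal number."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with error status {code}"


print(describe_exit_code(137))  # on Linux: terminated by SIGKILL
```

If memory is the suspect, watching the worker container's memory use during the crawl would be one way to check.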
This issue does not occur for other URLs. firecrawl-py is used to make the API call to localhost, with the firecrawl.py file modified to remove the api_key requirement.
To Reproduce
Steps to reproduce the issue:
.env
# To turn on DB authentication, you need to set up supabase.
USE_DB_AUTHENTICATION=false
# ===== Optional ENVS ======
# Supabase Setup (used to support DB authentication, advanced logging, etc.)
SUPABASE_ANON_TOKEN=
SUPABASE_URL=
SUPABASE_SERVICE_TOKEN=
# Other Optionals
TEST_API_KEY= # use if you've set up authentication and want to test with a real API key
RATE_LIMIT_TEST_API_KEY_SCRAPE= # set if you'd like to test the scraping rate limit
RATE_LIMIT_TEST_API_KEY_CRAWL= # set if you'd like to test the crawling rate limit
SCRAPING_BEE_API_KEY= # set if you'd like to use ScrapingBee to handle JS blocking
OPENAI_API_KEY= # add for LLM dependent features (image alt generation, etc.)
BULL_AUTH_KEY= @
LOGTAIL_KEY= # use if you're configuring basic logging with logtail
LLAMAPARSE_API_KEY= # set if you have a llamaparse key you'd like to use to parse pdfs
SERPER_API_KEY= # set if you have a serper key you'd like to use as a search api
SLACK_WEBHOOK_URL= # set if you'd like to send slack server health status messages
POSTHOG_API_KEY= # set if you'd like to send posthog events like job logs
POSTHOG_HOST= # set if you'd like to send posthog events like job logs
STRIPE_PRICE_ID_STANDARD=
STRIPE_PRICE_ID_SCALE=
STRIPE_PRICE_ID_STARTER=
STRIPE_PRICE_ID_HOBBY=
STRIPE_PRICE_ID_HOBBY_YEARLY=
STRIPE_PRICE_ID_STANDARD_NEW=
STRIPE_PRICE_ID_STANDARD_NEW_YEARLY=
STRIPE_PRICE_ID_GROWTH=
STRIPE_PRICE_ID_GROWTH_YEARLY=
HYPERDX_API_KEY=
HDX_NODE_BETA_MODE=1
FIRE_ENGINE_BETA_URL= # set if you'd like to use the fire engine closed beta
# Proxy Settings for Playwright (Alternatively, you can use a proxy service like oxylabs, which rotates IPs for you on every request)
PROXY_SERVER=
PROXY_USERNAME=
PROXY_PASSWORD=
# set if you'd like to block media requests to save proxy bandwidth
BLOCK_MEDIA=
# Set this to the URL of your webhook when using the self-hosted version of FireCrawl
SELF_HOSTED_WEBHOOK_URL=
# Resend API Key for transactional emails
RESEND_API_KEY=
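With that configuration in place, the failing call can be sketched without firecrawl-py by posting directly to the self-hosted API. This is only a sketch under stated assumptions: the base URL `http://localhost:3002` and the `/v0/crawl` path are assumptions about the local setup and are not given in this report, and `build_crawl_request` and `send` are hypothetical helper names. No API key header is sent, matching USE_DB_AUTHENTICATION=false above.

```python
import json
import urllib.request


def build_crawl_request(base_url: str, target_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST request asking the API to crawl target_url."""
    payload = json.dumps({"url": target_url}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v0/crawl",  # assumed endpoint path
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())


# Assumed base URL for the self-hosted API; adjust to the local setup.
request = build_crawl_request("http://localhost:3002", "https://www.thequays.co.uk")
print(request.get_method(), request.full_url)
```

Calling `send(request)` against a running instance submits the crawl. Swapping in a different target URL gives a quick comparison with a site that does not trigger the crash.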
Expected Behavior
The crawler finishes and outputs the content JSON. worker-1 does not exit. The next API request can be processed.
Environment (please complete the following information):
Also note: the self-host Playwright path in .env.example needs to be changed to http://[::]:3000/html
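For readers unfamiliar with the `[::]` form in that URL: it is an IPv6 literal in the bracket notation that URLs require, and `::` is the IPv6 unspecified address. The sketch below uses only the Python standard library to show how such a URL splits into host, port, and path; `describe_service_url` is a hypothetical helper name, and whether connecting to `::` reaches the Playwright service depends on the host's network setup.

```python
import ipaddress
from urllib.parse import urlsplit


def describe_service_url(url: str) -> dict:
    """Split a service URL and report whether its host is an IP literal."""
    parts = urlsplit(url)
    host = parts.hostname  # brackets around an IPv6 literal are stripped here
    ip_version = None
    unspecified = False
    if host is not None:
        try:
            address = ipaddress.ip_address(host)
            ip_version = address.version
            unspecified = address.is_unspecified
        except ValueError:
            # Host is a DNS name rather than an IP literal.
            pass
    return {
        "host": host,
        "port": parts.port,
        "path": parts.path,
        "ip_version": ip_version,
        "unspecified": unspecified,
    }


info = describe_service_url("http://[::]:3000/html")
print(info)
```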