mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
https://firecrawl.dev
GNU Affero General Public License v3.0

Error Scraping PDF and Regular Website with Locally Hosted Firecrawl #411

Open nagendrakumar02 opened 1 month ago

nagendrakumar02 commented 1 month ago

I am currently facing some issues with our local web scraping setup and would appreciate your insights. Below are the details of the problems encountered.

URL: Electric Vehicle Charging Equipment Rebates PDF

Error in Console: The console displays multiple errors when attempting to process this PDF document, and the returned data does not contain markdown content.

Issue: The markdown content is not being returned.

Code Snippet:

firecrawl = FirecrawlApp(
    api_key="some key",
    api_url="http://localhost:3002/",
)
page_content = firecrawl.scrape_url(
    url="https://www.myelectric.coop/wp-content/uploads/Electric-Vehicle-Charging-Equipment-Rebates.pdf"
)
return page_content

URL: PEV Off-Peak Savers Program Rebate

Issue: Similar errors are being logged in the console for this URL. Attached is a screenshot illustrating the errors encountered in the console during the scraping process.

I would greatly appreciate any assistance or suggestions you could provide to resolve these issues. Thank you in advance for your help.
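For reference, the call for this second page uses the same pattern as the snippet above; the URL below is only an illustrative placeholder for the rebate page linked above, not the actual address:

page_content = firecrawl.scrape_url(
    url="https://example.coop/pev-off-peak-savers-program-rebate"  # placeholder URL for the rebate page
)
return page_content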


theodufort commented 1 month ago

Hey, just wondering, what's the env variable you used on the API side to set your custom api_key?

I'm trying to do this self-hosted too.

nagendrakumar02 commented 1 month ago

When you are self-hosting you don't need a key; you can set it to anything. Here is a copy of my .env file (a minimal client-side sketch follows it):

# ===== Required ENVS ======
NUM_WORKERS_PER_QUEUE=8
PORT=3002
HOST=0.0.0.0
REDIS_URL=redis://redis:6379
PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/html

## To turn on DB authentication, you need to set up supabase.
USE_DB_AUTHENTICATION=false

# ===== Optional ENVS ======

# Supabase Setup (used to support DB authentication, advanced logging, etc.)
SUPABASE_ANON_TOKEN=
SUPABASE_URL=
SUPABASE_SERVICE_TOKEN=

# Other Optionals
TEST_API_KEY= # use if you've set up authentication and want to test with a real API key
RATE_LIMIT_TEST_API_KEY_SCRAPE= # set if you'd like to test the scraping rate limit
RATE_LIMIT_TEST_API_KEY_CRAWL= # set if you'd like to test the crawling rate limit
SCRAPING_BEE_API_KEY= # set if you'd like to use ScrapingBee to handle JS blocking
OPENAI_API_KEY= # add for LLM-dependent features (image alt generation, etc.)
BULL_AUTH_KEY= @
LOGTAIL_KEY= # Use if you're configuring basic logging with logtail
LLAMAPARSE_API_KEY= # set if you have a LlamaParse key you'd like to use to parse PDFs
SERPER_API_KEY= # set if you have a Serper key you'd like to use as a search API
SLACK_WEBHOOK_URL= # set if you'd like to send slack server health status messages
POSTHOG_API_KEY= # set if you'd like to send posthog events like job logs
POSTHOG_HOST= # set if you'd like to send posthog events like job logs

STRIPE_PRICE_ID_STANDARD=
STRIPE_PRICE_ID_SCALE=
STRIPE_PRICE_ID_STARTER=
STRIPE_PRICE_ID_HOBBY=
STRIPE_PRICE_ID_HOBBY_YEARLY=
STRIPE_PRICE_ID_STANDARD_NEW=
STRIPE_PRICE_ID_STANDARD_NEW_YEARLY=
STRIPE_PRICE_ID_GROWTH=
STRIPE_PRICE_ID_GROWTH_YEARLY=

HYPERDX_API_KEY=
HDX_NODE_BETA_MODE=1

FIRE_ENGINE_BETA_URL= # set if you'd like to use the fire engine closed beta

# Proxy Settings for Playwright (alternatively, you can use a proxy service like Oxylabs, which rotates IPs for you on every request)
PROXY_SERVER=
PROXY_USERNAME=
PROXY_PASSWORD=
# set if you'd like to block media requests to save proxy bandwidth
BLOCK_MEDIA=

# Set this to the URL of your webhook when using the self-hosted version of FireCrawl
SELF_HOSTED_WEBHOOK_URL=

# Resend API Key for transactional emails
RESEND_API_KEY=
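As noted above, with USE_DB_AUTHENTICATION=false the self-hosted API does not check the key, so any non-empty value works. A minimal client-side sketch with the Python SDK (the key below is an arbitrary placeholder; adjust api_url if your instance runs elsewhere):

from firecrawl import FirecrawlApp

# Placeholder key: with USE_DB_AUTHENTICATION=false the self-hosted API
# accepts any value, but the SDK may still expect some value to be set.
firecrawl = FirecrawlApp(
    api_key="self-hosted-placeholder",
    api_url="http://localhost:3002",
)

page_content = firecrawl.scrape_url(
    url="https://www.myelectric.coop/wp-content/uploads/Electric-Vehicle-Charging-Equipment-Rebates.pdf"
)
print(page_content)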