mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
https://firecrawl.dev
GNU Affero General Public License v3.0

[Bug] timeout parameter not passed to playwright service #734

Closed mschfh closed 1 week ago

mschfh commented 2 weeks ago

Describe the Bug

The timeout parameter is not passed to playwright-service.

To Reproduce

Steps to reproduce the issue:

  1. Configure the docker-compose setup with the python-based microservice:
services:
  playwright-service:
    build: apps/playwright-service

.env:

PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/html
  2. Send an API request:

    {
        "url": "https://[removed]",
        "timeout": 60000,
        "waitFor": 30000,
        "formats": [
            "markdown"
        ]
    }

  3. Observe that the request sent to the microservice omits the timeout:

POST /html HTTP/1.1
Host: playwright-service:3000
[..]

{"url":"[removed]","wait_after_load":30000}

The service log then shows Page.goto failing at the default timeout of 15000ms rather than the requested 60000ms:

playwright-service-1  | playwright._impl._errors.TimeoutError: Page.goto: Timeout 15000ms exceeded.
playwright-service-1  | Call log:
playwright-service-1  | navigating to "https://[removed]", waiting until "load"

Expected Behavior

The timeout is passed to the playwright-service and used for Page.goto.

Additional Context

The service expects a timeout parameter in the body: https://github.com/mendableai/firecrawl/blob/a40fb3b062dfee4d1dd79c4c4946f2f418da32c7/apps/playwright-service/main.py#L91-L95

The playwright integration is not passing the parameter: https://github.com/mendableai/firecrawl/blob/a40fb3b062dfee4d1dd79c4c4946f2f418da32c7/apps/api/src/scraper/WebScraper/scrapers/playwright.ts#L38-L44

Suggested fix: have the integration include the timeout parameter in the request body it sends to the playwright-service.
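
A minimal sketch of what that could look like, not the actual code in playwright.ts. The field names `url`, `wait_after_load`, and `timeout` are taken from the request body and service description above; the `ScrapeOptions` and `PlaywrightServiceBody` types and the `buildPlaywrightBody` helper are hypothetical names introduced only for illustration.

```typescript
// Hypothetical shape of the options the API receives (names assumed for illustration).
interface ScrapeOptions {
  url: string;
  waitFor?: number; // milliseconds to wait after load
  timeout?: number; // navigation timeout in milliseconds
}

// Body posted to the playwright-service /html endpoint.
// `url` and `wait_after_load` appear in the captured request; `timeout` is the missing field.
interface PlaywrightServiceBody {
  url: string;
  wait_after_load: number;
  timeout?: number;
}

// Build the JSON body. `timeout` is included only when the caller supplied one,
// so the service can still fall back to its own default otherwise.
function buildPlaywrightBody(options: ScrapeOptions): PlaywrightServiceBody {
  const body: PlaywrightServiceBody = {
    url: options.url,
    wait_after_load: options.waitFor ?? 0,
  };
  if (options.timeout !== undefined) {
    body.timeout = options.timeout;
  }
  return body;
}

// Same values as the API request in the reproduction steps (placeholder URL).
const body = buildPlaywrightBody({
  url: "https://example.com",
  timeout: 60000,
  waitFor: 30000,
});
console.log(JSON.stringify(body));
// prints {"url":"https://example.com","wait_after_load":30000,"timeout":60000}
```

With a body like this, the microservice would receive `timeout` alongside `wait_after_load` and could use it for Page.goto instead of the 15000ms default.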

Harsh0707005 commented 1 week ago

Please assign me this issue.