johnlindquist / kit

Script Kit. Automate Anything.
https://scriptkit.com
MIT License

Scraping a site returns no results when headless is true, but works when headless is false. #1219

Open Emiltayeb opened 1 year ago

Emiltayeb commented 1 year ago

Hey! Thanks for this lovely tool. I have an issue where running scrapeSelector in headless mode gives a different result than running it in non-headless mode.

const newReleases = await scrapeSelector("https://music.youtube.com/new_releases/albums", "#items", undefined, {
    headless: false,
    timeout: 10000,
})

dev(newReleases) // headless: false gives an array of results; headless: true gives an empty array

Running this with headless: false works, while headless: true does not. Any ideas?

johnlindquist commented 1 year ago

Hi @Emiltayeb, sorry, no ideas here.

scrapeSelector is a thin wrapper around Playwright: https://playwright.dev/

In fact, here's the source:

global.scrapeSelector = async (
  url: string,
  selector: string,
  xf?: (element: any) => any,
  { headless = true, timeout = 10000, browserOptions } = {
    headless: true,
    timeout: 10000,
  }
) => {
  /** @type typeof import("playwright") */
  const { chromium } = await global.npm("playwright")

  if (!xf) xf = el => el.innerText
  const browser = await chromium.launch({ headless })

  try {
    const context = await browser.newContext(browserOptions)
    const page = await context.newPage()
    page.setDefaultTimeout(timeout)

    if (!url.startsWith("http")) url = "https://" + url
    await page.goto(url)

    const locators = await page.locator(selector).all()
    const results = await Promise.all(
      locators.map(locator => locator.evaluate(xf))
    )
    return results
  } catch (ex) {
    throw ex
  } finally {
    await browser.close()
  }
}

So searching or asking around in the Playwright forums and issue tracker is where I would look.
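
One detail in the source above may be worth checking first, though nothing in this thread confirms it as the cause: Playwright's `locator.all()` does not wait for matching elements, it returns whatever is in the DOM at that moment. If the page fills `#items` after the load event, timing differences between headed and headless runs could plausibly yield an empty array. Below is a minimal sketch of waiting for the first match before collecting. `PageLike`, `LocatorLike`, and `collectAfterWait` are hypothetical names for illustration; only `locator`, `first`, `waitFor`, `all`, and `evaluate` correspond to Playwright calls.

```typescript
// Minimal structural types covering only the Playwright calls used below.
// These interfaces are illustrative, not Playwright's own type definitions.
interface LocatorLike {
  first(): LocatorLike
  waitFor(options?: { state?: "attached" | "visible"; timeout?: number }): Promise<void>
  all(): Promise<LocatorLike[]>
  evaluate(fn: (el: any) => any): Promise<any>
}

interface PageLike {
  locator(selector: string): LocatorLike
}

// Hypothetical helper: wait until at least one element matches, then collect.
async function collectAfterWait(
  page: PageLike,
  selector: string,
  xf: (el: any) => any = el => el.innerText,
  timeout = 10000
): Promise<any[]> {
  const locator = page.locator(selector)
  // waitFor rejects if nothing matches within `timeout`,
  // which is easier to diagnose than a silent empty array.
  await locator.first().waitFor({ state: "attached", timeout })
  const locators = await locator.all()
  return Promise.all(locators.map(l => l.evaluate(xf)))
}
```

In a real script the `page` argument would be the Playwright page created inside a custom copy of the wrapper; a cast may be needed depending on compiler strictness. If the wait times out only in headless mode, the element is probably never rendered there, which points away from timing and toward the site treating headless browsers differently.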

Emiltayeb commented 1 year ago

Thanks for the quick reply. I'll dig in and check it out.
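
One possible way to start digging, offered as a sketch and not a confirmed fix: sites sometimes respond differently to headless Chromium, whose default user agent can contain `HeadlessChrome`. Because the wrapper passes `browserOptions` to `browser.newContext`, a regular user agent and viewport can be supplied and the two modes compared. `ScrapeOptions`, `headlessOptions`, `compareModes`, and the user agent string are assumptions for illustration; in a Script Kit script the global `scrapeSelector` would be passed in as `scrape`.

```typescript
// Shape of the options object, inferred from the scrapeSelector source in this thread.
type ScrapeOptions = {
  headless?: boolean
  timeout?: number
  browserOptions?: {
    userAgent?: string
    viewport?: { width: number; height: number }
  }
}

type ScrapeSelector = (
  url: string,
  selector: string,
  xf?: (element: any) => any,
  options?: ScrapeOptions
) => Promise<any[]>

// Example desktop user agent (an assumed value; any current browser UA would do).
const DESKTOP_UA =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 " +
  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"

// Hypothetical helper: headless options that present a regular desktop browser.
function headlessOptions(timeout = 10000): ScrapeOptions {
  return {
    headless: true,
    timeout,
    browserOptions: {
      userAgent: DESKTOP_UA,
      viewport: { width: 1280, height: 800 },
    },
  }
}

// Hypothetical helper: run the same scrape in both modes and report result counts.
async function compareModes(scrape: ScrapeSelector, url: string, selector: string) {
  const headed = await scrape(url, selector, undefined, { headless: false, timeout: 10000 })
  const headless = await scrape(url, selector, undefined, headlessOptions())
  return { headed: headed.length, headless: headless.length }
}
```

Calling `compareModes(scrapeSelector, "https://music.youtube.com/new_releases/albums", "#items")` would return the two counts side by side. If they match once the user agent is overridden, the user agent was likely a factor; if the headless count is still zero, the cause lies elsewhere.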