lino-levan / astral

A high-level puppeteer/playwright-like library for Deno
https://jsr.io/@astral/astral
MIT License
175 stars, 7 forks

Pierce iframes #77

Open liamdiprose opened 2 weeks ago

liamdiprose commented 2 weeks ago

I'm trying to scrape a page that contains an iframe, but I can't query any elements inside it. I would prefer that iframes be treated like every other element.

I've started to look at the underlying celestial API, and I found that #celestial.DOM.getDocument accepts a pierce option:

DOM = {
    // ...
    getDocument: async (opts: {
      /**
       * The maximum depth at which children should be retrieved, defaults to 1. Use -1 for the
       * entire subtree or provide an integer larger than 0.
       */
      depth?: number;
      /**
       * Whether or not iframes and shadow roots should be traversed when returning the subtree
       * (default is false).
       */
      pierce?: boolean;

https://github.com/lino-levan/astral/blob/main/bindings/celestial.ts#L15956

My attempt to get the full DOM:

  // Raw CDP bindings exposed by astral
  const celestial = page.unsafelyGetCelestialBindings();

  // Request the whole tree, including iframes and shadow roots
  const dom = await celestial.DOM.getDocument({
    depth: -1,
    pierce: true,
  });

  logger.info({ node_id: dom.root.nodeId }, "Got DOM Document");
  const root = new ElementHandle(dom.root.nodeId, celestial, page);

  // Times out: the element lives inside the iframe
  await root.waitForSelector("#ConfirmFee");

As noted, the waitForSelector throws a timeout and querySelector returns null when querying for an element inside the iframe.

Does anyone know how to select elements inside iframes?

lino-levan commented 2 weeks ago

Unfortunately, pierce doesn't work the way you hope it would. This could be something interesting to look into though as a default.
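One plausible reading of this: pierce only changes the tree that getDocument returns, while selector queries still stop at each document boundary. Under that assumption, the pierced tree can at least be searched by hand. The sketch below walks a tree shaped like the CDP DOM.Node type (children, shadowRoots, contentDocument, and a flat attributes array). CdpNode and findById are hypothetical names used for illustration; they are not part of astral.

```typescript
// Minimal subset of the CDP DOM.Node shape; only the fields used here.
interface CdpNode {
  nodeId: number;
  nodeName: string;
  attributes?: string[]; // flat list: [name, value, name, value, ...]
  children?: CdpNode[];
  contentDocument?: CdpNode; // present on frame owner elements
  shadowRoots?: CdpNode[];
}

// Hypothetical helper: depth-first search for the element with the given id,
// descending into iframe documents and shadow roots as well as children.
function findById(node: CdpNode, id: string): CdpNode | undefined {
  const attrs = node.attributes ?? [];
  for (let i = 0; i + 1 < attrs.length; i += 2) {
    if (attrs[i] === "id" && attrs[i + 1] === id) return node;
  }
  const next: CdpNode[] = [
    ...(node.children ?? []),
    ...(node.shadowRoots ?? []),
    ...(node.contentDocument ? [node.contentDocument] : []),
  ];
  for (const child of next) {
    const hit = findById(child, id);
    if (hit) return hit;
  }
  return undefined;
}
```

Calling findById(dom.root, "ConfirmFee") on the result of getDocument({ depth: -1, pierce: true }) would search a one-off snapshot, so unlike waitForSelector it does not wait for the element to appear.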

liamdiprose commented 2 weeks ago

Thanks for the heads-up. I had a look at the Playwright source, but learnt very little. This Stack Overflow answer helped me though.

Working celestial code:

// Fetch only the root document node
const doc = await celestial.DOM.getDocument({ depth: 0 });

// Locate the iframe element, then read its description to get the content document
const frameNode = await celestial.DOM.querySelector({ nodeId: doc.root.nodeId, selector: "iframe#ed-embedded-iframe" });
const frameDescription = await celestial.DOM.describeNode({ nodeId: frameNode.nodeId });
// Turn the content document's backendNodeId into a remote object, then into a nodeId
const frameContentRemoteObject = await celestial.DOM.resolveNode({ backendNodeId: frameDescription.node.contentDocument.backendNodeId });
const frameContentDoc = await celestial.DOM.requestNode({ objectId: frameContentRemoteObject.object.objectId });

// Wrap the iframe's document so the usual ElementHandle queries run inside it
const frame = new ElementHandle(frameContentDoc.nodeId, celestial, page);

await frame.waitForSelector("#ConfirmFee");

Apparently this doesn't work if the iframe's origin differs from the page's; security reasons, I guess.

It would be a seamless experience if we found a way to query into iframes transparently, but we would have to determine whether a query targets an iframe's contents, then rewrite it to run inside the iframe. That may cause more pain in the long run.

However, if an ElementHandle is on an iframe, then we can forward the querySelector calls to the 'Frame' without too much complexity.
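A minimal sketch of that forwarding idea, reusing the same four DOM calls as the working celestial code above. DomBindings, queryRootFor and querySelectorForwarding are hypothetical names for illustration, not existing astral APIs. The interface is a hand-written subset of the CDP DOM domain so that the example runs against a stub. The sketch also assumes that a failed querySelector is reported as nodeId 0.

```typescript
// Hypothetical minimal view of the bindings: only the calls used below.
interface DomBindings {
  querySelector(o: { nodeId: number; selector: string }): Promise<{ nodeId: number }>;
  describeNode(o: { nodeId: number }): Promise<{
    node: { nodeName: string; contentDocument?: { backendNodeId: number } };
  }>;
  resolveNode(o: { backendNodeId: number }): Promise<{ object: { objectId?: string } }>;
  requestNode(o: { objectId: string }): Promise<{ nodeId: number }>;
}

// If nodeId is an iframe whose content document is reachable, return the
// nodeId of that document; otherwise return nodeId unchanged.
async function queryRootFor(dom: DomBindings, nodeId: number): Promise<number> {
  const { node } = await dom.describeNode({ nodeId });
  if (node.nodeName.toUpperCase() !== "IFRAME" || !node.contentDocument) {
    return nodeId;
  }
  const { object } = await dom.resolveNode({
    backendNodeId: node.contentDocument.backendNodeId,
  });
  if (!object.objectId) return nodeId;
  const doc = await dom.requestNode({ objectId: object.objectId });
  return doc.nodeId;
}

// Run the selector inside the iframe's document when the handle is an iframe,
// and on the node itself otherwise. Returns null when nothing matches
// (assumption: no match is reported as nodeId 0).
async function querySelectorForwarding(
  dom: DomBindings,
  nodeId: number,
  selector: string,
): Promise<number | null> {
  const root = await queryRootFor(dom, nodeId);
  const { nodeId: found } = await dom.querySelector({ nodeId: root, selector });
  return found === 0 ? null : found;
}
```

In this shape, only handles that point at an iframe pay for the extra round trips, and queries on ordinary elements behave as before. A cross-origin iframe whose description has no contentDocument falls back to querying the iframe element itself.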

I'd be interested in helping with this project. Are you accepting PRs?

Cheers, Liam

lino-levan commented 1 week ago

Always happy to accept PRs. I'm curious about how you're thinking of approaching this.