lino-levan / astral

A high-level puppeteer/playwright-like library for Deno
https://jsr.io/@astral/astral
MIT License
215 stars 10 forks source link

ElementHandle fails to find node by id #75

Open pixeleet opened 3 months ago

pixeleet commented 3 months ago

Deno v.1.44.1 Astral 0.4.2

deno run -A --unstable playground/elementhandle_bug.ts
⚠️  The `--unstable` flag is deprecated and will be removed in Deno 2.0. Use granular `--unstable-*` flags instead.
Learn more at: https://docs.deno.com/runtime/manual/tools/unstable_flags
clicked
loaded
error: Uncaught (in promise) Error: Unable to get stable box model to click on
    if (!model) throw new Error("Unable to get stable box model to click on");
                      ^
    at ElementHandle.click (https://jsr.io/@astral/astral/0.4.2/src/element_handle.ts:180:23)
    at eventLoopTick (ext:core/01_core.js:168:7)
    at async file:///Users/DP/workspace/github.com/VojAI/etl/playground/elementhandle_bug.ts:32:7
    at async file:///Users/DP/workspace/github.com/VojAI/etl/playground/elementhandle_bug.ts:24:16

reproduction:

import { ElementHandle, launch } from "jsr:@astral/astral@0.4.2";

const browser = await launch({ headless: false });
const page = await browser.newPage();

await page.goto("https://www.google.com/search?true&source=lnms&tbm=isch&sa=X&tbs=isz:l&hl=en&q=Barceloneta%2C+Beach");

await page.waitForSelector('div[role="dialog"]', { timeout: 30_000 });

const [accept_cookies] = await page.$$("div[role='dialog'] button")
  .then((buttons) =>
    Promise.all(
      buttons.map((b, i) => b.innerHTML().then((it): [ElementHandle, string] => [buttons[i], it])),
    )
  )
  .then((it) => it.filter(([_, html]) => html.toLowerCase().includes("accept")))
  .then((it) => it.map(([el, _]) => el));

if (accept_cookies) {
  await accept_cookies?.click?.();
  await page.waitForNavigation({ waitUntil: "none" });
}

const images = await page.$$("g-img")
  .then((els) => Promise.all(els.map((el) => Promise.all([el, el.evaluate((it: HTMLElement) => !!it.className)]))))
  .then((it) => it.filter(([_, className]) => className))
  .then((it) => it.slice(0, 20))
  .then((it) => it.map(([el]) => el))
  .then(async (els) => {
    const imgs = [];
    for (const el of els) {
      await el.click();
      console.log("clicked");
      await page.waitForNavigation({ waitUntil: "none" });
      console.log("loaded");
      await page.waitForSelector('img[aria-hidden="false"]', { timeout: 30_000 });
      console.log("getting image");
      const image = await page.evaluate(() => {
        const img = document.querySelector('img[aria-hidden="false"]')! as HTMLImageElement;
        if (img) {
          return {
            title: img.closest("c-wiz")?.querySelector("h1")?.innerText!,
            url: img.src,
            source: (img.parentElement as HTMLLinkElement).href,
          };
        }
        throw new Error("Image not found");
      });
      console.log("got image", image);
      imgs.push(image);
    }
  });

console.log(images);

Happy to open a PR given some pointers what to fix.

lino-levan commented 3 months ago

Okay I sat down and investigated and I think this is sort of a logic issue with this code. Something that's a bit unfortunate about element handles the way they work right now is that if the page transforms, it's likely that those handles are completely different. Essentially, here when you call el.click(), the page reconfigures itself and previous references become invalid. There is a universe that with locators this wouldn't happen but the locator API isn't fully fleshed out yet so I won't recommend it for now.

The correct way of implementing this would actually be requesting the nth child of some parent div in a while loop. Something like this should work just fine:

import { ElementHandle, launch } from "jsr:@astral/astral@0.4.2";

const browser = await launch({ headless: false });
const page = await browser.newPage();

await page.goto("https://www.google.com/search?true&source=lnms&tbm=isch&sa=X&tbs=isz:l&hl=en&q=Barceloneta%2C+Beach");

await page.waitForSelector('div[role="dialog"]', { timeout: 30_000 });

const [accept_cookies] = await page.$$("div[role='dialog'] button")
  .then((buttons) =>
    Promise.all(
      buttons.map((b, i) => b.innerHTML().then((it): [ElementHandle, string] => [buttons[i], it])),
    )
  )
  .then((it) => it.filter(([_, html]) => html.toLowerCase().includes("accept")))
  .then((it) => it.map(([el, _]) => el));

if (accept_cookies) {
  await accept_cookies?.click?.();
  await page.waitForNavigation({ waitUntil: "none" });
}

const images = [];
let i = 0;
while(true) {
  const pageImages = await page.$$("g-img");
  const cur = pageImages[i++];
  const className = await cur.evaluate((it: HTMLElement) => !!it.className);
  if (!className) continue;
  await cur.click();
  await page.waitForNavigation({ waitUntil: "none" });
  await page.waitForSelector('img[aria-hidden="false"]', { timeout: 30_000 });
  const image = await page.evaluate(() => {
    const img = document.querySelector('img[aria-hidden="false"]')! as HTMLImageElement;
    if (img) {
      return {
        title: img.closest("c-wiz")?.querySelector("h1")?.innerText!,
        url: img.src,
        source: (img.parentElement as HTMLLinkElement).href,
      };
    }
    throw new Error("Image not found");
  });
  images.push(image);

  if(images.length === 20) {
    break;
  }
}

console.log(images);
pixeleet commented 3 months ago

Going to give it a try and let you know if this approach worked out, thank you so much for your response.

pixeleet commented 3 months ago

This code has essentially been moved from puppeteer to astral, so while this works, for this specific script, I bet we're breaking a lot of unwritten rules / expectations when element handles are lost in page transitions (if I understand correctly)

lino-levan commented 3 months ago

We're trying to move everything to locators (#2) ASAP, so hopefully this will not be a problem soon. Puppeteer has these unwritten rules but they do a lot of work to make them not show up too often. Unfortunately, when they do show up, it's basically impossible to debug due to the weird hacks they use.

TLDR; locators fix this elegantly, they're just not quite ready yet.