Open pixeleet opened 5 months ago
Okay, I sat down and investigated, and I think this is a logic issue with this code. Something unfortunate about element handles as they work right now is that if the page re-renders, handles obtained earlier most likely no longer point at the elements currently on the page. Here, when you call el.click(), the page reconfigures itself and the previous references become invalid. With locators this wouldn't happen, but the locator API isn't fully fleshed out yet, so I won't recommend it for now.
The correct way to implement this is to request the nth child of some parent div on each iteration of a while loop, so that every handle is fetched fresh right before it is used. Something like this should work just fine:
import { ElementHandle, launch } from "jsr:@astral/astral@0.4.2";

const browser = await launch({ headless: false });
const page = await browser.newPage();
await page.goto(
  "https://www.google.com/search?true&source=lnms&tbm=isch&sa=X&tbs=isz:l&hl=en&q=Barceloneta%2C+Beach",
);

// Dismiss the cookie dialog: pick the first dialog button whose HTML mentions "accept".
await page.waitForSelector('div[role="dialog"]', { timeout: 30_000 });
const [accept_cookies] = await page.$$("div[role='dialog'] button")
  .then((buttons) =>
    Promise.all(
      buttons.map((b) => b.innerHTML().then((html): [ElementHandle, string] => [b, html])),
    )
  )
  .then((it) => it.filter(([_, html]) => html.toLowerCase().includes("accept")))
  .then((it) => it.map(([el, _]) => el));
if (accept_cookies) {
  await accept_cookies.click();
  await page.waitForNavigation({ waitUntil: "none" });
}

const images = [];
let i = 0;
while (images.length < 20) {
  // Re-query on every iteration: handles from before the last click are stale.
  const pageImages = await page.$$("g-img");
  const cur = pageImages[i++];
  if (!cur) break; // ran out of thumbnails before collecting 20 images
  // Skip elements that have no class name.
  const hasClassName = await cur.evaluate((it: HTMLElement) => !!it.className);
  if (!hasClassName) continue;
  await cur.click();
  await page.waitForNavigation({ waitUntil: "none" });
  await page.waitForSelector('img[aria-hidden="false"]', { timeout: 30_000 });
  const image = await page.evaluate(() => {
    const img = document.querySelector('img[aria-hidden="false"]') as HTMLImageElement | null;
    if (!img) throw new Error("Image not found");
    return {
      title: img.closest("c-wiz")?.querySelector("h1")?.innerText,
      url: img.src,
      source: (img.parentElement as HTMLAnchorElement).href,
    };
  });
  images.push(image);
}
console.log(images);
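
The re-query pattern in the script above can be looked at in isolation, without a browser. The following is only a minimal sketch and not part of Astral: `collectByIndex`, `makeFakePage`, and `FakeHandle` are hypothetical names, and the fake page merely imitates handles going stale after every click. The helper also stops once the index runs past the end of the list, so it cannot loop forever.

```typescript
// Hypothetical stand-in for an element handle: it remembers which
// "render" of the page it was obtained from.
interface FakeHandle {
  name: string;
  generation: number;
}

// Hypothetical fake page: every click re-renders it, which invalidates
// all handles obtained before that click.
function makeFakePage(names: string[]) {
  let generation = 0;
  return {
    query: async (): Promise<FakeHandle[]> =>
      names.map((name) => ({ name, generation })),
    click: async (handle: FakeHandle): Promise<string> => {
      if (handle.generation !== generation) {
        throw new Error(`stale handle: ${handle.name}`);
      }
      generation++;
      return handle.name;
    },
  };
}

// Hypothetical helper: visits items by index and calls `query` again on
// every iteration, so no item is held across a re-render.
async function collectByIndex<T, R>(
  query: () => Promise<T[]>,
  visit: (item: T) => Promise<R | undefined>,
  limit: number,
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; results.length < limit; i++) {
    const items = await query(); // fresh items on every pass
    const item = items[i];
    if (item === undefined) break; // index ran past the end of the list
    const result = await visit(item);
    if (result !== undefined) results.push(result); // undefined means "skip"
  }
  return results;
}
```

Under these assumptions, `collectByIndex(() => page.query(), (h) => page.click(h), 2)` clicks the first two items in order, whereas clicking two handles taken from a single `query()` call fails on the second one.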
I'm going to give it a try and will let you know whether this approach works out. Thank you so much for your response.
This code was essentially ported from Puppeteer to Astral. So while this works for this specific script, I bet we're breaking a lot of unwritten rules and expectations whenever element handles are lost in page transitions (if I understand correctly).
We're trying to move everything to locators (#2) ASAP, so hopefully this will soon stop being a problem. Puppeteer has these unwritten rules too, but they do a lot of work to keep them from showing up too often. Unfortunately, when they do show up, they're basically impossible to debug because of the weird hacks they use.
TL;DR: locators fix this elegantly, they're just not quite ready yet.
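
To illustrate why locators sidestep the stale-handle problem, here is a rough sketch of the general idea. It is not Astral's locator API (which, per the comment above, isn't ready yet): `LazyLocator`, `find`, and `run` are hypothetical names. The point is only that a locator keeps the instructions for finding an element rather than the element itself, and resolves them at the moment of each action.

```typescript
// Hypothetical sketch of the locator idea; NOT Astral's actual locator API.
class LazyLocator<T> {
  // How to find the target, e.g. a closure around a selector query.
  private readonly find: () => Promise<T | undefined>;

  constructor(find: () => Promise<T | undefined>) {
    this.find = find;
  }

  // Resolve the target right before use, then run the action on it.
  // Nothing is cached, so a re-render between two calls cannot leave
  // this object holding an outdated reference.
  async run<R>(action: (target: T) => Promise<R>): Promise<R> {
    const target = await this.find();
    if (target === undefined) throw new Error("locator matched nothing");
    return await action(target);
  }
}
```

A single `LazyLocator` can therefore be created once and reused across page transitions, at the cost of one extra lookup per action. Real locator implementations typically add waiting and retrying on top of this, which the sketch leaves out.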
Deno v1.44.1, Astral 0.4.2
Reproduction:
Happy to open a PR given some pointers on what to fix.