ulixee / hero

The web browser built for scraping
MIT License

.$detach() returns null when being used on image element #274

Open markfixgg opened 2 weeks ago

markfixgg commented 2 weeks ago

I am trying to extract an image element from a page to get it as base64. There are a few approaches that I see: 1) intercept the HTTP request using tab.on("resource"); 2) extract "src" and load the image by making a request; 3) extract the element from the page, draw it inside a canvas, and then use canvas.toDataURL to get base64.

I prefer the third approach, as it doesn't make extra calls and can be reused as many times as I want without affecting performance. I know for sure that .$detach() on an image element worked before, because I tested it and had successful results, but now it returns null instead of an ISuperElement.

Here is a snippet to replicate the issue:

import Hero from '@ulixee/hero-playground';

(async () => {
    const hero = new Hero({
        showChrome: true,
        userAgent: '~ chrome >= 105 && windows >= 10'
    });

    await hero.goto('https://nopecha.com/demo/recaptcha#hard');

    const iframeElement = await hero.querySelector('div[class="g-recaptcha"] iframe').$waitForVisible();

    const iframe = await hero.getFrameEnvironment(iframeElement);
    if (!iframe) return console.log('Iframe not loaded');

    const checkbox = await iframe.querySelector('span[class*="recaptcha-checkbox"][aria-checked="false"]').$waitForVisible();
    await checkbox.$click();

    await (async () => {
        const iframeElement = await hero.querySelector('iframe[title*="challenge"]').$waitForVisible();

        const iframe = await hero.getFrameEnvironment(iframeElement);
        if (!iframe) return console.log('Iframe not loaded');

        const image = await iframe.querySelector('img[class*="rc-image-tile"]').$waitForVisible();

        console.log(image); // => image is loaded and attributes such as "src" can be extracted
        console.log(await image.$detach()); // => returns null
    })()

    await new Promise((resolve) => setTimeout(resolve, 60000));
})();

blakebyrnes commented 2 weeks ago

It seems like detach is indeed broken here. You could look in the logs/session database to try to figure out if there's any kind of error shown.

However, this won't work in a detached DOM in any case. Canvas doesn't produce DOM changes, and we haven't yet built anything to record all the canvas changes that occur.

I think your best option is actually to use toDataURL() on the image itself in the page. Does that API not work?

markfixgg commented 2 weeks ago

An image element doesn't have a "toDataURL" method, if I'm not mistaken.

Regarding canvas: I use canvas on the Node.js side:

import { createCanvas } from 'canvas';

export const getBase64Image = async (image: ISuperElement) => {
    const canvas = createCanvas(Number(await image.width), Number(await image.height));

    const ctx = canvas.getContext("2d");
    ctx.drawImage(await image.$detach() as any, 0, 0);

    // Strip the "data:image/...;base64," prefix so only the base64 payload is returned
    return canvas.toDataURL("image/png").replace(/^data:image\/?[A-Za-z]*;base64,/, '');
}
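
For reference, the prefix-stripping step can be sketched in isolation. This is a minimal example, not part of Hero or the canvas package: the helper name stripDataUrlPrefix and the sample strings are hypothetical, and it assumes the MIME subtype contains only letters, digits, dots, plus signs, and hyphens.

```typescript
// Hypothetical helper: remove the "data:image/<subtype>;base64," prefix from a data URL.
// Strings that do not start with such a prefix are returned unchanged.
function stripDataUrlPrefix(dataUrl: string): string {
  return dataUrl.replace(/^data:image\/[A-Za-z0-9.+-]+;base64,/, '');
}

// Sample input: a data URL whose payload is the base64 of the 8-byte PNG signature.
const samplePngUrl = 'data:image/png;base64,iVBORw0KGgo=';
console.log(stripDataUrlPrefix(samplePngUrl)); // prints "iVBORw0KGgo="
```

Spelling the character class out as explicit ranges keeps punctuation such as "_" or "^" from matching by accident.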

blakebyrnes commented 2 weeks ago

Sorry, I confused myself on this one. The 1st option is the preferred approach if these are HTTP images (e.g., not page drawn), since it won't require any extra work. The backend is already loading the image, so this is just a step of sending it to the client. It will also exist in your session database if that's preferable. Is there a reason not to use the 1st?

blakebyrnes commented 2 weeks ago

I guess you want base64. The data will be a raw buffer, so you would just call toString('base64') on it on a modern version of Node.
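
The buffer-to-base64 step can be sketched as below. This is a minimal illustration under stated assumptions: the helper names bufferToBase64 and bufferToDataUrl are hypothetical, and the sample bytes stand in for an image buffer, since how the buffer is obtained from the captured resource is not shown in this thread.

```typescript
// Hypothetical helpers: encode raw image bytes (a Node.js Buffer) as base64.
function bufferToBase64(data: Buffer): string {
  return data.toString('base64');
}

// Same payload with a data-URL prefix; the MIME type is supplied by the caller.
function bufferToDataUrl(data: Buffer, mimeType: string): string {
  return `data:${mimeType};base64,${bufferToBase64(data)}`;
}

// Stand-in bytes (the 8-byte PNG signature) instead of a real captured image.
const pngSignature = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
console.log(bufferToBase64(pngSignature)); // prints "iVBORw0KGgo="
console.log(bufferToDataUrl(pngSignature, 'image/png'));
```

Because the bytes are already in memory, this avoids both the canvas round trip and a second HTTP request.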

markfixgg commented 2 weeks ago

This approach is also acceptable for me. I am using it right now, and it works perfectly as well. But I would like to leave this issue open, if you don't mind. Thank you for your reply. Whenever I have some free time, I will try to figure out why detach is not working on image elements, and maybe even try to contribute a fix for this issue.

mpopov commented 1 week ago

Also, .$detach() returns null when the Hero instance is created with the viewport option set. Without the viewport option it works fine.