ulixee / hero

The web browser built for scraping
MIT License

best practices for custom injection #238

Closed andynuss closed 5 months ago

andynuss commented 9 months ago

Hi,

Previously I worked with secret-agent and worked out how to port a Puppeteer scraping script of my own. That script's focus is on capture needs that require a single injection of my own large script into the main page and, if needed, a few iframes. In other words, I was using secret-agent as a sort of replacement for Playwright, and I may wish to do the same with Hero while still gaining the advantage of not being easily fingerprinted.

That is, both on my testing machine and on my own EC2 instance, I wish to set up my own simple Node.js HTTP server process, which responds to my RESTful scraping requests by launching a Hero instance that scrapes using the Chrome installed on that machine.
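To make the setup described above concrete, here is a minimal sketch of such a server, using only Node's built-in `http` module. The `/scrape?url=...` route, the `Scraper` type, and the `scrapeWithHero` name in the usage comment are all hypothetical; the Hero-specific work is assumed to live behind that one function.

```typescript
import * as http from 'node:http';

// Hypothetical scraper signature: takes a target URL, resolves with the scrape result.
type Scraper = (url: string) => Promise<unknown>;

// Returns the normalized target URL for a valid GET /scrape?url=<target>, else null.
function parseScrapeRequest(method: string | undefined, rawUrl: string | undefined): string | null {
  if (method !== 'GET') return null;
  const parsed = new URL(rawUrl ?? '/', 'http://localhost');
  if (parsed.pathname !== '/scrape') return null;
  const target = parsed.searchParams.get('url');
  if (!target) return null;
  try {
    const targetUrl = new URL(target);
    const isHttp = targetUrl.protocol === 'http:' || targetUrl.protocol === 'https:';
    return isHttp ? targetUrl.href : null;
  } catch {
    return null; // not a parseable absolute URL
  }
}

// Builds a request listener that delegates valid requests to the scraper.
function createScrapeHandler(scrape: Scraper): http.RequestListener {
  return async (req, res) => {
    const target = parseScrapeRequest(req.method, req.url);
    if (target === null) {
      res.writeHead(400, { 'content-type': 'application/json' });
      res.end(JSON.stringify({ error: 'expected GET /scrape?url=<http(s) url>' }));
      return;
    }
    try {
      const result = await scrape(target);
      res.writeHead(200, { 'content-type': 'application/json' });
      res.end(JSON.stringify({ result }));
    } catch (err) {
      res.writeHead(500, { 'content-type': 'application/json' });
      res.end(JSON.stringify({ error: String(err) }));
    }
  };
}

// Usage (scrapeWithHero is a hypothetical function wrapping the Hero session):
// http.createServer(createScrapeHandler(scrapeWithHero)).listen(8080);
```

Keeping the request parsing separate from the scraper makes it possible to test the routing without launching a browser.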

I wish to port to Hero and probably have to relearn some basic concepts:

Andy

BTW: here is my first sketch:

import Hero from '@ulixee/hero';

async function main() {

  // Create a new Hero browser session
  const hero = new Hero({
    // Q: this would defeat Hero's help in this area, but it would be
    // nice to provide a range!
    viewport: {
      width: 1000,
      height: 750,
      // The screen must be at least as large as the viewport
      screenWidth: 1024,
      screenHeight: 768,
    },
  });

  // Go to webpage.  The resource result includes { url, request, response }
  // and can be used to verify status = 200
  const rsrc = await hero.goto('https://example.com');
  const tab = hero.activeTab;

  // Q: is there something better?  What about a bit of user emulation?
  await tab.waitForPaintingStable();

  // Q: in both frame.executeJs and tab.executeJs (below), there does not
  // seem to be any direct usage of ClientPlugin, CorePlugin,
  // ConnectionToHeroCore, Tab, Session, etc?  Would I ever need to
  // use any of these, such as for getting font resources requested?
  const frames = await tab.frameEnvironments;
  for (const frame of frames) {

    const { frameId, url, parentFrameId, name, document, isDomContentLoaded, isMainFrame } = frame;
    const result = await frame.executeJs(() => {
      return 'hello from frame';
    });

    // ... I may choose to go 1 level deeper
    const children = await frame.children;
  }

  const result = await tab.executeJs(() => {
    return 'hello from main window';
  });

  // Q: should I take the screenshot here?

  // Close Hero instance
  await hero.close();
}

main().catch(console.error);

blakebyrnes commented 9 months ago

do I want to be importing from '@ulixee/hero'?

Yes, for the client library.

am I to be using Hero Core as a means of controlling the injected javascript?

You want to use the execute-js plugin. I believe it's the same thing as you have here, and the same as what you used in SecretAgent.

will I still be able to check which font resources were actually loaded by the page?

Sure. Same as SecretAgent.
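As a rough illustration of the font check, the sketch below filters a list of captured resources down to the fonts. The `ResourceLike` shape, its `type` label, and the `'Font'` value are assumptions for illustration rather than a confirmed Hero type, so the extension check serves as a fallback.

```typescript
// Assumed minimal shape of a captured resource (hypothetical, for illustration).
interface ResourceLike {
  url: string;
  type: string;
}

const FONT_EXTENSIONS = ['.woff', '.woff2', '.ttf', '.otf', '.eot'];

// True when the resource is labelled as a font or its path ends in a font extension.
function isFontResource(resource: ResourceLike): boolean {
  if (resource.type.toLowerCase() === 'font') return true;
  let pathname: string;
  try {
    pathname = new URL(resource.url).pathname.toLowerCase();
  } catch {
    return false; // not an absolute URL, so no extension to inspect
  }
  return FONT_EXTENSIONS.some(ext => pathname.endsWith(ext));
}

// Returns the distinct URLs of font resources, in first-seen order.
function loadedFontUrls(resources: ResourceLike[]): string[] {
  return Array.from(new Set(resources.filter(isFontResource).map(r => r.url)));
}
```

The resource list itself would come from whatever the session records while the page loads; only the filtering step is shown here.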

when I create a new instance of Hero, if I give any viewport preferences, can I do so without defeating Hero's browser emulation?

This doesn't matter for emulation.
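Since the viewport does not affect emulation, varying it is purely a convenience. If a range is still wanted, as raised in the sketch, one option is to pick the size yourself before constructing the session. The helper names and the injectable `random` parameter below are hypothetical, and the result uses the same field names as the `viewport` option in the sketch.

```typescript
interface Range {
  min: number;
  max: number;
}

interface Viewport {
  width: number;
  height: number;
  screenWidth: number;
  screenHeight: number;
}

// Picks an integer in [min, max] using the supplied random source.
function pickInRange(range: Range, random: () => number): number {
  return range.min + Math.floor(random() * (range.max - range.min + 1));
}

// Picks a viewport within the given ranges, clamped so it never exceeds the screen.
function pickViewport(
  width: Range,
  height: Range,
  screen: { width: number; height: number },
  random: () => number = Math.random,
): Viewport {
  return {
    width: Math.min(pickInRange(width, random), screen.width),
    height: Math.min(pickInRange(height, random), screen.height),
    screenWidth: screen.width,
    screenHeight: screen.height,
  };
}

// Usage: const viewport = pickViewport({ min: 900, max: 1000 }, { min: 700, max: 750 }, { width: 1024, height: 768 });
```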

this is to have some reasonable control over the size of the screen capture

Makes sense. You could also probably shrink the image that comes back using some kind of image library.
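If the screenshot is shrunk afterwards, the target size has to preserve the aspect ratio. A small sketch of that calculation follows; the `fitWithin` helper is hypothetical, and no particular image library is assumed, since the resulting dimensions can be handed to whichever one is used.

```typescript
interface Size {
  width: number;
  height: number;
}

// Scales `source` down (never up) so it fits inside maxWidth x maxHeight,
// preserving the aspect ratio.
function fitWithin(source: Size, maxWidth: number, maxHeight: number): Size {
  const scale = Math.min(1, maxWidth / source.width, maxHeight / source.height);
  return {
    width: Math.max(1, Math.round(source.width * scale)),
    height: Math.max(1, Math.round(source.height * scale)),
  };
}

// Usage: const target = fitWithin({ width: 2000, height: 1500 }, 1000, 750);
```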

can I let Hero control emulated scrolling before, as part of the hero.waitForXXX() call, or would it be after?

You probably want to wait for some stability, but I often just wait for a particular element to be on the page.
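Waiting for a particular element can be sketched as a generic polling loop. This is not a Hero API; `waitUntil` and its `check` callback are hypothetical, and the callback is assumed to wrap whatever test reports the element as present (for example, an injected function that runs a `querySelector`).

```typescript
// Polls `check` until it resolves true or the timeout elapses.
// Resolves true on success and false on timeout.
async function waitUntil(
  check: () => Promise<boolean>,
  timeoutMs = 10000,
  intervalMs = 250,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (await check()) return true;
    // Give up if the next poll would land after the deadline.
    if (Date.now() + intervalMs > deadline) return false;
    await new Promise<void>(resolve => setTimeout(resolve, intervalMs));
  }
}

// Usage (the '#ready' selector is hypothetical; tab.executeJs is used as in the sketch above):
// const found = await waitUntil(() =>
//   tab.executeJs(() => document.querySelector('#ready') !== null));
```

Returning a boolean rather than throwing leaves it to the caller to decide whether a missing element is an error or just a reason to capture the page as it is.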