ulixee / hero

The web browser built for scraping
MIT License
855 stars 43 forks source link

feat(client): waitForPageState(states, options) #26

Closed blakebyrnes closed 3 years ago

blakebyrnes commented 3 years ago

tab.waitForPageState(states, options) {#wait-for-page-state}

Wait for one of many possible states to be loaded. This feature allows you to detect which page has been loaded based on a series of states, each containing one or more assertions that test for a particular page or template.

Why?

A common scraping challenge is to identify "which" page has loaded - which template? Is it a captcha? Is it the desired url? Am I redirecting through ad networks?

Hero normally operates by running each command in order. This makes it a challenge to check for many possible states concurrently. waitForPageState runs the underlying commands for all state assertions in one round-trip to Chrome. This client-side can then synchronously evaluate a series of assertions to determine the currently loaded Page State.

Each Page State receives assertion functions (assert and assertAny) as parameters. The assert function can be run one or more times - if all assertions return true, the given state will be resolved as the active Page State.

The assert(statePromise, assertionCallbackFn) function takes in a state "Promise" and will call the provided "assertion" callback function each time new state is available. Each assertionCallbackFn simply returns true/false if it's given a valid state.

NOTE: Null access exceptions are ignored, so you don't need to worry about individual assertion data being present in the assertion callbacks.

 const state = await hero.waitForPageState({
  ulixee({ assert }) {
    assert(hero.url, url => url === 'https://ulixee.org'); // once the url is resolved, it will be checked against https://ulixee.org
    assert(hero.isPaintingStable); // the default function evaluates if the result is true
  },
  dlf({ assert }) {
    assert(hero.url, 'https://dataliberationfoundation.org'); // a value will be tested for equality
    assert(hero.isPaintingStable);
    assert(hero.document.querySelector('h1').textContent, text => text === "It's Time to Open the Data Economy");
  }
 });

Advanced:

The provided assertAny function passed into each assertion has a function that will allow you to match on some portion of assertions assertAny(minimumValid: number, assertions: assert()[]).

This can be useful if you know of a few cases that might be true, but some cases are flaky. For instance, if a loading indicator might be present, or a modal window, and you'd like to treat those as one state.

 const state = await hero.waitForPageState({
  ulixee({ assert, assertAny }) {
    assertAny(2, [
       assert(hero.document.querySelectorAll('[placeholder="Search datasets..."]').length, x => x === 3),
       assert(hero.document.querySelectorAll('.terms > li').length, x => x > 4),
       assert(hero.url, 'https://ulixee.org'),
    ]);
  },
  dlf({ assert }) {
    ...
  }
 });

Arguments:

Returns: Promise<StateName>