simonw / shot-scraper

A command-line utility for taking automated screenshots of websites
https://shot-scraper.datasette.io
Apache License 2.0
1.7k stars 78 forks

--init-script support #147

Open simonw opened 7 months ago

simonw commented 7 months ago

Init scripts are JavaScript snippets that run to prime the page before any of the page's own scripts execute:

https://playwright.dev/python/docs/api/class-page#page-add-init-script

Adds a script which would be evaluated in one of the following scenarios:

  • Whenever the page is navigated.
  • Whenever the child frame is attached or navigated. In this case, the script is evaluated in the context of the newly attached frame.

The script is evaluated after the document was created but before any of its scripts were run. This is useful to amend the JavaScript environment, e.g. to seed Math.random.
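As a rough illustration of the Math.random case the Playwright docs mention, here is a minimal Python sketch. The helper names build_seeded_random_script and sample_random_values are hypothetical, the generator is a mulberry32-style PRNG chosen only for illustration, and the second function assumes Playwright and a Chromium build are installed.

```python
def build_seeded_random_script(seed: int) -> str:
    """Return JavaScript that replaces Math.random with a seeded generator."""
    if not 0 <= seed < 2**32:
        raise ValueError("seed must fit in an unsigned 32-bit integer")
    # Doubled braces are literal braces in the generated JavaScript.
    return f"""
(() => {{
  let state = {seed};
  Math.random = () => {{
    state = (state + 0x6D2B79F5) | 0;
    let t = Math.imul(state ^ (state >>> 15), 1 | state);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  }};
}})();
"""


def sample_random_values(url: str, seed: int, count: int = 3) -> list:
    """Load url with the seeded generator installed and read a few values."""
    # Imported lazily so the script builder above works without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Registered before navigation, so it runs before the page's own scripts.
        page.add_init_script(build_seeded_random_script(seed))
        page.goto(url)
        values = page.evaluate(
            "(n) => Array.from({length: n}, () => Math.random())", count
        )
        browser.close()
        return values
```

Two runs with the same seed should report the same values only if the page itself calls Math.random the same number of times before the values are read, so treat this as a starting point rather than a guarantee of determinism.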

This should be an option for the shot and javascript commands, and probably others.

simonw commented 7 months ago

One thing this can be useful for is taking screenshots of pages that detect and block headless Chrome. They seem to often do that by looking for navigator.webdriver.

https://www.news.com.au/ is an example:

shot-scraper https://www.news.com.au/ -h 600

[Screenshot: www-news-com-au]

But using the prototype from https://github.com/simonw/shot-scraper/commit/fae9babee52fc109c643501dd74cb9f75d18d19b and a tip from https://stackoverflow.com/a/75771301/6083:

shot-scraper https://www.news.com.au/ -h 600 \
  --init-script 'delete Object.getPrototypeOf(navigator).webdriver' \
  --user-agent 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:124.0) Gecko/20100101 Firefox/124.0'

[Screenshot: www-news-com-au]
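For anyone who wants to try the same trick directly against Playwright, a rough Python equivalent of the command above might look like this. It is a sketch, not how the shot-scraper prototype is implemented: context_options and take_screenshot are hypothetical names, and the 1280 pixel width is an assumed default that does not come from the command.

```python
INIT_SCRIPT = "delete Object.getPrototypeOf(navigator).webdriver"
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:124.0) "
    "Gecko/20100101 Firefox/124.0"
)


def context_options(height: int = 600, width: int = 1280) -> dict:
    """Keyword arguments for browser.new_context(); the width is an assumption."""
    return {
        "user_agent": USER_AGENT,
        "viewport": {"width": width, "height": height},
    }


def take_screenshot(url: str, output_path: str) -> None:
    """Screenshot url with the init script and user agent applied."""
    # Imported lazily so the constants above can be used without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context(**context_options())
        # Context-level init scripts apply to every page opened in the context.
        context.add_init_script(INIT_SCRIPT)
        page = context.new_page()
        page.goto(url)
        page.screenshot(path=output_path)
        browser.close()
```

Calling take_screenshot("https://www.news.com.au/", "www-news-com-au.png") would exercise the same init script and user agent as the shot-scraper invocation; whether the site still lets the request through is outside the sketch's control.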

simonw commented 7 months ago

Asked ChatGPT for more ideas of things to do with init scripts: https://chat.openai.com/share/71c5302f-bb92-4bd8-8eb3-311d855311b0

A few that I really liked:

browser_context.add_init_script("""
    Date.now = function() { return new Date('2024-01-01T00:00:00Z').getTime(); };
""")

browser_context.add_init_script("""
    const originalFetch = window.fetch;
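    window.fetch = async function(...args) {
        // args[0] may be a string, a URL, or a Request object
        const url = typeof args[0] === 'string' ? args[0] : (args[0].url || String(args[0]));
        if (url.includes('api.example.com')) {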
            return new Response(JSON.stringify({ mocked: true }), { status: 200 });
        }
        return originalFetch(...args);
    };
""")

browser_context.add_init_script("""
    localStorage.setItem('key', 'value');
    document.cookie = 'name=value; path=/';
""")
simonw commented 7 months ago

Claude 3 Opus suggested "Simulate a specific device":

   page.add_init_script("""
       Object.defineProperty(window, 'innerWidth', {
           writable: true,
           configurable: true,
           value: 375,
       });
       Object.defineProperty(window, 'innerHeight', {
           writable: true,
           configurable: true,
           value: 812,
       });
   """)