simonw / shot-scraper

A command-line utility for taking automated screenshots of websites
https://shot-scraper.datasette.io
Apache License 2.0
1.57k stars, 70 forks

Add a high level Python API #145

Open simonw opened 4 months ago

simonw commented 4 months ago

The NICAR workshop at https://github.com/dwillis/shot-scraper-nicar24 includes this code, from https://github.com/dwillis/shot-scraper-nicar24/blob/main/demo.py:

import json
import subprocess


def shotscraper_card(team, season):
    ncaa_id = team['ncaa_id']
    name = team['team']

    # JavaScript to be executed by shot-scraper
    javascript_code = """
    Array.from(document.querySelectorAll('.s-person-card__content'), el => {
        const id = '';
        const name = el.querySelector('.s-person-details__personal-single-line').innerText;
        const year = el.querySelectorAll('.s-person-details__bio-stats-item')[1].childNodes[1].wholeText.trim();
        let ht = el.querySelectorAll('.s-person-details__bio-stats-item')[2].childNodes[1].wholeText;
        const height = ht ? ht.trim() : '';
        const position = el.querySelectorAll('.s-person-details__bio-stats-item')[0].childNodes[1].textContent.trim()
        const hometown = el.querySelectorAll('.s-person-card__content__person__location-item')[0].childNodes[2].textContent.trim();
        let hs_el = el.querySelectorAll('.s-person-card__content__person__location-item')[1].childNodes[1].textContent;
        const high_school = hs_el ? hs_el.trim() : '';
        const previous_school = '';
        let j = el.querySelector('.s-stamp__text');
        const jersey = j ? j.innerText : '';
        const url = el.querySelector('a')['href']
        return {id, name, year, hometown, high_school, previous_school, height, position, jersey, url};
    })
    """

    url = team['url'] + "/roster/" + season
    # Execute shot-scraper with the given JavaScript; any error propagates to the caller
    result = subprocess.check_output(['shot-scraper', 'javascript', url, javascript_code, "--user-agent", "Firefox"])
    parsed_data = json.loads(result)

    # Tag every player record with the team and season it came from
    for player in parsed_data:
        player['team_id'] = ncaa_id
        player['team'] = name
        player['season'] = season

    return parsed_data

It shouldn't be necessary to use subprocess to do something this straightforward in shot-scraper. I'd like to support something like this instead:

import shot_scraper

result = shot_scraper.javascript(url, javascript_code, user_agent="Firefox")
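
Until something like that exists, the same call shape can be approximated by hiding the subprocess call behind a function. This is a minimal sketch, not the planned implementation: the `javascript` helper and its `run` parameter are hypothetical, and it relies only on the `shot-scraper javascript` command and the `--user-agent` option used in the demo above.

```python
import json
import subprocess


def javascript(url, javascript_code, user_agent=None, run=subprocess.check_output):
    """Hypothetical stand-in for the proposed shot_scraper.javascript().

    Shells out to the shot-scraper CLI and returns the JSON-decoded result.
    `run` is injectable so the function can be exercised without a browser.
    """
    command = ["shot-scraper", "javascript", url, javascript_code]
    if user_agent is not None:
        command += ["--user-agent", user_agent]
    # check_output returns bytes; json.loads accepts bytes directly
    return json.loads(run(command))
```

With this in place the workshop function would reduce to one call, `javascript(url, javascript_code, user_agent="Firefox")`, followed by the loop that tags each player.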

simonw commented 4 months ago

Might be better to provide a class, so you can instantiate once (loading up the headless browser) and then use it for multiple things.

Or... do that, but still have a shot_scraper.javascript(...) shortcut for quick one-off tasks.
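
One possible shape for that, sketched under assumptions: the `ShotScraper` class, its `launch_browser` parameter and the `_launch_chromium` helper are hypothetical names, and the browser is driven through Playwright's sync API. The browser is launched lazily on first use and reused by later calls, while the module-level `javascript()` covers the one-off case.

```python
class ShotScraper:
    """Hypothetical wrapper that launches the browser once and reuses it."""

    def __init__(self, launch_browser=None):
        # launch_browser returns (browser, cleanup); injectable for testing
        self._launch_browser = launch_browser or _launch_chromium
        self._browser = None
        self._cleanup = None

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()

    def _get_browser(self):
        # Start the headless browser only on the first call
        if self._browser is None:
            self._browser, self._cleanup = self._launch_browser()
        return self._browser

    def javascript(self, url, javascript_code, user_agent=None):
        # A fresh context per call keeps cookies and user agents isolated
        context = self._get_browser().new_context(user_agent=user_agent)
        try:
            page = context.new_page()
            page.goto(url)
            return page.evaluate(javascript_code)
        finally:
            context.close()

    def close(self):
        if self._cleanup is not None:
            self._cleanup()
            self._browser = self._cleanup = None


def _launch_chromium():
    # Imported lazily so the module loads even without Playwright installed
    from playwright.sync_api import sync_playwright

    playwright = sync_playwright().start()
    browser = playwright.chromium.launch()

    def cleanup():
        browser.close()
        playwright.stop()

    return browser, cleanup


def javascript(url, javascript_code, **kwargs):
    """One-off shortcut: start a browser, run the code, shut down again."""
    with ShotScraper() as scraper:
        return scraper.javascript(url, javascript_code, **kwargs)
```

The trade-off is the usual one: the class amortizes browser startup across many pages, while the shortcut pays that cost on every call in exchange for a one-line API.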

simonw commented 4 months ago

Initial rough API design:

shot_scraper.javascript(url, javascript_code) -> a JSON-decoded result

With keyword arguments for most of these:

Options:
  -i, --input FILENAME            Read input JavaScript from this file
  -a, --auth FILENAME             Path to JSON authentication context file
  -o, --output FILENAME           Save output JSON to this file
  -r, --raw                       Output JSON strings as raw text
  -b, --browser [chromium|firefox|webkit|chrome|chrome-beta]
                                  Which browser to use
  --browser-arg TEXT              Additional arguments to pass to the browser
  --user-agent TEXT               User-Agent header to use
  --reduced-motion                Emulate 'prefers-reduced-motion' media
                                  feature
  --log-console                   Write console.log() to stderr
  --fail                          Fail with an error code if a page returns an
                                  HTTP error
  --skip                          Skip pages that return HTTP errors
  --bypass-csp                    Bypass Content-Security-Policy
  --auth-password TEXT            Password for HTTP Basic authentication
  --auth-username TEXT            Username for HTTP Basic authentication
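
To illustrate how keyword arguments could line up with the options listed above, here is a rough sketch of the naming convention only. The `options_to_args` helper is hypothetical and not part of shot-scraper; it assumes underscores map to hyphens, that boolean options are bare flags, and that a list value means the option is repeated.

```python
def options_to_args(**options):
    """Translate keyword arguments into CLI-style arguments (hypothetical helper).

    user_agent="Firefox"  -> ["--user-agent", "Firefox"]
    reduced_motion=True   -> ["--reduced-motion"]
    None or False         -> omitted
    list or tuple         -> the option is repeated once per item (assumption)
    """
    args = []
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is None or value is False:
            continue
        if value is True:
            args.append(flag)
        elif isinstance(value, (list, tuple)):
            for item in value:
                args += [flag, str(item)]
        else:
            args += [flag, str(value)]
    return args


print(options_to_args(browser="firefox", user_agent="Firefox", reduced_motion=True, fail=False))
# ['--browser', 'firefox', '--user-agent', 'Firefox', '--reduced-motion']
```

A mapping in this direction is only useful while the Python API wraps the CLI; if the CLI ends up calling the Python API instead, the keyword names would be the primary interface and the flags derived from them.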

image_bytes = shot_scraper.shot(url)

With a TON of options, see https://shot-scraper.datasette.io/en/stable/screenshots.html#shot-scraper-shot-help


... etc

simonw commented 4 months ago

This is going to end up being a pretty big refactor, because I'll want the CLI tool to use the new Python API under the hood.
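
The layering described here might look roughly like the following. It is only a sketch: `argparse` is used to keep the example dependency-free and is not a claim about the real command-line layer, the `javascript` function is a placeholder for the library call, and the `api` and `out` parameters of `main` are hypothetical hooks for testing.

```python
import argparse
import json
import sys


def javascript(url, javascript_code, user_agent=None):
    """Placeholder for the library function; a real one would drive the browser."""
    raise NotImplementedError("sketch only")


def main(argv=None, api=javascript, out=sys.stdout):
    """Thin CLI layer: parse arguments, delegate to the Python API, print JSON."""
    parser = argparse.ArgumentParser(prog="shot-scraper-sketch")
    parser.add_argument("url")
    parser.add_argument("javascript")
    parser.add_argument("--user-agent")
    args = parser.parse_args(argv)

    # All real work happens in the library function, not in the CLI
    result = api(args.url, args.javascript, user_agent=args.user_agent)
    json.dump(result, out, indent=4)
    out.write("\n")
    return 0
```

Keeping the command this thin means anything the CLI can do is, by construction, available to Python callers as well.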

simonw commented 3 months ago

Prototyped this with Claude 3 Opus: https://gist.github.com/simonw/a43ee47f528c0d3dc894bb4ba38aa94a

davidbgk commented 2 months ago

Another use-case where I'd love to be able to call shot-scraper directly from Python.