simonw / shot-scraper

A command-line utility for taking automated screenshots of websites
https://shot-scraper.datasette.io
Apache License 2.0

Ability to run `shot-scraper javascript` against several URLs at once #148

Open simonw opened 6 months ago

simonw commented 6 months ago

I found myself wanting to use the Readability trick against multiple URLs, without having to pay the startup cost of launching a new Chromium instance for each one.

Idea: a way to run `shot-scraper javascript` against more than one URL, returning an array of results.
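
The core of the idea is just reusing a single Playwright browser across many pages. A rough sketch of that pattern (illustrative URLs and a trivial expression, not shot-scraper's actual code):

# Launch Chromium once, then evaluate the same JavaScript in a fresh
# page for each URL - paying the browser startup cost only once
from playwright.sync_api import sync_playwright

urls = ["https://example.com/", "https://example.org/"]  # placeholders

with sync_playwright() as p:
    browser = p.chromium.launch()
    context = browser.new_context()
    results = []
    for url in urls:
        page = context.new_page()
        page.goto(url)
        results.append(page.evaluate("document.title"))
    browser.close()

print(results)  # one result per URL, in order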

simonw commented 6 months ago

Challenge: the current UI for that command is:

shot-scraper javascript $URL $JAVASCRIPT

How would passing multiple URLs work? It would be easier if the JavaScript came first, since then you could tack on multiple URLs as positional arguments, but that doesn't feel right given the current design.

Some options:

simonw commented 6 months ago

I built a prototype of that second option - a repeatable `-m/--multi` option for passing extra URLs:

diff --git a/shot_scraper/cli.py b/shot_scraper/cli.py
index 3f1245e..86fc7b4 100644
--- a/shot_scraper/cli.py
+++ b/shot_scraper/cli.py
@@ -653,6 +653,13 @@ def accessibility(
     is_flag=True,
     help="Output JSON strings as raw text",
 )
+@click.option(
+    "multis",
+    "-m",
+    "--multi",
+    help="Run same JavaScript against multiple pages",
+    multiple=True,
+)
 @browser_option
 @browser_args_option
 @user_agent_option
@@ -668,6 +675,7 @@ def javascript(
     auth,
     output,
     raw,
+    multis,
     browser,
     browser_args,
     user_agent,
@@ -704,9 +712,24 @@

     If a JavaScript error occurs an exit code of 1 will be returned.
     """
+    # Special case for --multi - if multis are provided but the JavaScript
+    # positional argument was not set, assume the first argument is JS
+    if multis and not javascript:
+        javascript = url
+        url = None
+
+    # If they didn't provide JavaScript, assume it's being piped in
     if not javascript:
         javascript = input.read()
-    url = url_or_file_path(url, _check_and_absolutize)
+
+    to_process = []
+    if url:
+        to_process.append(url_or_file_path(url, _check_and_absolutize))
+    to_process.extend(url_or_file_path(multi, _check_and_absolutize) for multi in multis)
+
+    if len(to_process) > 1 and not raw:
+        output.write("[\n")
+
     with sync_playwright() as p:
         context, browser_obj = _browser_context(
             p,
@@ -719,18 +742,25 @@
             auth_username=auth_username,
             auth_password=auth_password,
         )
-        page = context.new_page()
-        if log_console:
-            page.on("console", console_log)
-        response = page.goto(url)
-        skip_or_fail(response, skip, fail)
-        result = _evaluate_js(page, javascript)
+        for i, url in enumerate(to_process):
+            is_last = i == len(to_process) - 1
+            page = context.new_page()
+            if log_console:
+                page.on("console", console_log)
+            response = page.goto(url)
+            skip_or_fail(response, skip, fail)
+            result = _evaluate_js(page, javascript)
+            if raw:
+                output.write(str(result) + "\n")
+            else:
+                output.write(
+                    json.dumps(result, indent=4, default=str) + ("\n" if is_last else ",\n")
+                )
+
         browser_obj.close()
-    if raw:
-        output.write(str(result))
-        return
-    output.write(json.dumps(result, indent=4, default=str))
-    output.write("\n")
+
+    if len(to_process) > 1 and not raw:
+        output.write("]\n")

 @cli.command()

Then used like this:

shot-scraper javascript "
async () => {
  const readability = await import('https://cdn.skypack.dev/@mozilla/readability');
  return (new readability.Readability(document)).parse();
}" \
-m https://simonwillison.net/2024/Mar/30/ocr-pdfs-images/ \
-m https://simonwillison.net/2024/Mar/26/llm-cmd/ \
-m https://simonwillison.net/2024/Mar/23/building-c-extensions-for-sqlite-with-chatgpt-code-interpreter/ \
-m https://simonwillison.net/2024/Mar/22/claude-and-chatgpt-case-study/ \
-m https://simonwillison.net/2024/Mar/16/weeknotes-the-aftermath-of-nicar/ | tee /tmp/all.json

It worked, but I'm not sure if the design is right - in particular it feels inconsistent with how `shot-scraper multi` works.
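
Since the non-raw output is a JSON array, the /tmp/all.json file written above can be post-processed directly. A quick sketch (assuming the usual Readability parse() result, which includes a "title" key):

import json

# /tmp/all.json holds a JSON array with one Readability result per URL
with open("/tmp/all.json") as f:
    articles = json.load(f)

for article in articles:
    print(article["title"])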

dynabler commented 5 months ago

Here are some ideas I have come across in other scraping tools:

url: https://example.com
urls: [https://example.com/page/{},1,243]  # range through pages 1 to 243
urls: [...range(https://example.com/page/{},1,243)]  # with an explicit range and some function needed
urls: ['https://example.com/', 'https://google.com', 'https://bing.com']

import urls from "./example_page_links.txt"
urls: urls.split("\n"),
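
As a sketch of what the range idea above might mean in practice, a hypothetical helper (not part of shot-scraper or any existing tool) could expand a {} pattern into concrete URLs:

# Hypothetical: expand a pattern like https://example.com/page/{}
# over pages 1 to 243, inclusive
def expand_range(pattern, start, stop):
    return [pattern.format(n) for n in range(start, stop + 1)]

urls = expand_range("https://example.com/page/{}", 1, 243)
print(len(urls), urls[0], urls[-1])
# 243 https://example.com/page/1 https://example.com/page/243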

Side note: going through all the research in the issues, perhaps it's an idea to allow shot-scraper to use a config file. That way, all the arguments you can pass on the command line can be put neatly in a config file.