Open simonw opened 6 months ago
Challenge: the current UI for that command is:
shot-scraper javascript $URL $JAVASCRIPT
How would passing multiple URLs work? It would be easier if JavaScript came first as then you could tag on multiple URLs as positional options, but that doesn't feel right against the current design.
Some options:
- Add a javascript-multi command, similar to how shot-scraper multi works in taking multiple screenshots at once
- Add a -m multi-option to the javascript command and teach it to process those URLs as well as the first one
  - This could be set up so that shot-scraper javascript $JAVASCRIPT -m $URL1 -m $URL2 works, because it treats that first argument as the JavaScript in the case where there is only one positional argument and at least one -m option
- Add shot-scraper javascript $JAVASCRIPT --urls $FILENAME, which takes URLs from a file (or - for standard input) rather than expecting them to be passed as -m options

I built a prototype of that second option:
diff --git a/shot_scraper/cli.py b/shot_scraper/cli.py
index 3f1245e..86fc7b4 100644
--- a/shot_scraper/cli.py
+++ b/shot_scraper/cli.py
@@ -653,6 +653,13 @@ def accessibility(
     is_flag=True,
     help="Output JSON strings as raw text",
 )
+@click.option(
+    "multis",
+    "-m",
+    "--multi",
+    help="Run same JavaScript against multiple pages",
+    multiple=True,
+)
 @browser_option
 @browser_args_option
 @user_agent_option
@@ -668,6 +675,7 @@ def javascript(
     auth,
     output,
     raw,
+    multis,
     browser,
     browser_args,
     user_agent,
@@ -704,9 +712,26 @@ def javascript(

     If a JavaScript error occurs an exit code of 1 will be returned.
     """
+    # Special case for --multi - if multis are provided but JavaScript
+    # positional option was not set, assume the first argument is JS
+    if multis and not javascript:
+        javascript = url
+        url = None
+
+    # If they didn't provide JavaScript, assume it's being piped in
     if not javascript:
         javascript = input.read()
-    url = url_or_file_path(url, _check_and_absolutize)
+
+    to_process = []
+    if url:
+        to_process.append(url_or_file_path(url, _check_and_absolutize))
+    to_process.extend(url_or_file_path(multi, _check_and_absolutize) for multi in multis)
+
+    results = []
+
+    if len(to_process) > 1 and not raw:
+        output.write("[\n")
+
     with sync_playwright() as p:
         context, browser_obj = _browser_context(
             p,
@@ -719,18 +744,28 @@ def javascript(
             auth_username=auth_username,
             auth_password=auth_password,
         )
-        page = context.new_page()
-        if log_console:
-            page.on("console", console_log)
-        response = page.goto(url)
-        skip_or_fail(response, skip, fail)
-        result = _evaluate_js(page, javascript)
+        for i, url in enumerate(to_process):
+            is_last = i == len(to_process) - 1
+            page = context.new_page()
+            if log_console:
+                page.on("console", console_log)
+            response = page.goto(url)
+            skip_or_fail(response, skip, fail)
+            result = _evaluate_js(page, javascript)
+            if raw:
+                output.write(str(result) + "\n")
+            else:
+                output.write(
+                    json.dumps(result, indent=4, default=str) + ("\n" if is_last else ",\n")
+                )
+
         browser_obj.close()
-    if raw:
-        output.write(str(result))
-        return
-    output.write(json.dumps(result, indent=4, default=str))
-    output.write("\n")
+
+    if len(to_process) > 1 and not raw:
+        output.write("]\n")
+
+    if len(results) == 1:
+        results = results[0]


 @cli.command()
Then used like this:
shot-scraper javascript "
async () => {
const readability = await import('https://cdn.skypack.dev/@mozilla/readability');
return (new readability.Readability(document)).parse();
}" \
-m https://simonwillison.net/2024/Mar/30/ocr-pdfs-images/ \
-m https://simonwillison.net/2024/Mar/26/llm-cmd/ \
-m https://simonwillison.net/2024/Mar/23/building-c-extensions-for-sqlite-with-chatgpt-code-interpreter/ \
-m https://simonwillison.net/2024/Mar/22/claude-and-chatgpt-case-study/ \
-m https://simonwillison.net/2024/Mar/16/weeknotes-the-aftermath-of-nicar/ | tee /tmp/all.json
It worked, but I'm not sure if the design is right - in particular it feels inconsistent with how shot-scraper multi works.
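One design detail worth weighing: the prototype writes the array brackets and commas by hand while looping, and its results list is never filled. A minimal alternative sketch, assuming the per-page results are collected into a list first, is shown here. write_results is a hypothetical helper and not part of shot-scraper.

```python
import io
import json


def write_results(results, output, raw=False):
    """Write collected per-URL results to a file-like object.

    Hypothetical helper for illustration; not part of shot-scraper.
    """
    if raw:
        # Raw mode: one str() of each result per line, no JSON encoding
        for result in results:
            output.write(str(result) + "\n")
        return
    # A single URL keeps the existing output shape (a bare value);
    # several URLs produce one JSON array
    payload = results[0] if len(results) == 1 else results
    output.write(json.dumps(payload, indent=4, default=str) + "\n")


out = io.StringIO()
write_results([{"title": "A"}, {"title": "B"}], out)
print(out.getvalue())
```

The trade-off is that nothing is written until every page has been evaluated, so a failure on a late URL loses the earlier results, whereas the hand-written brackets stream output as each page completes.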
Here are some ideas I have come across in other scraping tools:
url: https://example.com
urls: [https://example.com/page/{},1,243] # range through pages 1 to 243
urls: [...range(https://example.com/page/{},1,243)] # with an explicit range; some function would be needed
urls: ['https://example.com/', 'https://google.com', 'https://bing.com']
import urls from "./example_page_links.txt"
urls: urls.split("\n"),
Side note: going through all the research in the issues, it might be worth allowing shot-scraper to use a config file. That way, every argument you can pass on the command line could be kept neatly in a config file.
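The range and file-based urls ideas above could be expanded into plain URL lists before anything is passed to shot-scraper. This is only a sketch: expand_range and parse_url_lines are hypothetical names, and the inclusive 1 to 243 range is assumed from the comment in the example.

```python
def expand_range(template, start, stop):
    """Fill the {} placeholder with every integer from start to stop, inclusive."""
    return [template.format(n) for n in range(start, stop + 1)]


def parse_url_lines(text):
    """Split file contents into URLs, dropping blank lines and stray whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


pages = expand_range("https://example.com/page/{}", 1, 243)
print(len(pages), pages[0], pages[-1])
# prints: 243 https://example.com/page/1 https://example.com/page/243
```

Note that str.format would trip on URLs containing other literal braces, so a real implementation might need a stricter placeholder syntax.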
I found myself wanting to use the Readability trick against multiple URLs, without having to pay the startup cost of launching a new Chromium instance for each one.
Idea: a way to run shot-scraper javascript against more than one URL, returning an array of results.