Closed by honzajde 2 years ago
What would HTML mode do?
You can already run shot-scraper against an HTML file on disk, like this:
shot-scraper example.html
Unfortunately it doesn't look like it's possible to provide interactive mode for PDF printing.
I tried this prototype:
diff --git a/shot_scraper/cli.py b/shot_scraper/cli.py
index 04e4ef5..b5157b5 100644
--- a/shot_scraper/cli.py
+++ b/shot_scraper/cli.py
@@ -550,6 +550,12 @@ def javascript(
     type=click.FloatRange(min=0.1, max=2.0),
     help="Scale of the webpage rendering",
 )
+@click.option(
+    "-i",
+    "--interactive",
+    is_flag=True,
+    help="Interact with the page in a browser before taking the shot",
+)
 @click.option("--print-background", is_flag=True, help="Print background graphics")
 def pdf(
     url,
@@ -563,6 +569,7 @@ def pdf(
     width,
     height,
     scale,
+    interactive,
     print_background,
 ):
     """
@@ -584,13 +591,22 @@ def pdf(
     if output is None:
         output = filename_for_url(url, ext="pdf", file_exists=os.path.exists)
     with sync_playwright() as p:
-        context, browser_obj = _browser_context(p, auth)
-        page = context.new_page()
-        page.goto(url)
-        if wait:
-            time.sleep(wait / 1000)
-        if javascript:
-            _evaluate_js(page, javascript)
+        context, browser_obj = _browser_context(p, auth, interactive=interactive)
+        if interactive:
+            page = context.new_page()
+            page.goto(url)
+            context = page
+            click.echo(
+                "Hit <enter> to take the shot and close the browser window:", err=True
+            )
+            input()
+        else:
+            page = context.new_page()
+            page.goto(url)
+            if wait:
+                time.sleep(wait / 1000)
+            if javascript:
+                _evaluate_js(page, javascript)
         kwargs = {
             "landscape": landscape,
But when I run it I get this error:
% shot-scraper pdf -i simonwillison.net
Hit <enter> to take the shot and close the browser window:
Traceback (most recent call last):
File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/bin/shot-scraper", line 33, in <module>
sys.exit(load_entry_point('shot-scraper', 'console_scripts', 'shot-scraper')())
File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/Users/simon/Dropbox/Development/shot-scraper/shot_scraper/cli.py", line 625, in pdf
pdf = page.pdf(**kwargs)
File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/sync_api/_generated.py", line 9274, in pdf
self._sync(
File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_sync_base.py", line 111, in _sync
return task.result()
File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_page.py", line 869, in pdf
encoded_binary = await self._channel.send("pdf", params)
File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 39, in send
return await self.inner_send(method, params, False)
File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 63, in inner_send
result = next(iter(done)).result()
playwright._impl._api_types.Error: Protocol error (Page.printToPDF): Printing is not available
It looks like the problem is that save to PDF is only available in headless mode: https://stackoverflow.com/a/70937997/6083
PDF creation is only supported in headless mode.
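The constraint could be caught before the browser launches, rather than surfacing as a protocol error. A minimal sketch, assuming interactive mode always means a visible (non-headless) browser window; headless_for is a hypothetical helper and not something shot-scraper provides:

```python
def headless_for(command: str, interactive: bool) -> bool:
    """Return the headless setting to launch the browser with.

    Hypothetical helper, not part of shot-scraper. It encodes the one
    constraint from this thread: PDF output needs a headless browser.
    """
    if command == "pdf" and interactive:
        # Assumption: interactive implies a headed browser, which cannot print.
        raise ValueError("PDF creation is only supported in headless mode")
    return not interactive


if __name__ == "__main__":
    print(headless_for("pdf", interactive=False))   # headless, PDF is fine
    print(headless_for("shot", interactive=True))   # headed, screenshot is fine
```

Failing early like this would give a readable message instead of the long traceback above.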
My bad! Now we both know it...
By HTML mode I meant the possibility of saving the HTML of the scraped page.
You could do that using shot-scraper javascript like this:
shot-scraper javascript datasette.io 'document.body.innerHTML' | jq -r > page.html
The | jq -r bit is because without it you get back a JSON-encoded string, with newlines converted to \n and suchlike. Piping through jq -r turns that into a regular string which you can then save to a file.
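For anyone without jq installed, the same decoding step can be sketched in Python. This assumes the command prints a single JSON-encoded string on standard output; the helper name decode_json_string and the sample value are made up for illustration.

```python
import json


def decode_json_string(raw: str) -> str:
    """Decode one JSON string literal into plain text, roughly what `jq -r` does."""
    value = json.loads(raw)
    if not isinstance(value, str):
        raise TypeError("expected a JSON string, got " + type(value).__name__)
    return value


# Illustrative sample: a JSON-encoded HTML fragment as it might appear on stdout,
# with the newline and the inner quotes still escaped.
encoded = '"<h1>Title</h1>\\n<p>Some \\"quoted\\" text</p>"'
decoded = decode_json_string(encoded)
print(decoded)
```

The decoded value contains a real newline and unescaped quotes, so it can be written straight to page.html.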
Actually that only gets everything inside <body> - if you want <html> and downwards this seems to do the trick:
shot-scraper javascript datasette.io 'document.body.parentElement.outerHTML' | jq -r > page.html
Given how non-obvious this is I wonder if it does deserve having its own special feature?
Actually this is better:
shot-scraper javascript datasette.io 'document.documentElement.outerHTML' | jq -r
I realized that pattern doesn't give you the doctype.
If you want the doctype, there's a Playwright API that can do it: https://playwright.dev/python/docs/api/class-page#page-content
page.content()
Added in: v1.8
Gets the full HTML contents of the page, including the doctype.
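Calling that API directly might look like the sketch below. It uses Playwright's sync API (sync_playwright, chromium.launch, new_page, goto, content); the function names save_page_html and fetch_and_save are hypothetical, and this is not necessarily how shot-scraper itself does it.

```python
from pathlib import Path


def save_page_html(page, path: str) -> int:
    """Write page.content() (doctype included) to path; return characters written."""
    html = page.content()
    Path(path).write_text(html, encoding="utf-8")
    return len(html)


def fetch_and_save(url: str, path: str) -> int:
    """Load url in headless Chromium and save its full HTML to path."""
    # Imported here so save_page_html can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()  # headless by default
        try:
            page = browser.new_page()
            page.goto(url)
            return save_page_html(page, path)
        finally:
            browser.close()
```

With Playwright and its Chromium build installed, fetch_and_save("https://datasette.io/", "page.html") should produce a file that starts with the doctype, unlike the outerHTML approach.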
This has convinced me that shot-scraper html would be worth adding! I'll open a new issue for that.
I built that feature - documentation is here: https://shot-scraper.datasette.io/en/latest/html.html
Thank you so much!
I think it should support interactive mode for PDF as well. It does not right now.
By the way, can we by any chance have an HTML mode?