simonw / shot-scraper

A command-line utility for taking automated screenshots of websites
https://shot-scraper.datasette.io
Apache License 2.0

shot-scraper pdf -i #92

honzajde closed this issue 2 years ago

honzajde commented 2 years ago

I think the pdf command should support interactive mode (-i) as well. It does not right now.

By the way, could we also have an HTML mode?

simonw commented 2 years ago

What would HTML mode do?

You can run shot-scraper against an HTML file on disk already, like this:

shot-scraper example.html

simonw commented 2 years ago

Unfortunately it doesn't look like it's possible to provide interactive mode for PDF printing.

I tried this prototype:

diff --git a/shot_scraper/cli.py b/shot_scraper/cli.py
index 04e4ef5..b5157b5 100644
--- a/shot_scraper/cli.py
+++ b/shot_scraper/cli.py
@@ -550,6 +550,12 @@ def javascript(
     type=click.FloatRange(min=0.1, max=2.0),
     help="Scale of the webpage rendering",
 )
+@click.option(
+    "-i",
+    "--interactive",
+    is_flag=True,
+    help="Interact with the page in a browser before taking the shot",
+)
 @click.option("--print-background", is_flag=True, help="Print background graphics")
 def pdf(
     url,
@@ -563,6 +569,7 @@ def pdf(
     width,
     height,
     scale,
+    interactive,
     print_background,
 ):
     """
@@ -584,13 +591,22 @@ def pdf(
     if output is None:
         output = filename_for_url(url, ext="pdf", file_exists=os.path.exists)
     with sync_playwright() as p:
-        context, browser_obj = _browser_context(p, auth)
-        page = context.new_page()
-        page.goto(url)
-        if wait:
-            time.sleep(wait / 1000)
-        if javascript:
-            _evaluate_js(page, javascript)
+        context, browser_obj = _browser_context(p, auth, interactive=interactive)
+        if interactive:
+            page = context.new_page()
+            page.goto(url)
+            context = page
+            click.echo(
+                "Hit <enter> to take the shot and close the browser window:", err=True
+            )
+            input()
+        else:
+            page = context.new_page()
+            page.goto(url)
+            if wait:
+                time.sleep(wait / 1000)
+            if javascript:
+                _evaluate_js(page, javascript)

         kwargs = {
             "landscape": landscape,

But when I run it I get this error:

% shot-scraper pdf -i simonwillison.net
Hit <enter> to take the shot and close the browser window:

Traceback (most recent call last):
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/bin/shot-scraper", line 33, in <module>
    sys.exit(load_entry_point('shot-scraper', 'console_scripts', 'shot-scraper')())
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/Users/simon/Dropbox/Development/shot-scraper/shot_scraper/cli.py", line 625, in pdf
    pdf = page.pdf(**kwargs)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/sync_api/_generated.py", line 9274, in pdf
    self._sync(
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_sync_base.py", line 111, in _sync
    return task.result()
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_page.py", line 869, in pdf
    encoded_binary = await self._channel.send("pdf", params)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 39, in send
    return await self.inner_send(method, params, False)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 63, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: Protocol error (Page.printToPDF): Printing is not available

It looks like the problem is that saving to PDF is only available in headless mode, while interactive mode has to open a visible browser window: https://stackoverflow.com/a/70937997/6083

PDF creation is only supported in headless mode.
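
A minimal sketch of how a tool could turn that protocol error into a clearer message. `render_pdf` and `PrintingUnavailable` are hypothetical names, not part of shot-scraper or Playwright, and matching on the error text from the traceback above is an assumption that may not hold across browser versions:

```python
class PrintingUnavailable(RuntimeError):
    """Hypothetical error: PDF output was attempted in a headed browser."""


def render_pdf(page, **kwargs):
    """Call page.pdf(**kwargs) and translate the headed-mode protocol error.

    `page` is expected to be a Playwright Page, but any object with a
    compatible pdf() method will work.
    """
    try:
        return page.pdf(**kwargs)
    except Exception as ex:
        # Assumption: the browser keeps using this wording for the error.
        if "Printing is not available" in str(ex):
            raise PrintingUnavailable(
                "PDF output needs a headless browser, so it cannot be "
                "combined with -i/--interactive"
            ) from ex
        # Anything else is re-raised unchanged.
        raise
```

Catching the error at the call site keeps the original exception attached through `from ex`, so the full Playwright traceback is still available for debugging.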

honzajde commented 2 years ago

My bad! Now we both know it...

By HTML mode I meant the ability to save the HTML of the scraped page.

simonw commented 2 years ago

You could do that using shot-scraper javascript like this:

shot-scraper javascript datasette.io 'document.body.innerHTML' | jq -r > page.html

The | jq -r bit is needed because shot-scraper javascript outputs its result as JSON, so without it you get back a quoted string with newlines escaped as \n and suchlike. Piping through jq -r decodes that into a regular string, which you can then save to a file.
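
For anyone without jq installed, here is a rough Python equivalent of that decoding step. `decode_json_string` is a hypothetical helper name, and the sample HTML is made up for illustration:

```python
import json


def decode_json_string(raw):
    """Decode one JSON-encoded string, similar to `jq -r` on a string value."""
    value = json.loads(raw)
    if not isinstance(value, str):
        # jq -r would print other JSON types as JSON; this sketch just refuses.
        raise TypeError("expected a JSON string, got " + type(value).__name__)
    return value


# A JSON-encoded string like the one the javascript command would output:
# wrapped in double quotes, with the newline escaped as \n.
encoded = json.dumps("<h1>Title</h1>\n<p>Body</p>")
print(encoded)
print(decode_json_string(encoded))
```

The first print shows the escaped, quoted form on a single line. The second shows the decoded HTML with a real line break.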

simonw commented 2 years ago

Actually that only gets everything inside <body> - if you want <html> and downwards this seems to do the trick:

shot-scraper javascript datasette.io 'document.body.parentElement.outerHTML' | jq -r > page.html

Given how non-obvious this is, I wonder if it deserves its own dedicated feature.

simonw commented 2 years ago

Actually this is better:

shot-scraper javascript datasette.io 'document.documentElement.outerHTML' | jq -r

simonw commented 2 years ago

I realized that pattern doesn't give you the doctype.

If you want the doctype, there's a Playwright API that can do it: https://playwright.dev/python/docs/api/class-page#page-content

page.content()

Added in: v1.8

Gets the full HTML contents of the page, including the doctype.
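
A minimal sketch of using that API from Python, assuming Playwright and its Chromium build are installed. `save_full_html` and `main` are hypothetical helper names, and the default URL and output path are only illustrations:

```python
from pathlib import Path


def save_full_html(page, output_path):
    """Write page.content() (doctype included) to output_path and return it."""
    html = page.content()
    Path(output_path).write_text(html, encoding="utf-8")
    return html


def main(url="https://datasette.io/", output_path="page.html"):
    # Imported here so save_full_html can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()  # headless by default
        page = browser.new_page()
        page.goto(url)
        save_full_html(page, output_path)
        browser.close()


if __name__ == "__main__":
    main()
```

Unlike the document.documentElement.outerHTML approach, this needs no JSON decoding step, because page.content() returns a plain Python string.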

This has convinced me that shot-scraper html would be worth adding! I'll open a new issue for that.

simonw commented 2 years ago

I built that feature - documentation is here: https://shot-scraper.datasette.io/en/latest/html.html

honzajde commented 2 years ago

Thank you so much!