simonw / shot-scraper

A command-line utility for taking automated screenshots of websites
https://shot-scraper.datasette.io
Apache License 2.0

Protocol error (Network.getResponseBody): No resource with given identifier found #89

Closed by simonw 2 years ago

simonw commented 2 years ago

Got this error when running:

shot-scraper https://lite.datasette.io/ --wait-for 'document.querySelector("h2")' --log-requests - | tee /tmp/datasette-lite.txt
Exception in callback SyncBase._sync.<locals>.callback(<Task finishe...ifier found')>) at /Users/simon/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/playwright/_impl/_sync_base.py:104
handle: <Handle SyncBase._sync.<locals>.callback(<Task finishe...ifier found')>) at /Users/simon/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/playwright/_impl/_sync_base.py:104>
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/Users/simon/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/playwright/_impl/_sync_base.py", line 105, in callback
    g_self.switch()
  File "/Users/simon/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/playwright/_impl/_browser_context.py", line 122, in <lambda>
    lambda params: self._on_response(
  File "/Users/simon/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/playwright/_impl/_browser_context.py", line 397, in _on_response
    page.emit(Page.Events.Response, response)
  File "/Users/simon/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/pyee/_base.py", line 113, in emit
    handled = self._call_handlers(event, args, kwargs)
  File "/Users/simon/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/pyee/_base.py", line 96, in _call_handlers
    self._emit_run(f, args, kwargs)
  File "/Users/simon/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/pyee/_asyncio.py", line 42, in _emit_run
    self.emit('error', exc)
  File "/Users/simon/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/pyee/_base.py", line 116, in emit
    self._emit_handle_potential_error(event, args[0] if args else None)
  File "/Users/simon/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/pyee/_base.py", line 86, in _emit_handle_potential_error
    raise error
  File "/Users/simon/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/pyee/_asyncio.py", line 40, in _emit_run
    coro = f(*args, **kwargs)
  File "/Users/simon/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/playwright/_impl/_impl_to_api_mapping.py", line 88, in wrapper_func
    return handler(
  File "/Users/simon/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/shot_scraper/cli.py", line 734, in on_response
    "size": len(response.body()),
  File "/Users/simon/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/playwright/sync_api/_generated.py", line 574, in body
    self._sync("response.body", self._impl_obj.body())
  File "/Users/simon/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/playwright/_impl/_sync_base.py", line 111, in _sync
    return task.result()
  File "/Users/simon/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/playwright/_impl/_network.py", line 375, in body
    binary = await self._channel.send("body")
  File "/Users/simon/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 39, in send
    return await self.inner_send(method, params, False)
  File "/Users/simon/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 63, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: Protocol error (Network.getResponseBody): No resource with given identifier found

This traceback was logged many times, even though the command itself ran to completion.

I think this is likely caused by the new --log-requests feature.

simonw commented 2 years ago

Probably this code - I think response.body() is breaking: https://github.com/simonw/shot-scraper/blob/bf34a76cb0a1d2e34ae4e47440d14682f3942513/shot_scraper/cli.py#L728-L734

simonw commented 2 years ago

Added this debugging code:

diff --git a/shot_scraper/cli.py b/shot_scraper/cli.py
index d104725..f5636a8 100644
--- a/shot_scraper/cli.py
+++ b/shot_scraper/cli.py
@@ -726,6 +726,12 @@ def take_shot(
         if log_requests:

             def on_response(response):
+                try:
+                    body = response.body()
+                except Exception as ex:
+                    print(ex)
+                    print(response.url)
+                    return
                 log_requests.write(
                     json.dumps(
                         {

And got this:

Protocol error (Network.getResponseBody): No resource with given identifier found
https://cdn.jsdelivr.net/pyodide/v0.20.0/full/pyodide.js
Protocol error (Network.getResponseBody): No resource with given identifier found
https://cdn.jsdelivr.net/pyodide/v0.20.0/full/pyodide.asm.js
...

simonw commented 2 years ago

https://github.com/puppeteer/puppeteer/issues/2258#issuecomment-380647459 says "resources get dumped after page commits navigation" - so presumably what's happening here is that a page navigation has occurred which clears those resources from memory before my Python code gets a chance to call .body() on them.

simonw commented 2 years ago

My hunch is that it's a lot harder to reliably access the size of the resource than I had expected.

simonw commented 2 years ago

I'll make a best-effort attempt to record the size, and return "size": null if the resource body size could not be calculated.

I'll mention this in the documentation.

simonw commented 2 years ago

This seems to do the right thing:

diff --git a/shot_scraper/cli.py b/shot_scraper/cli.py
index d104725..a19e878 100644
--- a/shot_scraper/cli.py
+++ b/shot_scraper/cli.py
@@ -726,12 +726,20 @@ def take_shot(
         if log_requests:

             def on_response(response):
+                try:
+                    body = response.body()
+                    size = len(body)
+                except Error as ex:
+                    if "Network.getResponseBody" in ex.message:
+                        size = None
+                    else:
+                        raise
                 log_requests.write(
                     json.dumps(
                         {
                             "method": response.request.method,
                             "url": response.url,
-                            "size": len(response.body()),
+                            "size": size,
                             "timing": response.request.timing,
                         }
                     )
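
The fallback in the diff above can be tried without launching a browser. This is only a minimal sketch: the `Error` class here is a stand-in for `playwright.sync_api.Error` (assumed to expose a `.message` attribute, as the diff relies on), and `body_size` and `FakeResponse` are hypothetical names that do not exist in shot-scraper.

```python
import json


class Error(Exception):
    """Stand-in for playwright.sync_api.Error so the sketch runs without Playwright."""

    def __init__(self, message):
        super().__init__(message)
        self.message = message


def body_size(response):
    """Return the body length, or None if the browser has already dropped the body."""
    try:
        return len(response.body())
    except Error as ex:
        # Only swallow the specific protocol error seen in this issue.
        if "Network.getResponseBody" in ex.message:
            return None
        raise


class FakeResponse:
    """Hypothetical test double exposing only the .body() method used above."""

    def __init__(self, body=None, error=None):
        self._body = body
        self._error = error

    def body(self):
        if self._error is not None:
            raise self._error
        return self._body


evicted = FakeResponse(
    error=Error(
        "Protocol error (Network.getResponseBody): "
        "No resource with given identifier found"
    )
)

print(json.dumps({"size": body_size(FakeResponse(body=b"hello"))}))  # {"size": 5}
print(json.dumps({"size": body_size(evicted)}))  # {"size": null}
```

Matching on the message text keeps any other Playwright error visible instead of silently turning it into a null size.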

simonw commented 2 years ago

Updated documentation: https://github.com/simonw/shot-scraper/blob/31bc975ff860b96c7533e98c270536b37f7d46e1/docs/screenshots.md#logging-all-requests
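
Anything that reads the log now has to cope with "size": null. Here is a rough sketch of a consumer, assuming the log holds one JSON object per line with the "url" and "size" keys shown in the diff above. The `summarize_request_log` helper and the sample records are made up for illustration.

```python
import io
import json


def summarize_request_log(lines):
    """Sum the known sizes and collect URLs whose size was logged as null."""
    known_bytes = 0
    unknown_urls = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        if record["size"] is None:
            unknown_urls.append(record["url"])
        else:
            known_bytes += record["size"]
    return known_bytes, unknown_urls


# Made-up sample data in the assumed newline-delimited JSON format.
sample = io.StringIO(
    '{"method": "GET", "url": "https://example.com/a.js", "size": 120, "timing": {}}\n'
    '{"method": "GET", "url": "https://example.com/b.js", "size": null, "timing": {}}\n'
)

total, missing = summarize_request_log(sample)
print(total)
print(missing)
```

With a real log, the function could be passed an open file object, for example the /tmp/datasette-lite.txt file written by the tee command at the top of this issue.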