simonw / google-drive-to-sqlite

Create a SQLite database containing metadata from Google Drive
https://datasette.io/tools/google-drive-to-sqlite
Apache License 2.0
153 stars 13 forks

Retry once (or more?) on any TransportError #18

Closed simonw closed 2 years ago

simonw commented 2 years ago

Got this exception:

  File "/Users/simon/Dropbox/Development/google-drive-to-sqlite/google_drive_to_sqlite/utils.py", line 79, in get
    response = httpx.get(url, params=params, headers=headers, timeout=self.timeout)
  File "/Users/simon/.local/share/virtualenvs/google-drive-to-sqlite-Wr1nXkpK/lib/python3.10/site-packages/httpx/_api.py", line 189, in get
    return request(
  File "/Users/simon/.local/share/virtualenvs/google-drive-to-sqlite-Wr1nXkpK/lib/python3.10/site-packages/httpx/_api.py", line 100, in request
    return client.request(
  File "/Users/simon/.local/share/virtualenvs/google-drive-to-sqlite-Wr1nXkpK/lib/python3.10/site-packages/httpx/_client.py", line 802, in request
    return self.send(request, auth=auth, follow_redirects=follow_redirects)
  File "/Users/simon/.local/share/virtualenvs/google-drive-to-sqlite-Wr1nXkpK/lib/python3.10/site-packages/httpx/_client.py", line 889, in send
    response = self._send_handling_auth(
  File "/Users/simon/.local/share/virtualenvs/google-drive-to-sqlite-Wr1nXkpK/lib/python3.10/site-packages/httpx/_client.py", line 917, in _send_handling_auth
    response = self._send_handling_redirects(
  File "/Users/simon/.local/share/virtualenvs/google-drive-to-sqlite-Wr1nXkpK/lib/python3.10/site-packages/httpx/_client.py", line 954, in _send_handling_redirects
    response = self._send_single_request(request)
  File "/Users/simon/.local/share/virtualenvs/google-drive-to-sqlite-Wr1nXkpK/lib/python3.10/site-packages/httpx/_client.py", line 990, in _send_single_request
    response = transport.handle_request(request)
  File "/Users/simon/.local/share/virtualenvs/google-drive-to-sqlite-Wr1nXkpK/lib/python3.10/site-packages/httpx/_transports/default.py", line 217, in handle_request
    with map_httpcore_exceptions():
  File "/Users/simon/.pyenv/versions/3.10.0/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/Users/simon/.local/share/virtualenvs/google-drive-to-sqlite-Wr1nXkpK/lib/python3.10/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.RemoteProtocolError: Server disconnected without sending a response.

Would be good to retry once if this happens.

simonw commented 2 years ago

https://www.python-httpx.org/exceptions/#the-exception-hierarchy shows the exception hierarchy:

image

simonw commented 2 years ago

I want to retry on any form of TransportError, I think. There's no point retrying a DecodingError or a TooManyRedirects error.

simonw commented 2 years ago

Sadly it looks like httpx itself has decided not to implement retry logic, so I need to build this myself.

simonw commented 2 years ago

While testing this I'm going to want to see if any transport errors have occurred - I think I'll add a -v/--verbose flag to the google-drive-to-sqlite files command.

simonw commented 2 years ago

I'm only going to retry GET, I won't retry POST.
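A minimal sketch of the retry behaviour described above, not the code that actually landed in the tool. The helper name `call_with_retry` and its parameters are hypothetical, and it is written against a generic tuple of exception classes so it runs without httpx installed:

```python
import sys
import time


def call_with_retry(func, retry_on, retries=1, delay=0.0, verbose=False):
    """Call func() and return its result.

    If func() raises one of the exception types in `retry_on`, call it
    again, up to `retries` extra times. Any other exception, or a failure
    on the final attempt, propagates to the caller unchanged.
    (Hypothetical helper for illustration only.)
    """
    attempt = 0
    while True:
        try:
            return func()
        except retry_on as exc:
            if attempt >= retries:
                # Out of retries: re-raise the last error.
                raise
            attempt += 1
            if verbose:
                # Report the error on stderr so stdout output stays clean.
                print(f"Retry {attempt}/{retries} after {exc!r}", file=sys.stderr)
            if delay:
                time.sleep(delay)
```

With httpx this could be used for GET requests only, along the lines of `call_with_retry(lambda: httpx.get(url, params=params, headers=headers, timeout=timeout), retry_on=(httpx.TransportError,))`, leaving POST calls unwrapped.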

simonw commented 2 years ago

Now manually testing this by running:

google-drive-to-sqlite files --folder 1E6Zg2X2bjjtPzVfX8YqdXZDCoB3AVA7i --nl --verbose > all-files.json-nl.txt

And keeping an eye on it while it runs with:

watch 'wc -l all-files.json-nl.txt && ls -lah all-files.json-nl.txt'

Started it running at 4:31pm.

simonw commented 2 years ago

It's at 37,223 lines and 49MB now, 25 minutes after starting.

simonw commented 2 years ago

That actually worked! It produced a 162MB file, with no errors.

simonw commented 2 years ago

Now running this to see what happens:

 time google-drive-to-sqlite files all-files.db --import-nl all-files.json-nl.txt
43.24s user 94.07s system 71% cpu 3:13.06 total

Produced an 80MB SQLite file, presumably thanks to the owners data being de-duplicated.
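To illustrate why de-duplicating owners could shrink the database, here is a rough sketch using only the standard library. It is an assumption about the general technique, not the tool's actual import code: the function name `import_files`, the `drive_files` table and the two-column schemas are made up for the example, and it assumes each JSON line carries an `owners` list whose entries have `permissionId` and `displayName` keys.

```python
import json
import sqlite3


def import_files(conn, lines):
    """Load newline-delimited JSON file records, storing each owner once.

    Hypothetical schema: owners live in drive_users keyed by permissionId,
    and each file row stores only that key instead of the full owner object.
    """
    conn.execute(
        "create table if not exists drive_users "
        "(permissionId text primary key, displayName text)"
    )
    conn.execute(
        "create table if not exists drive_files "
        "(id text primary key, name text, "
        "owner text references drive_users(permissionId))"
    )
    for line in lines:
        record = json.loads(line)
        owner = record["owners"][0]
        # "insert or ignore" skips owners that are already stored.
        conn.execute(
            "insert or ignore into drive_users values (?, ?)",
            (owner["permissionId"], owner["displayName"]),
        )
        conn.execute(
            "insert or replace into drive_files values (?, ?, ?)",
            (record["id"], record["name"], owner["permissionId"]),
        )
    conn.commit()
```

In this sketch an owner is stored once however many files they own, so the repeated owner objects in the JSON collapse to a short key per file.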

simonw commented 2 years ago

image

I'm suspicious of the 14,100 rows in the drive_users table.

simonw commented 2 years ago

Confirmed, something went very wrong there:

image

88 rows where permissionId is not null, 14,012 rows where permissionId is null.
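A check like the one above can be scripted with Python's built-in sqlite3 module. This is a small sketch rather than the query used in the screenshot; the function name `count_permission_ids` is hypothetical, and the `drive_users` table and `permissionId` column are taken from the comments above.

```python
import sqlite3


def count_permission_ids(conn):
    """Return (rows with a permissionId, rows where permissionId is null).

    count(permissionId) counts only non-null values, so subtracting it
    from count(*) gives the number of null rows.
    """
    not_null, null = conn.execute(
        "select count(permissionId), count(*) - count(permissionId) "
        "from drive_users"
    ).fetchone()
    return not_null, null
```

It could be run against the imported database with `count_permission_ids(sqlite3.connect("all-files.db"))`.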

simonw commented 2 years ago

Fixed that bug:

image