Which returns this text: `Bad Signature`
Meanwhile on https://github.com/simonw/datasette-atom
Possibly relevant:
Those talk about timeouts though, which I think isn't the same thing as this signature problem.
I'm going to force re-build the README for `datasette-atom` to see if I can fix that one-off. Here's the logic that's causing it not to fetch a fresh copy: https://github.com/simonw/datasette.io/blob/55e8fb1e90c560f3f406afb7503fb7bf919ed058/build_directory.py#L178-L185
That did fix it:
https://datasette.io/plugins/datasette-atom
So I think the fix may be to change the logic about fetching README files so it fetches if the repo has been updated OR if the README was last fetched more than X days ago. Not sure what value to use for X.
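Something like this check, sketched against a hypothetical `readme_fetched_at` timestamp (that column and the seven-day cutoff are both placeholders, not anything in the codebase yet):

```python
from datetime import datetime, timedelta, timezone

README_MAX_AGE = timedelta(days=7)  # placeholder value for X


def should_fetch_readme(repo_updated_at, readme_fetched_at):
    # Both arguments are assumed to be timezone-aware datetimes
    if readme_fetched_at is None:
        # Never fetched this README before
        return True
    if repo_updated_at > readme_fetched_at:
        # The repo has changed since the last fetch
        return True
    # ...or the cached copy is older than the cutoff
    return datetime.now(timezone.utc) - readme_fetched_at > README_MAX_AGE
```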
Found some relevant documentation.
https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/about-anonymized-urls
> To host your images, GitHub uses the open-source project Camo. Camo generates an anonymous URL proxy for each file which hides your browser details and related information from other users. The URL starts `https://<subdomain>.githubusercontent.com/`, with different subdomains depending on how you uploaded the image.
https://github.blog/2014-01-28-proxying-user-images/
> A while back, we started proxying all non-https images to avoid mixed-content warnings using a custom node server called camo. We're making a small change today and proxying HTTPS images as well.
>
> Proxying these images will help protect your privacy: your browser information won't be leaked to other third party services.
https://github.com/atmos/camo was archived in April 2021.
Another clue:
```
curl -i 'https://camo.githubusercontent.com/42a02d342f39e3dc05195df216b9f89fa84de19d94547b4c68a02ec12f0eaf2d/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f762f72656c656173652f73696d6f6e772f6461746173657474652d61746f6d3f696e636c7564655f70726572656c6561736573266c6162656c3d6368616e67656c6f67'
HTTP/2 403
cache-control: no-cache, no-store, private, must-revalidate
content-security-policy: default-src 'none'; img-src data:; style-src 'unsafe-inline'
content-type: text/plain; charset=utf-8
server: github-camo (2d97ca31)
strict-transport-security: max-age=31536000; includeSubDomains
x-content-type-options: nosniff
x-frame-options: deny
x-xss-protection: 1; mode=block
x-github-request-id: F714:6D3F:387FC9:493449:65A6B820
accept-ranges: bytes
date: Tue, 16 Jan 2024 17:08:53 GMT
via: 1.1 varnish
x-served-by: cache-pao-kpao1770040-PAO
x-cache: MISS
x-cache-hits: 0
x-timer: S1705424933.064491,VS0,VE117
x-fastly-request-id: db6d3e2546b29d5a0aa986245ef0816997e820aa
timing-allow-origin: https://github.com
content-length: 14

Bad Signature
```
Note `server: github-camo (2d97ca31)`.
I think GitHub upgraded from that Node camo to this alternative written in Go at some point, because I found this in the source code for the Go one: https://github.com/cactus/go-camo/blob/11a57d9fcc47ad5f3de37b733a03f023edf4007c/pkg/camo/proxy.go#L102-L106
```go
sURL, ok := encoding.DecodeURL(p.config.HMACKey, sigHash, encodedURL)
if !ok {
	http.Error(w, "Bad Signature", http.StatusForbidden)
	return
}
```
Having seen that code, my hunch is that GitHub changed their `HMACKey` at some point, which broke my previously cached images.
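The camo URL format makes that plausible: the second path component is the target URL hex-encoded, and the first is an HMAC-SHA1 digest of that URL keyed on the server's secret (assuming GitHub's deployment matches the open-source camo/go-camo scheme). A quick sketch:

```python
import hashlib
import hmac

# The hex tail of the camo URL from the curl above decodes to the real image URL
encoded = (
    "68747470733a2f2f"                # https://
    "696d672e736869656c64732e696f"    # img.shields.io
    "2f676974687562"                  # /github
    "2f76"                            # /v
    "2f72656c65617365"                # /release
    "2f73696d6f6e77"                  # /simonw
    "2f6461746173657474652d61746f6d"  # /datasette-atom
    "3f696e636c7564655f70726572656c6561736573"  # ?include_prereleases
    "266c6162656c3d6368616e67656c6f67"          # &label=changelog
)
print(bytes.fromhex(encoded).decode())


# The signature component is derived like this, so rotating the key
# invalidates every previously generated camo URL:
def camo_signature(hmac_key: bytes, target_url: str) -> str:
    return hmac.new(hmac_key, target_url.encode(), hashlib.sha1).hexdigest()
```

So any cached HTML containing URLs signed under the old key would 403 forever.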
In the DB at the moment I store both the raw markdown and the rendered HTML: https://datasette.io/content/repos/209091256

That shows that using GitHub's Markdown API doesn't produce the `camo.githubusercontent.com` proxied URLs.
In my code I'm calling out to `github-to-sqlite`, which does this:
```python
def fetch_readme(token, full_name, html=False):
    headers = make_headers(token)
    if html:
        headers["accept"] = "application/vnd.github.VERSION.html"
    url = "https://api.github.com/repos/{}/readme".format(full_name)
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return None
    if html:
        return rewrite_readme_html(response.text)
    else:
        return base64.b64decode(response.json()["content"]).decode("utf-8")
```
Switching to rendering through the Markdown API would fix this by removing `camo.githubusercontent.com` entirely from how https://datasette.io/ works.
I could flip `readme_html=False` and then have a separate Markdown rendering step.
Probably easiest to cut `github-to-sqlite` out entirely here and instead do a fetch to https://docs.github.com/en/rest/repos/contents?apiVersion=2022-11-28#get-a-repository-readme to get the raw Markdown, then send that to the https://api.github.com/markdown endpoint to render it. https://til.simonwillison.net/markdown/github-markdown-api
Pieces I need to fix this:
```bash
curl 'https://api.github.com/repos/simonw/datasette/readme' \
  -H 'Accept: application/vnd.github.raw'
```
To get back the raw Markdown.
```python
import os

import httpx

# headers needs a GitHub token; assuming one in the GITHUB_TOKEN env var
headers = {"authorization": "bearer {}".format(os.environ["GITHUB_TOKEN"])}

body = """
# Example

This is example Markdown
"""
response = httpx.post(
    "https://api.github.com/markdown",
    json={
        # mode=gfm would expand #13 issue links, provided you pass
        # context=simonw/datasette too
        "mode": "markdown",
        "text": body,
    },
    headers=headers,
)
if response.status_code == 200:
    markdown_as_html = response.text
```
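Those two pieces could be combined into something like this sketch (the `fetch_rendered_readme` name is mine, not anything in the codebase yet):

```python
import httpx


def fetch_rendered_readme(full_name: str, token: str) -> tuple[str, str]:
    headers = {"authorization": "bearer {}".format(token)}
    # Fetch the raw Markdown for the repo's README
    markdown = httpx.get(
        "https://api.github.com/repos/{}/readme".format(full_name),
        headers={**headers, "accept": "application/vnd.github.raw"},
    ).text
    # Render it to HTML with the Markdown API - no camo URLs involved
    html = httpx.post(
        "https://api.github.com/markdown",
        json={"mode": "markdown", "text": markdown},
        headers=headers,
    ).text
    return markdown, html
```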
Looking again at this code:
```python
if repos_to_fetch_readme_for:
    print("Fetching README for {}".format(repos_to_fetch_readme_for))
    github_to_sqlite_repos.callback(
        db_filename,
        usernames=[],
        auth="auth.json",
        repo=repos_to_fetch_readme_for,
        load=None,
        readme=True,
        readme_html=True,
    )
```
Is it JUST fetching the READMEs, or is it also populating the database with other important information?
If it's pulling other key information too then I should leave that in there but rename `repos_to_fetch_readme_for` to `repos_to_fetch_details_for`, then set `readme=False` and `readme_html=False` and roll my own separate code to fetch and store the READMEs.
Here's the implementation of `repos()`: https://github.com/dogsheep/github-to-sqlite/blob/eaef8ffd3f46be6c26062237ed88b4c2202a1c44/github_to_sqlite/cli.py#L281-L312
```python
def repos(db_path, usernames, auth, repo, load, readme, readme_html):
    "Save repos owned by the specified (or authenticated) username or organization"
    db = sqlite_utils.Database(db_path)
    token = load_token(auth)
    if load:
        for loaded_repo in json.load(open(load)):
            utils.save_repo(db, loaded_repo)
    else:
        if repo:
            # Just these repos
            for full_name in repo:
                repo_id = utils.save_repo(db, utils.fetch_repo(full_name, token))
                _repo_readme(db, token, repo_id, full_name, readme, readme_html)
        else:
            if not usernames:
                usernames = [None]
            for username in usernames:
                for repo in utils.fetch_all_repos(username, token):
                    repo_id = utils.save_repo(db, repo)
                    _repo_readme(
                        db, token, repo_id, repo["full_name"], readme, readme_html
                    )
    utils.ensure_db_shape(db)


def _repo_readme(db, token, repo_id, full_name, readme, readme_html):
    if readme:
        readme = utils.fetch_readme(token, full_name)
        db["repos"].update(repo_id, {"readme": readme}, alter=True)
    if readme_html:
        readme_html = utils.fetch_readme(token, full_name, html=True)
        db["repos"].update(repo_id, {"readme_html": readme_html}, alter=True)
```
I'm going to assume there are other reasons to call it (it calls `utils.save_repo()` for each repo, not just the README fetches) and rename the variable to `repos_to_fetch`.
I'm also going to drop that `--force-fetch-readmes` flag because I don't know what I'd use it for.
Last question: how best to upgrade the existing records? I'm worried about rate limits, so I don't want to force-update READMEs for everything.
I think I'll add code which, every time the build script runs, picks the three oldest repos which have `camo.githubusercontent.com` somewhere in their `readme_html` (but not in their `readme`) and adds them to the `repos_to_fetch` pile.
Then in a few days time everything should be fixed.
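A minimal sketch of that selection, assuming the repos table has `full_name`, `readme`, `readme_html` and `updated_at` columns (my guess at the schema):

```python
import sqlite_utils

db = sqlite_utils.Database("content.db")  # the build script's database file

# Pick the three oldest repos whose cached HTML still contains camo URLs
repos_to_fix = [
    row["full_name"]
    for row in db.query(
        """
        select full_name from repos
        where readme_html like '%camo.githubusercontent.com%'
          and readme not like '%camo.githubusercontent.com%'
        order by updated_at
        limit 3
        """
    )
]
```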
Those build logs included:
```
2024-01-17T01:53:25.7033961Z Fixing HTML for ['simonw/csvs-to-sqlite', 'simonw/datasette-cluster-map', 'simonw/datasette-leaflet-geojson']
2024-01-17T01:53:25.7035207Z Fetching repo details for ['simonw/datasette-edit-templates', 'simonw/csvs-to-sqlite', 'simonw/datasette-cluster-map', 'simonw/datasette-leaflet-geojson']
2024-01-17T01:53:25.7036136Z Fetching README for simonw/datasette-edit-templates
2024-01-17T01:53:25.7036592Z Fetching README for simonw/csvs-to-sqlite
2024-01-17T01:53:25.7037182Z Fetching README for simonw/datasette-cluster-map
2024-01-17T01:53:25.7037686Z Fetching README for simonw/datasette-leaflet-geojson
```
But... while these pages have non-broken images, they still seem to be proxied through `camo.githubusercontent.com`: https://datasette.io/plugins/datasette-atom