simonw / datasette.io

The official project website for Datasette
https://datasette.io

Broken images on some plugin pages #156

Open simonw opened 8 months ago

simonw commented 8 months ago

https://datasette.io/plugins/datasette-atom

CleanShot 2024-01-16 at 08 37 55@2x

simonw commented 8 months ago

PyPI Changelog link is https://camo.githubusercontent.com/42a02d342f39e3dc05195df216b9f89fa84de19d94547b4c68a02ec12f0eaf2d/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f762f72656c656173652f73696d6f6e772f6461746173657474652d61746f6d3f696e636c7564655f70726572656c6561736573266c6162656c3d6368616e67656c6f67

Which returns this text: Bad Signature
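The long second path segment of a camo URL is just the target URL hex-encoded (the first segment is the HMAC signature), so it can be decoded to see which image was being proxied:

```python
# The camo URL path is /<hmac-signature>/<hex-encoded-target-url>;
# decoding the hex segment reveals the proxied image URL.
hex_part = (
    "68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f762f"
    "72656c656173652f73696d6f6e772f6461746173657474652d61746f6d3f"
    "696e636c7564655f70726572656c6561736573266c6162656c3d6368616e67656c6f67"
)
print(bytes.fromhex(hex_part).decode("utf-8"))
# https://img.shields.io/github/v/release/simonw/datasette-atom?include_prereleases&label=changelog
```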

Meanwhile on https://github.com/simonw/datasette-atom

CleanShot 2024-01-16 at 08 38 45@2x

simonw commented 8 months ago

Possibly relevant:

Those talk about timeouts though, which I think isn't the same thing as this signature problem.

simonw commented 8 months ago

I'm going to force re-build the README for datasette-atom to see if I can fix that one-off. Here's the logic that's causing it not to fetch a fresh copy: https://github.com/simonw/datasette.io/blob/55e8fb1e90c560f3f406afb7503fb7bf919ed058/build_directory.py#L178-L185

simonw commented 8 months ago

That did fix it:

https://datasette.io/plugins/datasette-atom

CleanShot 2024-01-16 at 09 03 09@2x

So I think the fix may be to change the logic about fetching README files so it fetches if the repo has been updated OR if the README was last fetched more than X days ago. Not sure what value to use for X.
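A minimal sketch of that revised freshness check (the function name and the 30-day value for X are my own choices, not anything from the codebase):

```python
from datetime import datetime, timedelta, timezone

README_MAX_AGE = timedelta(days=30)  # the "X days" value; 30 is arbitrary

def should_fetch_readme(repo_updated_at, readme_fetched_at, now=None):
    """Fetch if the repo changed since the last README fetch,
    or if the cached README is older than README_MAX_AGE."""
    now = now or datetime.now(timezone.utc)
    if readme_fetched_at is None:
        return True  # never fetched before
    if repo_updated_at > readme_fetched_at:
        return True  # repo has been updated since the README was cached
    return (now - readme_fetched_at) > README_MAX_AGE
```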

simonw commented 8 months ago

Found some relevant documentation.

https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/about-anonymized-urls

To host your images, GitHub uses the open-source project Camo. Camo generates an anonymous URL proxy for each file which hides your browser details and related information from other users. The URL starts with https://<subdomain>.githubusercontent.com/, with different subdomains depending on how you uploaded the image.

https://github.blog/2014-01-28-proxying-user-images/

A while back, we started proxying all non-https images to avoid mixed-content warnings using a custom node server called camo. We're making a small change today and proxying HTTPS images as well.

Proxying these images will help protect your privacy: your browser information won't be leaked to other third party services.

https://github.com/atmos/camo was archived in April 2021.

simonw commented 8 months ago

Another clue:

curl -i 'https://camo.githubusercontent.com/42a02d342f39e3dc05195df216b9f89fa84de19d94547b4c68a02ec12f0eaf2d/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f762f72656c656173652f73696d6f6e772f6461746173657474652d61746f6d3f696e636c7564655f70726572656c6561736573266c6162656c3d6368616e67656c6f67'
HTTP/2 403 
cache-control: no-cache, no-store, private, must-revalidate
content-security-policy: default-src 'none'; img-src data:; style-src 'unsafe-inline'
content-type: text/plain; charset=utf-8
server: github-camo (2d97ca31)
strict-transport-security: max-age=31536000; includeSubDomains
x-content-type-options: nosniff
x-frame-options: deny
x-xss-protection: 1; mode=block
x-github-request-id: F714:6D3F:387FC9:493449:65A6B820
accept-ranges: bytes
date: Tue, 16 Jan 2024 17:08:53 GMT
via: 1.1 varnish
x-served-by: cache-pao-kpao1770040-PAO
x-cache: MISS
x-cache-hits: 0
x-timer: S1705424933.064491,VS0,VE117
x-fastly-request-id: db6d3e2546b29d5a0aa986245ef0816997e820aa
timing-allow-origin: https://github.com
content-length: 14

Bad Signature

Note server: github-camo (2d97ca31).

simonw commented 8 months ago

I think GitHub upgraded from that Node camo to this alternative written in Go at some point, because I found this in the source code for the Go one: https://github.com/cactus/go-camo/blob/11a57d9fcc47ad5f3de37b733a03f023edf4007c/pkg/camo/proxy.go#L102-L106

    sURL, ok := encoding.DecodeURL(p.config.HMACKey, sigHash, encodedURL)
    if !ok {
        http.Error(w, "Bad Signature", http.StatusForbidden)
        return
    }
simonw commented 8 months ago

Having seen that code, my hunch is that GitHub changed their HMACKey at some point which broke my previously cached images.
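Camo-style proxies sign the target URL with an HMAC of a server-side secret key; the 64 hex characters in the URL above look like a SHA-256 digest, though that algorithm choice is an inference on my part. A sketch of why rotating the key breaks previously generated URLs:

```python
import hashlib
import hmac

def camo_signature(key: bytes, target_url: str) -> str:
    # Camo-style signature: hex HMAC digest of the target URL
    # (SHA-256 assumed here based on the digest length seen above)
    return hmac.new(key, target_url.encode("utf-8"), hashlib.sha256).hexdigest()

url = "https://img.shields.io/github/v/release/simonw/datasette-atom"
old_sig = camo_signature(b"old-secret-key", url)

# After a key rotation the server recomputes with the new key; the old
# signature baked into cached HTML no longer matches -> "Bad Signature"
new_sig = camo_signature(b"new-secret-key", url)
assert old_sig != new_sig
```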

In the DB at the moment I store both the raw markdown and the rendered HTML: https://datasette.io/content/repos/209091256

CleanShot 2024-01-16 at 09 41 57@2x

That shows that using GitHub's Markdown API doesn't produce the camo.githubusercontent.com proxied URLs.

In my code I'm calling out to github-to-sqlite which does this:

https://github.com/dogsheep/github-to-sqlite/blob/eaef8ffd3f46be6c26062237ed88b4c2202a1c44/github_to_sqlite/utils.py#L785-L796

def fetch_readme(token, full_name, html=False):
    headers = make_headers(token)
    if html:
        headers["accept"] = "application/vnd.github.VERSION.html"
    url = "https://api.github.com/repos/{}/readme".format(full_name)
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return None
    if html:
        return rewrite_readme_html(response.text)
    else:
        return base64.b64decode(response.json()["content"]).decode("utf-8")

Switching to rendering through the Markdown API would fix this by removing camo.githubusercontent.com entirely from how https://datasette.io/ works.

simonw commented 8 months ago

https://github.com/simonw/datasette.io/blob/428d631c092483326748432d4c3102bff2101320/build_directory.py#L187-L197

I could flip readme_html=False and then have a separate Markdown rendering step.

simonw commented 8 months ago

Probably easiest to cut github-to-sqlite out entirely here and instead do a fetch to https://docs.github.com/en/rest/repos/contents?apiVersion=2022-11-28#get-a-repository-readme to get the raw markdown, then send that to the https://api.github.com/markdown endpoint to render it. https://til.simonwillison.net/markdown/github-markdown-api

simonw commented 8 months ago

Pieces I need to fix this:

curl 'https://api.github.com/repos/simonw/datasette/readme' -H 'Accept: application/vnd.github.raw'

To get back the raw Markdown.

import httpx

# "headers" was assumed already defined in the original snippet; a GitHub
# token is optional for the /markdown endpoint but raises the rate limit
headers = {"Authorization": "Bearer YOUR_GITHUB_TOKEN"}

body = """
# Example

This is example Markdown
"""

response = httpx.post(
    "https://api.github.com/markdown",
    json={
        # mode=gfm would expand #13 issue links, provided you pass
        # context=simonw/datasette too
        "mode": "markdown",
        "text": body,
    },
    headers=headers,
)
if response.status_code == 200:
    markdown_as_html = response.text
simonw commented 8 months ago

Looking again at this code:

    if repos_to_fetch_readme_for:
        print("Fetching README for {}".format(repos_to_fetch_readme_for))
        github_to_sqlite_repos.callback(
            db_filename,
            usernames=[],
            auth="auth.json",
            repo=repos_to_fetch_readme_for,
            load=None,
            readme=True,
            readme_html=True,
        )

Is it JUST fetching the READMEs, or is it also populating the database with other important information?

If it's pulling other key information too then I should leave that in there, rename repos_to_fetch_readme_for to repos_to_fetch_details_for, set readme=False and readme_html=False, and then roll my own separate code to fetch and store the READMEs.

simonw commented 8 months ago

Here's the implementation of repos(): https://github.com/dogsheep/github-to-sqlite/blob/eaef8ffd3f46be6c26062237ed88b4c2202a1c44/github_to_sqlite/cli.py#L281-L312

def repos(db_path, usernames, auth, repo, load, readme, readme_html):
    "Save repos owned by the specified (or authenticated) username or organization"
    db = sqlite_utils.Database(db_path)
    token = load_token(auth)
    if load:
        for loaded_repo in json.load(open(load)):
            utils.save_repo(db, loaded_repo)
    else:
        if repo:
            # Just these repos
            for full_name in repo:
                repo_id = utils.save_repo(db, utils.fetch_repo(full_name, token))
                _repo_readme(db, token, repo_id, full_name, readme, readme_html)
        else:
            if not usernames:
                usernames = [None]
            for username in usernames:
                for repo in utils.fetch_all_repos(username, token):
                    repo_id = utils.save_repo(db, repo)
                    _repo_readme(
                        db, token, repo_id, repo["full_name"], readme, readme_html
                    )
    utils.ensure_db_shape(db)

def _repo_readme(db, token, repo_id, full_name, readme, readme_html):
    if readme:
        readme = utils.fetch_readme(token, full_name)
        db["repos"].update(repo_id, {"readme": readme}, alter=True)
    if readme_html:
        readme_html = utils.fetch_readme(token, full_name, html=True)
        db["repos"].update(repo_id, {"readme_html": readme_html}, alter=True)

I'm going to assume there are other reasons to call it and rename the variable to repos_to_fetch.

simonw commented 8 months ago

I'm also going to drop that --force-fetch-readmes flag because I don't know what I'd use it for.

simonw commented 8 months ago

Last question: how best to upgrade the existing records?

I'm worried about rate limits, so I don't want to force-update READMEs for everything.

I think I'll add code which, every time the build script runs, picks the three oldest repos which have camo.githubusercontent.com somewhere in their readme_html (but not in their readme) and adds them to the repos_to_fetch pile.

Then in a few days time everything should be fixed.
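That selection could be a single SQL query against the repos table. A sketch, where the column names (full_name, readme, readme_html, updated_at) are assumptions based on the schema visible at https://datasette.io/content/repos:

```python
import sqlite3

# Pick the three oldest repos whose rendered HTML still contains camo
# URLs but whose raw Markdown does not.
PICK_STALE_READMES = """
    select full_name from repos
    where readme_html like '%camo.githubusercontent.com%'
      and readme not like '%camo.githubusercontent.com%'
    order by updated_at
    limit 3
"""

def repos_needing_html_fix(conn):
    return [row[0] for row in conn.execute(PICK_STALE_READMES)]
```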

simonw commented 8 months ago

Those build logs included:

2024-01-17T01:53:25.7033961Z Fixing HTML for ['simonw/csvs-to-sqlite', 'simonw/datasette-cluster-map', 'simonw/datasette-leaflet-geojson']
2024-01-17T01:53:25.7035207Z Fetching repo details for ['simonw/datasette-edit-templates', 'simonw/csvs-to-sqlite', 'simonw/datasette-cluster-map', 'simonw/datasette-leaflet-geojson']
2024-01-17T01:53:25.7036136Z Fetching README for simonw/datasette-edit-templates
2024-01-17T01:53:25.7036592Z Fetching README for simonw/csvs-to-sqlite
2024-01-17T01:53:25.7037182Z Fetching README for simonw/datasette-cluster-map
2024-01-17T01:53:25.7037686Z Fetching README for simonw/datasette-leaflet-geojson

But... while these pages have non-broken images they still seem to be proxied through camo.githubusercontent.com: