simonw / datasette

An open source multi-tool for exploring and publishing data
https://datasette.io
Apache License 2.0
9.56k stars 691 forks source link

Incorrect URLs when served behind a proxy with base_url set #838

Closed tsibley closed 2 years ago

tsibley commented 4 years ago

I'm running datasette serve --config base_url:/foo/ …, proxying to it with this Apache config:

    ProxyPass /foo/ http://localhost:8001/ 
    ProxyPassReverse /foo/ http://localhost:8001/ 

and then accessing it via https://example.com/foo/.

Although many of the URLs in the pages are correct (presumably because they either use absolute paths which include base_url or relative paths), the faceting and pagination links still use fully-qualified URLs pointing at http://localhost:8001.

I looked into this a little in the source code, and it seems to be an issue anywhere request.url or request.path is used, as these contain the values for the request between the frontend (Apache) and backend (Datasette) server. Those properties are primarily used via the path_with_… family of utility functions and the Datasette.absolute_url method.

simonw commented 4 years ago

Have you tried this without the ProxyPassReverse directive? I'm worried that might be confusing Datasette.

This is the test I used to ensure this feature works - it scrapes all of the links on a bunch of different pages. Could it be missing something here?

https://github.com/simonw/datasette/blob/647c5ff0f3e8140f40d7f41f0874ce4e1f4df65c/tests/test_html.py#L1233-L1274

tsibley commented 4 years ago

Hmm, I haven't tried removing ProxyPassReverse, but it doesn't touch the HTML, which is the issue I'm seeing. You can read the documentation here. ProxyPassReverse is a standard directive when proxying with Apache. I've used it dozens of times with other applications.

Looking a little more at the code, I think the issue here is that the behaviour of base_url makes sense when Datasette is mounted at a path within a larger application, but not when HTTP requests are being proxied to it.

In a mount situation, it is perfectly fine to construct URLs reusing the domain and path from the request. In a proxy situation, it never is, as the domain and path in the request are not the domain and path that the non-proxy client actually needs to use. That is, links which include the Apache → Datasette request origin, localhost:8001, instead of the browser → Apache request origin, example.com, will be broken.

The tests you pointed to also reflect this in two ways:

  1. They strip a leading http://localhost, allowing such URLs in the facet links to pass, but inclusion of that in a proxy situation would mean the URL is broken.

  2. The test client emits direct ASGI events instead of actual proxied HTTP requests. The headers of these ASGI events don't reflect the way an HTTP proxy works; instead they pass through the original request path which contains base_url. This works because Datasette responds to requests equivalently at either /… or /{base_url}/…, which makes some sense in a mount situation but is unconventional (albeit workable) for a proxied app.

Apps that support being proxied automatically support being mounted, but apps that only support being mounted don't automatically support being proxied.

ChristopherWilks commented 4 years ago

I also am seeing the same issue with an Apache setup (same even w/o ProxyPassReverse, though I typically use it as @tsibley stated).

But also want to say thanks for a great tool (this issue not withstanding)!

tballison commented 4 years ago

But also want to say thanks for a great tool

+1!

simonw commented 4 years ago

Tracking ticket: #1023

simonw commented 4 years ago

OK, I've made a ton of improvements to how the base_url setting works - see tickets linked from #1023. I've just pushed out an alpha release with those changes in it: https://github.com/simonw/datasette/releases/tag/0.51a0

@tsibley @tballison @ChristopherWilks I'd really appreciate your help testing this alpha!

You can install it with:

pip install datasette==0.51a0

It should work with just ProxyPass, without needing the ProxyPassReverse setting.

psychemedia commented 4 years ago

I'm trying to run something behind a MyBinder proxy, but seem to have something set up incorrectly and not sure what the fix is?

I'm starting datasette with jupyter-server-proxy setup:

# __init__.py
def setup_nbsearch():

    return {
        "command": [
            "datasette",
            "serve",
            f"{_NBSEARCH_DB_PATH}",
            "-p",
            "{port}",
            "--config",
            "base_url:{base_url}nbsearch/"
        ],
        "absolute_url": True,
        # The following needs a the labextension installing.
        # eg in postBuild: jupyter labextension install jupyterlab-server-proxy
        "launcher_entry": {
            "enabled": True,
            "title": "nbsearch",
        },
    }

where the base_url gets automatically populated by the server-proxy. I define the loaders as:

# __init__.py
from datasette import hookimpl

@hookimpl
def extra_css_urls(database, table, columns, view_name, datasette):
    return [
        "/-/static-plugins/nbsearch/prism.css",
        "/-/static-plugins/nbsearch/nbsearch.css",
    ]

but these seem to also need a base_url prefix set somehow?

Currently, the generated HTML loads properly but internal links are incorrect; eg they take the form <link rel="stylesheet" href="/-/static-plugins/nbsearch/prism.css"> which resolves to eg https://notebooks.gesis.org/hub/-/static-plugins/nbsearch/prism.css rather than required URL of form https://notebooks.gesis.org/binder/jupyter/user/ouseful-testing-nbsearch-0fx1mx67/nbsearch/-/static-plugins/nbsearch/prism.css.

The main css is loaded correctly: <link rel="stylesheet" href="/binder/jupyter/user/ouseful-testing-nbsearch-0fx1mx67/nbsearch/-/static/app.css?404439">

simonw commented 4 years ago

OK, this should be working now. You can use the datasette.urls.static_plugins() method to generate the correct URLs in the extra_css_urls plugin hook: https://docs.datasette.io/en/latest/internals.html#datasette-urls

psychemedia commented 4 years ago

Thanks; just a note that the datasette.urls.static(path) and datasette.urls.static_plugins(plugin_name, path) items both seem to be repeated and appear in the docs twice?

tsibley commented 3 years ago

@simonw Unfortunately this issue as I reported it is not actually solved in version 0.55.

Every link which is returned by the Datasette.absolute_url method is still wrong, because it uses the request URL as the base. This still includes the suggested facet links and pagination links.

What I wrote originally still stands:

Although many of the URLs in the pages are correct (presumably because they either use absolute paths which include base_url or relative paths), the faceting and pagination links still use fully-qualified URLs pointing at http://localhost:8001.

I looked into this a little in the source code, and it seems to be an issue anywhere request.url or request.path is used, as these contain the values for the request between the frontend (Apache) and backend (Datasette) server. Those properties are primarily used via the path_with_… family of utility functions and the Datasette.absolute_url method.

Would you prefer to re-open this issue or have me create a new one?

simonw commented 3 years ago

Let's reopen this.

simonw commented 3 years ago

The biggest challenge here I think is to replicate the exact situation here this happens in a Python unit test. The fix should be easy once we have a test in place.

tsibley commented 3 years ago

Nod. The problem with the tests is that they're ignoring the origin (hostname, port) of links. In a reverse proxy situation, the frontend request origin is different than the backend request origin. The problem is Datasette generates links with the backend request origin.

tsibley commented 3 years ago

I think this could be solved by one of:

  1. Stop generating absolute URLs, e.g. ones that include an origin. Relative URLs with absolute paths are fine, as long as they take base_url into account (as they do now, yay!).
  2. Extend base_url to include the expected frontend origin, and then use that information when generating absolute URLs.
  3. Document which HTTP headers the reverse proxy should set (e.g. the X-Forwarded-* family of conventional headers) to pass the frontend origin information to Datasette, and then use that information when generating absolute URLs.

Option 1 seems like the easiest to me, if you can get away with never having to generate an absolute URL.