pypa / bandersnatch

A PyPI mirror client according to PEP 381 http://www.python.org/dev/peps/pep-0381/
Academic Free License v3.0

Implement PEP 691 JSON Simple Index Support #1138

Closed cooperlees closed 1 year ago

cooperlees commented 2 years ago

Add logic for bandersnatch to save both the HTML and JSON simple index files. This will allow people to serve both the HTML and JSON in their mirrors.

We should also update the docs and give an example of how to serve the right file based on request headers (content negotiation, "conneg"), as outlined in PEP 691.

dstufft commented 2 years ago

My suggestions here:

Write out 3 files: index.html, index.v1_html, and index.v1_json. These will map to:

Ext        Content Type
.html      text/html
.v1_html   application/vnd.pypi.simple.v1+html
.v1_json   application/vnd.pypi.simple.v1+json
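As a rough illustration of that layout, here is a minimal Python sketch that writes the three files for one project directory. The helper name `write_simple_indexes`, its arguments, and the HTML template are hypothetical and are not bandersnatch's actual code; the JSON body follows the PEP 691 project-detail shape (`meta`, `name`, `files`).

```python
import json
from html import escape
from pathlib import Path

# Extension -> content type, as in the table above.
CONTENT_TYPES = {
    ".html": "text/html",
    ".v1_html": "application/vnd.pypi.simple.v1+html",
    ".v1_json": "application/vnd.pypi.simple.v1+json",
}


def write_simple_indexes(project_dir: Path, name: str, files: list[dict]) -> list[Path]:
    """Write index.html, index.v1_html and index.v1_json for one project.

    Hypothetical helper: `files` is a list of dicts with "filename", "url"
    and "hashes" keys, mirroring the PEP 691 "files" entries.
    """
    project_dir.mkdir(parents=True, exist_ok=True)

    links = "\n".join(
        f'    <a href="{escape(entry["url"])}">{escape(entry["filename"])}</a><br/>'
        for entry in files
    )
    html_body = (
        "<!DOCTYPE html>\n<html>\n  <head>\n"
        '    <meta name="pypi:repository-version" content="1.0">\n'
        f"    <title>Links for {escape(name)}</title>\n  </head>\n  <body>\n"
        f"{links}\n  </body>\n</html>\n"
    )
    json_body = json.dumps(
        {"meta": {"api-version": "1.0"}, "name": name, "files": files},
        indent=2,
    )

    written = []
    for ext in CONTENT_TYPES:
        # Both HTML extensions carry the same body; only the served type differs.
        body = json_body if ext.endswith("json") else html_body
        path = project_dir / f"index{ext}"
        path.write_text(body, encoding="utf-8")
        written.append(path)
    return written
```

The two HTML files are byte-identical in this sketch; they exist separately only so the web server can attach a different Content-Type to each.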

For Apache, if you have mod_negotiation enabled, you can use an .htaccess file like this inside the /simple/ directory:

Options -Indexes +Multiviews

DirectoryIndex index

AddType application/vnd.pypi.simple.v1+json v1_json
AddType application/vnd.pypi.simple.v1+html v1_html

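This will:

  • Disable automatic directory listings (-Indexes) and enable content negotiation between files that share a base name (+Multiviews).
  • Resolve a request for a directory to index (DirectoryIndex index), which Multiviews then negotiates between index.html, index.v1_html, and index.v1_json.
  • Map the v1_json and v1_html extensions to their PEP 691 content types (AddType).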

You can use this in a Docker container based on the httpd Docker image, but that requires modifying the built-in config to enable mod_negotiation and to make it read .htaccess files. A Dockerfile that implements that would look like:

FROM httpd

RUN echo '\n\
    LoadModule negotiation_module modules/mod_negotiation.so\n\
    \n\
    <Directory "/usr/local/apache2/htdocs">\n\
    AllowOverride All\n\
    </Directory>' >> /usr/local/apache2/conf/httpd.conf

This can be run using docker run --rm -dit -p 8080:80 -v PATHTOBANDERWEB:/usr/local/apache2/htdocs/ theimagebuiltabove, with the .htaccess added.
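One way to check that negotiation works once the container is up is to request the same URL with different Accept headers and look at the Content-Type that comes back. The sketch below uses only the Python standard library; the helper names are made up for illustration, and the URL in the note after it assumes the port mapping from the docker run command plus a mirrored package called example, which is hypothetical.

```python
from urllib.request import Request, urlopen


def build_request(url: str, accept: str) -> Request:
    """Build a GET request that advertises a single Accept value."""
    return Request(url, headers={"Accept": accept})


def served_content_type(url: str, accept: str) -> str:
    """Fetch `url` and return the media type the server chose to send."""
    with urlopen(build_request(url, accept)) as response:
        # get_content_type() drops parameters such as "; charset=utf-8".
        return response.headers.get_content_type()
```

For example, served_content_type("http://localhost:8080/simple/example/", "application/vnd.pypi.simple.v1+json") should return the JSON content type if the .htaccess is being honored, and passing "text/html" should return text/html.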

Alternatively, you can use nginx. The adapted banderx config looks something like this:

daemon off;
user nginx;
worker_processes auto;
error_log /dev/stderr info;
pid /run/nginx.pid;

events {
    worker_connections 2048;
}

http {
    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /dev/stdout  main;

    sendfile            on;
    tcp_nopush          on;
    tcp_nodelay         on;
    keepalive_timeout   69;
    types_hash_max_size 2048;

    include             /etc/nginx/mime.types;
    default_type        application/octet-stream;

    map $http_accept $mirror_suffix {
        default ".html";

        "~*application/vnd\.pypi\.simple\.latest\+json" ".v1_json";
        "~*application/vnd\.pypi\.simple\.latest\+html" ".v1_html";

        "~*application/vnd\.pypi\.simple\.v1\+json" ".v1_json";
        "~*application/vnd\.pypi\.simple\.v1\+html" ".v1_html";

        "~*text/html" ".html";
    }

    map $arg_format $mirror_suffix_via_url {
        "application/vnd.pypi.simple.latest+json" ".v1_json";
        "application/vnd.pypi.simple.latest+html" ".v1_html";

        "application/vnd.pypi.simple.v1+json" ".v1_json";
        "application/vnd.pypi.simple.v1+html" ".v1_html";

        "text/html" ".html";
    }

    server {
        listen       80 default_server;
        listen       [::]:80 default_server;
        server_name  banderx;
        root         /data/pypi/web;
        autoindex    on;
        charset      utf-8;

        location /simple/ {
            # Uncomment to support hash_index = true bandersnatch mirrors
            # rewrite ^/simple/([^/])([^/]*)/$ /simple/$1/$1$2/ last;
            # rewrite ^/simple/([^/])([^/]*)/([^/]+)$ /simple/$1/$1$2/$3 last;

            index index$mirror_suffix_via_url index$mirror_suffix;

            types {
                application/vnd.pypi.simple.v1+json v1_json;
                application/vnd.pypi.simple.v1+html v1_html;
                text/html html;
            }

            # Uncomment to support conneg for files other than
            # index, so that /simple/foo will map to /simple/foo.html,
            # /simple/foo.v1_html, or /simple/foo.v1_json based on the
            # Accept header.
            # try_files $uri$mirror_suffix $uri $uri/ =404;
        }

        # Let us set the correct mime type for all the JSON
        location /json/ {
            default_type        application/json;
        }
        location /pypi/ {
            default_type        application/json;
        }

        error_page 404 /404.html;
        location = /404.html {
        }

        error_page 500 502 503 504 /50x.html;
        location = /50x.html {
        }
    }
}
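To see what the two map blocks actually do with a request, the selection logic can be mirrored in a few lines of Python. This is only a model of the config above, not anything nginx or bandersnatch ships: the function name `pick_suffix` is made up, `fmt` stands for the raw value of the ?format= parameter, and the model ignores the fact that nginx's index directive also checks whether each candidate file exists.

```python
import re

# Regex rules from the $mirror_suffix map, in the order they appear.
ACCEPT_RULES = [
    (r"application/vnd\.pypi\.simple\.latest\+json", ".v1_json"),
    (r"application/vnd\.pypi\.simple\.latest\+html", ".v1_html"),
    (r"application/vnd\.pypi\.simple\.v1\+json", ".v1_json"),
    (r"application/vnd\.pypi\.simple\.v1\+html", ".v1_html"),
    (r"text/html", ".html"),
]

# Exact-match rules from the $mirror_suffix_via_url map.
FORMAT_RULES = {
    "application/vnd.pypi.simple.latest+json": ".v1_json",
    "application/vnd.pypi.simple.latest+html": ".v1_html",
    "application/vnd.pypi.simple.v1+json": ".v1_json",
    "application/vnd.pypi.simple.v1+html": ".v1_html",
    "text/html": ".html",
}


def pick_suffix(accept: str = "", fmt: str = "") -> str:
    """Return the index file suffix the config above would try first."""
    # ?format= wins when it names a known type (map strings ignore case).
    suffix = FORMAT_RULES.get(fmt.lower())
    if suffix:
        return suffix
    # Otherwise the first regex that matches anywhere in Accept wins;
    # q-values are never looked at.
    for pattern, suffix in ACCEPT_RULES:
        if re.search(pattern, accept, re.IGNORECASE):
            return suffix
    return ".html"  # the map's default
```

With this model, an Accept header of text/html, application/vnd.pypi.simple.v1+json;q=0.1 selects .v1_json even though the client ranked JSON lower, because the JSON rule is tested before the text/html rule.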

The big differences between Apache and Nginx here are:

  1. Apache actually implements conneg, so it will read and interpret the Accept header and select the correct content type based on that.
    • This means that clients can control which content type they prefer, while still listing all of the content types they support using the ;q=N parameter to indicate relative preference.
  2. Nginx does not actually implement conneg; it fakes support by populating the $mirror_suffix variable through regex tests against the Accept header, with a default fallback to .html.
    • This means that it isn't going to support the q=N parameter that clients use to express which of their supported content types they prefer. This is allowed under conneg: servers are not required to return the content type the client most prefers, but it's nice if they do, since the client presumably has a reason to prefer it.
    • There's one possible bug here: ;q=0 typically disables the content type, but since the nginx config doesn't actually parse/understand the Accept header, it will ignore that qvalue as well. Using q=0 is pretty rare, so I don't think it's a particularly big deal.
  3. Nginx supports the latest aliases for our custom content types. Apache does not, because Apache's conneg doesn't let us return a different content type than the one matched in the Accept header, while Nginx does.
    • Possibly you could do this with mod_rewrite or something, I'm not sure.
  4. When conneg fails, Apache defaults to whichever version is the smallest response, while Nginx defaults to whatever version is set as the default in the map (in the above case, it's .html).
    • Apache's behavior could be weird, as different packages will default to html or json depending on which one happens to be smaller. This shouldn't be a big deal for pip, since old versions of pip asked for text/html and the PEP 691-ified pip asks for all 3.
    • It's possible there's some trick with mod_rewrite that would let you set a default that would be used when there isn't an Accept header, I'm not sure.
  5. The Nginx option supports the ?format= query parameter, which will override the Accept header if it's been specified.
    • This may be possible to replicate with mod_rewrite, I'm not sure.
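To make the q-value point in items 1 and 2 concrete, here is a small Python sketch of what honoring the Accept header involves. It is a deliberately simplified model and not Apache's actual algorithm: it ignores wildcards such as */*, breaks ties by header order, and has no smallest-response fallback. The function names are invented for the example.

```python
from typing import Optional

AVAILABLE = [
    "text/html",
    "application/vnd.pypi.simple.v1+html",
    "application/vnd.pypi.simple.v1+json",
]


def parse_accept(header: str) -> list[tuple[str, float]]:
    """Split an Accept header into (media_type, q) pairs; q defaults to 1.0."""
    prefs = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in parts[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable q as "not acceptable"
        prefs.append((media_type, q))
    return prefs


def negotiate(header: str, available: list[str] = AVAILABLE) -> Optional[str]:
    """Return the available type with the highest q, or None if none is acceptable."""
    best, best_q = None, 0.0
    for media_type, q in parse_accept(header):
        # q=0 can never beat best_q=0.0, so disabled types are skipped.
        if media_type in available and q > best_q:
            best, best_q = media_type, q
    return best
```

Given application/vnd.pypi.simple.v1+json;q=0, text/html;q=0.1, this returns text/html, whereas the regex-based nginx map would serve the JSON file.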

Personally, I would recommend sticking with nginx for banderx.

I don't think it will matter to anyone that Nginx's conneg support is really just basic regex matching rather than true conneg, unless they're purposely trying to do weird things. The ability to pick which version is the default is a really nice thing, as it lets a mirror operator decide what level of compatibility they want (my config above chooses max compatibility). The extra features supported by the nginx config (the latest alias and the ?format= URL parameter) are nice to have as well.

On the other hand, Apache's behavior of defaulting to whatever response is smallest is nice for saving bandwidth, but I think it's kind of weird that different URLs under /simple/ may end up with different defaults, seemingly at random.

dstufft commented 2 years ago

One additional thing:

The above assumes that bandersnatch is going to switch from writing just index.html files to writing the 3 files mentioned above alongside each other, which makes a lot of sense for people who want a single URL to support all of the content types available.

Some people may not want to rely on conneg, and would rather have different URLs for different content types. I think bandersnatch could support this pretty easily using either of two options:

  1. If a configuration format is introduced to filter the content types that bandersnatch will emit, then obviously you could just run multiple copies of bandersnatch with different content types filtered.
  2. Support an option to store the different content types in different root directories, so instead of something like /data/pypi/web/simple/pkgname/, if this option was turned on you would do /data/pypi/web/simple/html/pkgname/, /data/pypi/web/simple/v1+json/pkgname/, etc.
    • Using this would then mean doing something like pip install -i https://example.com/simple/v1+json/.
    • This might be YAGNI; maybe nobody actually wants to do this. It's just a random idea that popped into my head, which PEP 691 supports and which people might want.
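
The second option amounts to inserting one extra path component per content type. A minimal Python sketch of that mapping follows; the function name and the v1+html directory name are assumptions made for illustration (only html and v1+json are named above), and nothing here is an existing bandersnatch option.

```python
from pathlib import PurePosixPath

# Content type -> directory name under /simple/.
# "html" and "v1+json" come from the example above; "v1+html" is assumed.
CONTENT_TYPE_DIRS = {
    "text/html": "html",
    "application/vnd.pypi.simple.v1+html": "v1+html",
    "application/vnd.pypi.simple.v1+json": "v1+json",
}


def index_dir(web_root: str, content_type: str, package: str) -> PurePosixPath:
    """Return the per-content-type directory for one package's index."""
    return PurePosixPath(web_root) / "simple" / CONTENT_TYPE_DIRS[content_type] / package
```

For instance, index_dir("/data/pypi/web", "application/vnd.pypi.simple.v1+json", "pkgname") gives /data/pypi/web/simple/v1+json/pkgname, which matches the layout described in option 2.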

cooperlees commented 1 year ago

#1154 + #1161