pypa / bandersnatch

A PyPI mirror client according to PEP 381 http://www.python.org/dev/peps/pep-0381/
Academic Free License v3.0

Issues serving via S3 static website #1611

Open cs1jmc opened 10 months ago

cs1jmc commented 10 months ago

I've run into an issue when trying to pull packages from a bucket-backed static site, but I can't tell whether the cause is my config or a change in static site behaviour (and how pip deals with it):

WARNING: Skipping page http://<bucket name.region>.amazonaws.com/mirror/web/simple/pillow/ because the GET request got Content-Type: binary/octet-stream. The only supported Content-Types are application/vnd.pypi.simple.v1+json, application/vnd.pypi.simple.v1+html, and text/html
ERROR: Could not find a version that satisfies the requirement pillow (from versions: none)
ERROR: No matching distribution found for pillow

I notice that curling /web/simple/<package>/ returns a 302, which leads me to think this is more of a static site / pip handling issue that would affect the bandersnatch implementation:

<html>
<head><title>302 Moved Temporarily</title></head>
<body>
<h1>302 Moved Temporarily</h1>
<ul>
<li>Code: Found</li>
<li>Message: Resource Found</li>
<li>RequestId:</li>
<li>HostId:</li>
</ul>
<hr/>
</body>
</html>

My current deployment of bandersnatch uses the template below as the base for the configuration:

[mirror]

directory = /{{ s3_bucket_name }}/{{ s3_file_prefix }}
storage-backend = s3
diff-file = /{{ s3_bucket_name}}/{{ s3_file_prefix }}/{{ s3_diff_file }}

json = false
master = https://pypi.org
timeout = 60
hash-index = false
workers = 6
stop-on-error = false
delete-packages = true

[s3]

region_name = {{ aws_region }}
aws_access_key_id = {{ s3_access_key }}
aws_secret_access_key = {{ s3_secret_key }}
endpoint_url = {{ s3_endpoint_url }}
signature_version = s3v4

[plugins]
enabled =
    exclude_platform
    allowlist_project

[blocklist]
platforms =
    macos
    freebsd

[allowlist]
packages =
    {%+ for package in package_allowlist -%}{{ package }}
    {% endfor %}

I'm wondering whether this is a misconfiguration on my part or a recent change on the AWS side that breaks this design.

cooperlees commented 10 months ago

This is definitely a serving configuration issue. To make pip happy, S3 needs to send the HTTP header Content-Type: text/html if you're serving an index.html, or application/vnd.pypi.simple.v1+json if you're serving the JSON file.

My quick search (linked above) says there is no default and you're somehow sending Content-Type: binary/octet-stream. So correcting that should help fix the issue.
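To make the requirement concrete, here is a minimal Python sketch that checks a Content-Type header value against the three types listed in the pip warning quoted above. The function name is hypothetical and this is an illustrative check, not pip's own logic; ignoring parameters such as a charset and comparing case-insensitively are assumptions.

```python
# Content types taken from the pip warning quoted earlier in this thread.
SUPPORTED_CONTENT_TYPES = {
    "application/vnd.pypi.simple.v1+json",
    "application/vnd.pypi.simple.v1+html",
    "text/html",
}


def is_supported_index_content_type(header_value: str) -> bool:
    """Return True if the header value names one of the supported types.

    Hypothetical helper: parameters such as "; charset=utf-8" are dropped
    and the comparison is case-insensitive (both are assumptions).
    """
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in SUPPORTED_CONTENT_TYPES


print(is_supported_index_content_type("binary/octet-stream"))
print(is_supported_index_content_type("text/html; charset=utf-8"))
```

The first call corresponds to the header S3 is sending in the report above; the second corresponds to what an index.html should be served with.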

I'm happy to take documentation updates to https://bandersnatch.readthedocs.io/en/latest/storage_options.html#amazon-s3 (or its source file) if you feel our docs are lacking. Sadly, I've never set up an S3-based mirror, so I cannot help much more here.

cs1jmc commented 10 months ago

I've taken a second look at things with a fresh pair of eyes. I think you pointed me in the right direction with the Content-Type.

From what I can tell, the bandersnatch S3 plugin isn't specifying a MIME type when doing a PutObject to S3, which results in AWS giving the object the default of binary/octet-stream:

aws s3api head-object --bucket <bucketname> --key web/simple/index.html
{
    "AcceptRanges": "bytes",
    "LastModified": "2023-11-27T12:24:18+00:00",
    "ContentLength": 422,
    "ETag": omitted,
    "ContentType": "binary/octet-stream",
    "ServerSideEncryption": "AES256",
    "Metadata": {}
}
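For comparison, a minimal sketch of an upload that sets the type explicitly. It assumes a boto3-style S3 client whose put_object call accepts a ContentType argument; the helper names are hypothetical, and this is not how bandersnatch or s3path currently write files.

```python
import mimetypes

# Fallback matching what S3 reported above when no type was supplied.
DEFAULT_CONTENT_TYPE = "binary/octet-stream"


def guess_content_type(key: str) -> str:
    """Guess a Content-Type from the object key's extension (hypothetical helper)."""
    guessed, _encoding = mimetypes.guess_type(key)
    return guessed or DEFAULT_CONTENT_TYPE


def put_with_content_type(client, bucket: str, key: str, body: bytes) -> str:
    """Upload body to bucket/key with an explicit Content-Type.

    client is assumed to expose put_object like a boto3 S3 client.
    Returns the Content-Type that was sent.
    """
    content_type = guess_content_type(key)
    client.put_object(Bucket=bucket, Key=key, Body=body, ContentType=content_type)
    return content_type
```

With boto3 installed, the client could be created with boto3.client("s3"); a key ending in .html would then be stored as text/html rather than the default.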

From some surface level digging it looks like S3Path is being used to get the files to S3 and there's conversation about passing the Content-Type as a parameter in an existing issue:

https://github.com/liormizr/s3path/issues/83#issuecomment-869729917

I sadly lack the talent and knowledge on bandersnatch to know how to go about fixing things. (If what I mention sounds right)

cooperlees commented 10 months ago

Ahh, it seems if it is set @ upload / write time, then this is indeed a bandersnatch bug. Nice find.

I'm asking on that issue whether there are plans for a friendlier API, and how we can edit the ContentType of existing files ...

LeoQuote commented 10 months ago

You can serve the mirror through a CDN, which could be cheaper and also lets you change the Content-Type.

Use https://github.com/pottava/aws-s3-proxy and nginx to set content-type if you're using this for internal use only.

inthecloud247 commented 10 months ago

I also encountered this bug with the s3 storage backend. Until it's fixed, I had to do a recursive fix of the Content-Type of the index.html pages in my bucket:

aws s3 cp \
       s3://MY_BUCKET/data/web/simple/ \
       s3://MY_BUCKET/data/web/simple/ \
       --exclude '*' \
       --include '*.html' \
       --no-guess-mime-type \
       --content-type="text/html" \
       --metadata-directive="REPLACE" \
       --recursive
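The same workaround can be sketched in Python: copy each .html object onto itself with a replaced Content-Type. This assumes a boto3-style S3 client (get_paginator with list_objects_v2, and copy_object); the function name is hypothetical, and the bucket and prefix are the placeholders from the command above.

```python
from typing import List


def fix_html_content_types(client, bucket: str, prefix: str) -> List[str]:
    """Re-copy every *.html object under prefix with Content-Type text/html.

    client is assumed to behave like a boto3 S3 client. Returns the keys
    that were rewritten.
    """
    fixed = []
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page without matching objects has no "Contents" entry.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if not key.endswith(".html"):
                continue
            # Copying an object onto itself is allowed when the metadata
            # is replaced; REPLACE also discards any user-defined metadata.
            client.copy_object(
                Bucket=bucket,
                Key=key,
                CopySource={"Bucket": bucket, "Key": key},
                ContentType="text/html",
                MetadataDirective="REPLACE",
            )
            fixed.append(key)
    return fixed
```

A call such as fix_html_content_types(boto3.client("s3"), "MY_BUCKET", "data/web/simple/") would mirror the aws s3 cp invocation, though other object settings may need to be passed again when the metadata is replaced.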