pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0

Consider the role of our internal bandersnatch mirror, and if it makes sense to continue to use it #8569

Open dstufft opened 4 years ago

dstufft commented 4 years ago

We currently have an infrastructure where Fastly sits in front of three origin servers:

- S3, the object store holding the package files
- our Warehouse cluster, which serves /simple/
- our internal Bandersnatch mirror

Generally speaking, the happy path for files is Fastly going directly to S3, and the happy path for /simple/ is Fastly going to our Warehouse cluster.

However, if our Warehouse cluster is down for some reason, Fastly is configured to fall back to our internal Bandersnatch mirror, to keep pip install working.

This made a lot of sense when we initially deployed it, because pip install foo depended on Warehouse being up, and the mirror meant that even if Warehouse was down, pip install would still work.

However, as part of the TUF work, we're now looking to move from generating the /simple/ pages on demand to pregenerating them and storing them in an object store. This means that, hypothetically, we could serve the /simple/ index the same way we serve the actual files, with Fastly contacting the object store directly.
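Roughly speaking, pregenerating a project's /simple/ page and pushing it into an object store could look like the sketch below (the bucket name, key layout, and file URL are made up for illustration, not what Warehouse actually does):

```python
import boto3


def simple_page(project: str, files: dict[str, str]) -> str:
    """Render a minimal PEP 503 project page: one anchor per file."""
    anchors = "\n".join(
        f'    <a href="{url}">{filename}</a><br/>'
        for filename, url in files.items()
    )
    return (
        "<!DOCTYPE html>\n"
        "<html>\n  <body>\n"
        f"    <h1>Links for {project}</h1>\n"
        f"{anchors}\n"
        "  </body>\n</html>\n"
    )


# Hypothetical bucket and key layout, purely for illustration.
s3 = boto3.client("s3")
s3.put_object(
    Bucket="pypi-simple-index",
    Key="simple/sampleproject/index.html",
    Body=simple_page(
        "sampleproject",
        {"sampleproject-3.0.0.tar.gz":
         "https://files.pythonhosted.org/.../sampleproject-3.0.0.tar.gz#sha256=..."},  # placeholder URL
    ).encode(),
    ContentType="text/html",
)
```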

This raises the question: if we have Fastly directly contacting the object store, is there a large benefit to continuing to maintain our own internal mirror? In what cases do we expect to actually fall back to the mirror once the Warehouse cluster is taken out of the "hot path" for pip install? Are any of those cases better handled by some form of native object store replication that copies one bucket to another, with Fastly just round-robining between them?
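For example, on AWS that kind of replication can be configured with boto3 roughly like this (bucket names and the IAM role are placeholders; this is a sketch of the idea, not a worked-out proposal):

```python
import boto3

s3 = boto3.client("s3")

# Replication requires versioning to be enabled on both buckets.
for bucket in ("pypi-files-primary", "pypi-files-replica"):
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )

# Replicate everything from the primary bucket to the replica; the IAM role
# must be allowed to read the source and write to the destination.
s3.put_bucket_replication(
    Bucket="pypi-files-primary",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/pypi-bucket-replication",
        "Rules": [
            {
                "ID": "replicate-all",
                "Status": "Enabled",
                "Prefix": "",
                "Destination": {"Bucket": "arn:aws:s3:::pypi-files-replica"},
            }
        ],
    },
)
```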

This also reminds me that we probably have to disable the bandersnatch mirror once TUF support lands, until bandersnatch gains support for TUF as well.

@ewdurbin @woodruffw @cooperlees @di

cooperlees commented 4 years ago

Can someone please open an issue once it's fully known what bandersnatch will need to do for TUF, and I'll make it so :)

dstufft commented 4 years ago

Yeah, will do for sure. I'm pretty sure the answer is just "more files to copy", but it will be good to have some clarity on the issue.

cooperlees commented 4 years ago

Just so it's known, I'm down to help/own the bandersnatch mirror for PyPI if we move forward with it (if it's of an advantage). I propose to ansible it all and commit that to https://github.com/python/pypi-infra.

@ewdurbin and I were also keen to change it to use S3, but @techalchemy never came forward with the S3 plugin he'd supposedly written.

But I'm also totally fine with PyPI not needing this, if that makes sense. E.g. does it even make sense to run a second origin on a separate cloud if we have enough credits? I only see DB sync complexities there.

cooperlees commented 3 years ago

Ping for update here.

Today I released bandersnatch 5.0.0, a much more asyncio + uvloop capable version. I'd love to upgrade the PyPI internal mirror to this version using our Docker container. I see it still exists at mirror.dub1.pypi.io and is writing to a POSIX mount:

/dev/xvdb        12T  9.5T  2.5T  80% /data

@ewdurbin - Any chance of a small test instance + a small /data mount so I can model the new setup?

I will then ansible it all and put up a PR to https://github.com/python/pypi-infra so we can discuss the implementation before moving to the prod instance (or we could replace prod with the test instance once it's ready, and prod becomes the new test). Whatever you prefer here.

S3 support has not made it to bandersnatch yet, but we do have a contributor adding it: https://github.com/pypa/bandersnatch/pull/886

Thanks!

cooperlees commented 2 years ago

Ok - we now have an S3-capable version of bandersnatch published to Docker Hub that we can use for a proof of concept, if it's still of benefit to move from /dev/xvdb to an S3 backend for the PyPI Disaster Recovery mirror.

We've documented the setup here: https://bandersnatch.readthedocs.io/en/latest/storage_options.html#amazon-s3

Once again, I'm happy to help with the work here. It could be a PyCon sprint objective that I'm happy to help drive. To test this, I'd love to get a new cloud instance somewhere to run bandersnatch (preferably from a Docker container) in the new S3 mode, plus a bucket to fill. Then we can sync away, point the CDN at the S3 bucket (I believe), and turn down the old instance once it's no longer being used. I am a novice at this though; I am no S3 guru.
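As a rough pre-cutover sanity check, something like the sketch below could confirm the bucket actually has content before we repoint the CDN (the bucket name and key layout here are assumptions on my part; check the bandersnatch docs linked above for the real layout):

```python
import boto3

s3 = boto3.client("s3")
bucket = "pypi-bandersnatch-poc"  # hypothetical proof-of-concept bucket

# List a handful of keys under the simple index prefix to confirm the sync
# actually wrote something there.
resp = s3.list_objects_v2(Bucket=bucket, Prefix="simple/", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Spot-check that a well-known project's index page exists before cutover.
s3.head_object(Bucket=bucket, Key="simple/pip/index.html")
```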

Happy to start a document with a lot more details etc. to plan it all out.

cc: @dstufft + @ewdurbin