pypa / bandersnatch

A PyPI mirror client according to PEP 381 http://www.python.org/dev/peps/pep-0381/
Academic Free License v3.0

local double copy of PyPI via HTTP #51

Closed gpcimino closed 6 years ago

gpcimino commented 6 years ago

Hi all,

In my organization we keep a local copy of PyPI using bandersnatch; let's call this copy the DMZ repo. We would like to create, and keep up to date, a second copy of PyPI (let's call it the internal repo) in another zone of our network by mirroring from the DMZ repo.

The DMZ repo exposes the file structure downloaded by bandersnatch through a simple HTTP server (no HTTPS). When I point bandersnatch at the DMZ repo to create the internal copy, I get this error:

2018-07-26 16:54:47,058 ERROR bandersnatch.master - Master URL http://dmz is not https scheme
Traceback (most recent call last):
  File "/opt/repos/PyPI/venv/bin/bandersnatch", line 11, in <module>
    sys.exit(main())
  File "/opt/repos/PyPI/venv/lib64/python3.6/site-packages/bandersnatch/main.py", line 112, in main
    args.func(config)
  File "/opt/repos/PyPI/venv/lib64/python3.6/site-packages/bandersnatch/main.py", line 21, in mirror
    config.getfloat('mirror', 'timeout'),
  File "/opt/repos/PyPI/venv/lib64/python3.6/site-packages/bandersnatch/master.py", line 35, in __init__
    raise ValueError("Master URL {0} is not https scheme".format(url))
ValueError: Master URL http://dmz is not https scheme

Apparently bandersnatch requires HTTPS. If I remove this conditional statement in master.py, bandersnatch appears to work fine.

Why is HTTPS enforced?

Thanks GP

cooperlees commented 6 years ago

This was implemented before I started contributing, but I personally like to know that I'm not exposed to a man-in-the-middle attack, since 99.99% of people are syncing from PyPI to local bandersnatch repos. It's also trivial to add HTTPS to your "DMZ" instance, so I would recommend that route.

Even though you're hitting your DMZ mirror, you're still hitting PyPI's XML-RPC API to calculate the differences from where your "internal" mirror is (we are unable to mirror that API). You are hitting the JSON API and pulling the packages down locally from your DMZ, though.
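
To illustrate the difference calculation described above, here is a minimal Python sketch. It assumes PyPI's XML-RPC `changelog_since_serial` method, which returns `(name, version, timestamp, action, serial)` tuples; the helper names `pypi_client` and `changed_packages` are hypothetical and are not bandersnatch's actual code.

```python
import xmlrpc.client


def pypi_client(url="https://pypi.org/pypi"):
    # Building the proxy does not contact the server; calls on it do.
    return xmlrpc.client.ServerProxy(url)


def changed_packages(client, since_serial):
    """Map each package changed after since_serial to its highest serial.

    `client` is any object exposing changelog_since_serial(serial),
    for example the proxy returned by pypi_client().
    """
    todo = {}
    for name, _version, _timestamp, _action, serial in client.changelog_since_serial(
        since_serial
    ):
        # Keep only the newest serial seen for each package name.
        todo[name] = max(todo.get(name, 0), serial)
    return todo
```

A mirror would then fetch only the packages named in the returned mapping, which is why this call needs a server that knows the real serials.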

This all said, I will accept a PR that defaults to enforcing HTTPS only, as we are today, and allows you to negate that check.
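
A rough Python sketch of what such an opt-out could look like is given here. The `allow_non_https` parameter and this simplified `Master` class are assumptions for illustration only; the real constructor in master.py takes other arguments and may differ.

```python
from urllib.parse import urlparse


class Master:
    """Simplified stand-in for bandersnatch's Master (illustration only)."""

    def __init__(self, url, timeout=10.0, allow_non_https=False):
        # HTTPS stays enforced by default; the caller must opt out explicitly.
        if urlparse(url).scheme != "https" and not allow_non_https:
            raise ValueError("Master URL {0} is not https scheme".format(url))
        self.url = url
        self.timeout = timeout
```

With this shape, `Master("http://dmz")` still raises `ValueError`, while `Master("http://dmz", allow_non_https=True)` is accepted. The flag would presumably be read from the `mirror` section of the configuration file alongside `timeout`.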

Thanks for asking. Feel free to ask any more questions.

gpcimino commented 6 years ago

@cooperlees thanks for the clear answer.

I think the rsync option from DMZ to internal is the way to go. This is what we have in place now; we are just not quite happy with the performance of rsync given the gazillion files in PyPI.

Due to my lack of knowledge of PEP 381, I only now realized that the internal bandersnatch needs a connection to the (real) PyPI server anyway for the XML-RPC API call. Just for my awareness: what's the best way to have the XML-RPC server-side part in the DMZ? Should I use something like devpi?

Thanks Giampaolo

cooperlees commented 6 years ago

Yeah, the 1000s of files and directories do not help. This is why I also suggested btrfs (or it could be zfs) differential sends. They will be fast because they work at the block level, based on snapshots.

devpi also can't replicate the XML-RPC API; nothing really can, as it's the source of truth for the package and mirror serials that PyPI calculates in real time.

I've never used devpi, but for its PyPI operations it could run in your DMZ and cache all the PyPI packages that your infra needs, although it seems to run as a proxy. devpi does have a "replication" feature that could possibly satisfy your needs. I don't know all your goals, but it sounds like it could do what you want, especially if the replication also syncs the PyPI cache, which I am not sure it does.

cooperlees commented 6 years ago

Were you able to find a workaround for this? If so, please share it and we'll close this issue.

gpcimino commented 6 years ago

Sorry for the late answer. I gave up, and we are sticking with rsync from the DMZ to the internal mirror, even if the performance is not outstanding. Thanks for your help.