paylogic / pip-accel

pip-accel: Accelerator for pip, the Python package manager
https://pypi.python.org/pypi/pip-accel
MIT License

adding option to store bdist files in S3 as a second-level cache #33

Closed adamfeuer closed 9 years ago

adamfeuer commented 9 years ago

This is so you can check out the code easily. I haven't done documentation yet - if this looks ok, I'll do the docs.

This change really accelerates builds on ephemeral Elastic Bamboo instances, since their local filesystems start with an empty cache.

jzoldak commented 9 years ago

I'm also interested in pip-accel with the upload to S3. What would help to get this PR moved along? Fix the broken test? @xolox are there other concerns?

adamfeuer commented 9 years ago

Since there's interest, I'll update documentation and see if I can fix the test. I also want to get this running on one of my build boxes at work.

coveralls commented 9 years ago

Coverage Status

Coverage decreased (-2.43%) when pulling 124465ff2f931959348cf647a9a0acdcd5dc0fc5 on adamfeuer:s3-cache into 84ab4bfada8c27fbfa3c72b22d0d7b1a861a5105 on paylogic:master.

jzoldak commented 9 years ago

@adamfeuer Awesome. I fixed the test in PR #1 on your fork. I'm working on adding some tests.

adamfeuer commented 9 years ago

I fixed the problem on my end too and pushed it; looks like the test passes now:

https://travis-ci.org/paylogic/pip-accel/builds/39992802

:-)


adamfeuer commented 9 years ago

I also committed some basic documentation.

coveralls commented 9 years ago

Coverage Status

Coverage decreased (-2.43%) when pulling e7b808623bdfd9f994ae8fd49213beb4be98332a on adamfeuer:s3-cache into 84ab4bfada8c27fbfa3c72b22d0d7b1a861a5105 on paylogic:master.

xolox commented 9 years ago

@adamfeuer and @jzoldak: To clear up any worries, I'm planning to merge this soon (hopefully this week), because I definitely see how this is a useful feature for a lot of (potential) users, even if I don't have a direct use for it myself at the moment. Sorry, by the way, for being slow to respond, and thanks for the messages reminding me; I've been swamped at work due to recent organizational changes that I won't get into here ;-).


The most important thing for me is finding a way to make the Boto dependency optional while still keeping things KISS for the user (a bit more work for me in preparing releases would not be a problem, I always automate release management anyway :-).

I've been thinking about 1) the most elegant way to implement this (I want my projects to be of high software engineering quality) vs. 2) a pragmatic way to implement this (I want to have the feature merged soon so as not to waste the efforts of @adamfeuer and the enthusiasm of @jzoldak). Some ideas based on those conflicting considerations:

1. Architecture astronaut mode: Maybe the most elegant way would be a plug-in mechanism that uses e.g. setuptools entry points (the best thing the Python world has for automatic discovery of plug-ins installed as separate packages, AFAIK). The problem is that it requires a bit of "plumbing" and I'm not sure how easy it is to generalize this pull request into a generic "storage backend" plug-in layer (I still have to dive into the code).
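For reference, the entry-point discovery described in option 1 might be sketched roughly like this. The group name `pip_accel.cache_backends` is purely hypothetical here; nothing in pip-accel defines it:

```python
# Sketch of the plug-in idea from option 1: separately installed packages
# register a cache backend under a shared entry point group, and pip-accel
# discovers them at runtime. The group name below is hypothetical.
import pkg_resources

def discover_cache_backends(group='pip_accel.cache_backends'):
    """Map entry point names to the backend objects they advertise."""
    backends = {}
    for entry_point in pkg_resources.iter_entry_points(group):
        # load() imports the plug-in module and returns the registered object.
        backends[entry_point.name] = entry_point.load()
    return backends
```

A hypothetical pip-accel-s3 package would then advertise itself via something like `entry_points={'pip_accel.cache_backends': ['s3 = pip_accel_s3:S3Backend']}` in its setup.py and be picked up automatically once installed.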

2. KISS mode: A pragmatic way would be to use setuptools' support for 'extras' to define the Boto dependency, but keep all of the code in the main pip-accel package. Then installing pip-accel as pip install pip-accel[s3] could pull in the Boto dependency. The code base wouldn't need any knowledge of this extra, it would just do this:

try:
    import boto
except ImportError:
    disable_the_feature()
else:
    enable_the_feature()
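The extras themselves could be declared in setup.py along these lines (a sketch only; names and layout are illustrative, not the actual pip-accel setup.py):

```python
# setup.py fragment (sketch): boto becomes an optional 'extra' so that
# plain 'pip install pip-accel' never pulls it in.
from setuptools import setup, find_packages

setup(
    name='pip-accel',
    packages=find_packages(),
    install_requires=['pip'],
    extras_require={
        # Only 'pip install pip-accel[s3]' installs the S3 dependency.
        's3': ['boto'],
    },
)
```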

3. KISS^2 mode: Setuptools' extras support in pip seems to have some rough edges, so an even more pragmatic way would be to keep the pip-accel package and additionally introduce a pip-accel-s3 package that pulls in pip-accel and Boto and provides the glue needed to tie them together (i.e. the meat of this pull request). The main package could try to import the 'sub-package' and gracefully handle the ImportError.

I guess the latter two options provide the quickest (most pragmatic) way to get this merged without a hard dependency on Boto. I would still find the first option the most elegant one, but it requires a bit more work up front (to pay off in a hypothetical future ;-). Maybe option two is the sweet spot for merging this.

xolox commented 9 years ago

I was wondering, what was your reasoning for putting the binary distribution cache in S3 but not the source distribution cache? (given what you've explained about using hosts with ephemeral local storage)


(deep into the internals of pip-accel)

The reason I ask is that pip-accel needs pip to generate a dependency set, and to do that pip requires unpacked source distributions. This implies that every pip-accel install invocation needs those source distributions to be available for unpacking, even if the complete dependency set can be satisfied using cached binary distributions :-(

This points to a really ugly and inefficient part of the communication between pip and pip-accel which is (AFAICT) very complex to solve because it would require pip-accel to re-implement large parts of the logic already available in pip but impossible to re-use effectively.


(coming back up to a high level)

Knowing what I've explained above, would you choose to also store the source distribution archives in S3? It may seem kind of silly because whether you're downloading from PyPI, an external distribution site or an S3 bucket, it's all HTTP, so who cares? Some reasons I can think of why you might care:

adamfeuer commented 9 years ago

@xolox Regarding why I didn't put the sdists on S3 too - I was trying to keep things simple, and since I am depending on being able to reach PyPI anyway, I figured that was ok.

Regarding how to deal with the Boto dependency - I like your #2 best. I considered making a separate package that would depend on pip-accel (pip-accel-s3), but thought that most people would have a hard time finding out about it; often the hardest thing about using software is knowing what software to use. :-)

Anyway, I'm willing to implement #2, update the docs, and push it to the branch. I'll take a shot at that tomorrow.

jzoldak commented 9 years ago

FWIW, because of the way pip does unpacking/dependencies, and also because I'm primarily looking to optimize CI worker spin-up, I would prefer an implementation that allows uploading the source distributions to S3 as well. I can refactor or add that feature after we get through this bit.

adamfeuer commented 9 years ago

@jzoldak Ok, that sounds good. I'd like to keep this PR as small as possible.

coveralls commented 9 years ago

Coverage Status

Coverage decreased (-3.82%) when pulling 601d1184654f20551482c41c46a803f3ea01905c on adamfeuer:s3-cache into 84ab4bfada8c27fbfa3c72b22d0d7b1a861a5105 on paylogic:master.

adamfeuer commented 9 years ago

Ok, I implemented @xolox's suggestion #2 for making boto an optional dependency: we now check whether the S3 cache environment variables are configured, and if so, we check for boto. If either check fails, we don't use the S3 cache.

I moved the cache code to __init__.py so that it can be more easily used by other routines - for instance if we want to implement the sdist caching.

Finally, I updated the docs to indicate that if you want to use the S3 cache feature, it's up to you to install boto yourself; otherwise the S3 cache will not be used.
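The guard described above can be sketched as follows. The variable name `PIP_ACCEL_S3_BUCKET` is illustrative (the merged release later renamed the real variables to use the `$PIP_ACCEL_` prefix):

```python
# Sketch of the optional-dependency guard: use the S3 cache only when it is
# configured AND boto is importable. The env var name is illustrative.
import os

def s3_cache_allowed():
    """Return True only if the S3 cache is configured and boto is available."""
    if not os.environ.get('PIP_ACCEL_S3_BUCKET'):
        # Cache not configured: silently fall back to the local cache.
        return False
    try:
        import boto  # noqa: F401 -- the optional dependency
    except ImportError:
        # boto missing: degrade gracefully instead of crashing.
        return False
    return True
```

Keeping the import inside the function means merely installing pip-accel never requires boto; only users who opt in to the S3 cache need it.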

I ran some manual tests and it works. I can write some unit tests - using the moto mock_s3 class - let me know if you want me to do that. It would add a build-time dependency on moto, but would enable us to write fast tests that don't actually hit AWS S3.

If you're ok with the manual testing for now - this is ok to merge.

jzoldak commented 9 years ago

@adamfeuer I'm a strong advocate of tests, and would love to see unit tests added. Again, I can circle back and add them later if you don't have time or inclination right now.

Perhaps we create a test_requirements.txt file to separate the runtime dependencies from the test dependencies: leave the requirements.txt logic in setup.py alone, and in the .travis.yml file use install: pip install -r test_requirements.txt.
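Concretely, that split might look something like this (file contents are illustrative):

```
# test_requirements.txt -- dependencies needed only to run the test suite
moto
boto
```

```yaml
# .travis.yml fragment (sketch)
install:
  - pip install -r requirements.txt
  - pip install -r test_requirements.txt
```

This way, people who only install and run pip-accel never pull in moto or boto; only CI and developers running the tests do.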

adamfeuer commented 9 years ago

@jzoldak I like tests too - and will add them, using a test_requirements.txt, and will update the travis.yml file as you suggest.

The reason I was hesitant was that the tests will depend on moto and boto, but with the setup you suggest, only people running the tests will be impacted.

coveralls commented 9 years ago

Coverage Status

Coverage decreased (-3.82%) when pulling 6ddbb42d60e846b20b24e7f359b1d3b13110c2ae on adamfeuer:s3-cache into 84ab4bfada8c27fbfa3c72b22d0d7b1a861a5105 on paylogic:master.

xolox commented 9 years ago

Hi @adamfeuer and @jzoldak,

I've merged this pull request and published the result as pip-accel 0.14. Note that I changed literally all of the code introduced in the pull request. That's not to say there was anything wrong with the code, I just had bigger plans (the previously described cache backend plug-in mechanism).

If you are interested you can review the result in the merge commit 8ff50a9aebd0b39ac0155744b5227b09855463fe. I also renamed the relevant environment variables in order to consistently use the $PIP_ACCEL_ prefix, because I don't want anyone to get confused about where pip stops and pip-accel begins. The environment variables are documented in the readme as suggested in the pull request.

I hope this works well for you guys. Thanks to both of you, both for the pull request and the discussion. Feedback is welcome!

Some relevant notes regarding previous discussions in this pull request:

xolox commented 9 years ago

Right now there's no automated test yet, but I did verify manually that the Amazon S3 backend works in cPython 2.6, cPython 2.7, cPython 3.4 and PyPy 2.7 (you can run make test or just tox to verify, but note that cPython 2.7 and PyPy 2.7 generate conflicting binary distribution archives, I hope to fix this tonight).

Fixed in d43e2601d908bbd587c6b4ee2687c5a39823ae8a.

adamfeuer commented 9 years ago

@xolox Awesome! I will try it out. I will also try to write an automated test with the moto mock_s3 class.

jzoldak commented 9 years ago

@xolox and @adamfeuer thanks this is great. I get the point about caching source distribution archives as being non-trivial and don't see it as an urgent need.