paylogic / pip-accel

pip-accel: Accelerator for pip, the Python package manager
https://pypi.python.org/pypi/pip-accel
MIT License
308 stars 35 forks

Store cache on AWS S3? #32

Closed · adamfeuer closed this issue 9 years ago

adamfeuer commented 9 years ago

Hi,

Would you be interested in a pull request for some code that lets you put the caches (PIP_DOWNLOAD_CACHE, PIP_ACCEL_CACHE) in an AWS S3 bucket?

We're using the Atlassian Elastic Bamboo Continuous Integration system, and want to store our binary distributions made by pip-accel somewhere all our Elastic Bamboo build workers can get them. This would make our builds a lot faster.

cheers adam

xolox commented 9 years ago

Hi Adam!

I've thought about similar features in the past. The main reason I didn't commit to any of them is that I felt pip-accel's assurance of binary compatibility (encoding the Python major/minor version in the file names in the binary cache) was too weak to guarantee compatibility across systems.

In case that wasn't clear enough, consider a Python package containing a compiled component (a *.so file). It's compiled somewhere and dynamically linked to the system packages on the system where it is compiled. If it is later used on another system that lacks the same system packages or has different versions installed, the binary distribution archive may fail to work or crash at any later point when it is used.

I always strive to build something that Just Works (TM) and what you are suggesting could be opening a can of worms (with regards to binary compatibility). On the other hand, as I hinted in the first paragraph, I definitely understand why this would be a useful feature for you and others.

Assuming I haven't discouraged you at this point: Of course this would be an "opt-in" feature because credentials are needed to upload to S3. So as long as the documentation clearly states the risks I mentioned above, I guess it can't do a lot of harm :-)

By the way, just out of curiosity: Have you considered s3fs? And if so, have you tried it? Did you run into any issues? I'm basically curious what the respective advantages/disadvantages are of s3fs vs. "a built-in connector" as you seem to be suggesting.

adamfeuer commented 9 years ago

Hi Peter,

Thanks for the detailed reply! Yes, I have thought about the binary compatibility problem. My thought for now is to leave it in the users' hands: if they want to use an S3 cache, they can set a particular directory (S3 key prefix) for the pip binary cache for each architecture.

I've implemented this already and tried it out; it does work, and it's relatively simple to explain how to use. It uses S3 as a second-level cache, keeping the existing file system cache in place as the first-level cache.
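The two-level lookup described above can be sketched roughly as follows. A plain dict stands in for the S3 bucket so the sketch runs without boto, and all class and method names here are hypothetical, not the actual pull-request code:

```python
import os
import shutil

class TwoLevelCache:
    """Sketch of a two-level binary cache: local file system first,
    a remote store (S3 in the real code) as the fallback."""

    def __init__(self, local_dir, remote):
        self.local_dir = local_dir
        self.remote = remote  # dict of key -> file path, standing in for S3
        os.makedirs(local_dir, exist_ok=True)

    def _local_path(self, key):
        return os.path.join(self.local_dir, key)

    def get(self, key):
        path = self._local_path(key)
        if os.path.exists(path):       # first-level hit: local file system
            return path
        if key in self.remote:         # second-level hit: "download" from S3
            shutil.copy(self.remote[key], path)
            return path
        return None                    # miss: caller builds the distribution

    def put(self, key, source_path):
        shutil.copy(source_path, self._local_path(key))
        self.remote[key] = source_path  # the real code would upload to S3 here
```

The payoff is the second worker: a fresh build machine with an empty local cache still gets a second-level hit from the shared remote store.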

At the place I'm working, we have several possible architectures, and our scripts just set the S3 cache prefix for the architecture they are building.

For instance, to configure the S3 cache, right now I'm requiring two environment variables, PIP_S3_CACHE_PREFIX and PIP_S3_CACHE_BUCKET.

It's definitely 'opt-in' - if you don't have those environment variables set, all behaves as normal using the file system.

In my current code, the user is responsible for using a different prefix per architecture. I thought about deriving the architecture and OS name automatically, for instance with the Python platform module, but that seemed complicated.
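For reference, a rough sketch of what such automatic detection could look like with the platform module; the prefix scheme here is invented for illustration, not what pip-accel or the pull request actually uses:

```python
import platform
import sys

def default_cache_prefix():
    """Derive an architecture-specific cache key prefix automatically,
    so binary distributions from incompatible systems don't collide."""
    libc = platform.libc_ver()[0] or "unknown"  # glibc flavor matters for *.so files
    return "{}-{}-{}-py{}.{}".format(
        platform.system().lower(),  # e.g. 'linux'
        platform.machine(),         # e.g. 'x86_64'
        libc,
        sys.version_info[0],
        sys.version_info[1])
```

Even this misses differences in installed system libraries between hosts, which is exactly the binary compatibility caveat Peter raised, so a user-chosen prefix is arguably the honest option.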

How does that sound?

Here's the code that does the S3 caching; here's where it's called in bdist.py.

Another issue I thought of is that this modification requires the boto Python module. It's a large library and pulls in some other modules too, so you might not want to add the dependency. If not, I could make a separate Python module that depends on boto and have pip-accel check whether it's installed; if it is, pip-accel could use it. That would be sort of a plugin: a bit more complicated to set up, but possible.
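That plugin idea amounts to a soft dependency check; a minimal sketch, with hypothetical helper names, could look like this:

```python
import importlib.util

def s3_cache_available():
    """Report whether the optional boto dependency is importable,
    without actually importing it."""
    return importlib.util.find_spec("boto") is not None

def choose_cache_backend():
    """Use the S3 backend only when boto is installed; otherwise
    fall back to the plain file system cache."""
    if s3_cache_available():
        return "s3"  # the real code would import boto and use it here
    return "filesystem"
```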

Regarding s3fs, yes, I tried that first. It has several problems. First, it's not installable via Ubuntu's apt-get package manager; I couldn't find a package for it in the repos. Second, and more importantly, I couldn't get it to perform reliably for this use case, while using boto directly worked for me every time. Third, boto, being a Python module, is much easier to install and use than s3fs (which needs the fuse kernel module).

We're using this to build software based on numpy and scipy and install it on Docker images. Using pip-accel with the S3 cache on our Bamboo CI system cuts the build time from about an hour to 4 minutes!

Anyway, if this still sounds interesting, let me know and I'll send you a pull request (including updated documentation) so you can check the actual code out in a pull-request format.

adamfeuer commented 9 years ago

Here's a pull request so you can view the code easily.

xolox commented 9 years ago

I'll go ahead and close this issue now; we can continue the discussion in the pull request you created (thanks for that, by the way :-).