thought-machine / please

High-performance extensible build system for reproducible multi-language builds.
https://please.build
Apache License 2.0

Add gocloud to allow Blob Storage as a cache #1144

Closed robxu9 closed 2 years ago

robxu9 commented 4 years ago

A follow-up from #1140: It might be beneficial to add gocloud into please, in order to allow blob storage to be used as a cache backend (so backends like AWS S3, Google Cloud Storage, or Azure Blob Storage could be used).

This would allow users who run please builds in different CI providers to connect to a remote blob storage for their cache (or run their own one! there's always the S3-compatible flavour of the week version).

Gocloud provides a common interface, so configuration is universal; however, it seems to call each service's specific SDK under the hood, so it might be a heavy dependency.

sagikazarmark commented 4 years ago

While I think this is a great idea, I wonder if data retrieval charges would make this expensive enough to not be worth it.

Tatskaari commented 4 years ago

We use gocloud + gcs for remote execution and I don't think it's that expensive. @peterebden would know more. The cache is multiplexed so it should still be hitting the directory cache first.

We could probably re-write the http and dircache to follow its interfaces and then just have a single cache url config option and as long as that has a registered implementation, it should just work (tm).

@peterebden Thoughts on this? Could consolidate a lot of config.
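
A minimal sketch of the "single cache URL" idea, assuming a hypothetical `Cache` interface and a scheme-keyed registry. None of these names come from Please; gocloud's `blob.OpenBucket` dispatches on the URL scheme in a similar way, but this example uses only the standard library:

```go
package main

import (
	"fmt"
	"net/url"
)

// Cache is a hypothetical minimal interface, not Please's real one.
type Cache interface {
	Name() string
}

type dirCache struct{ path string }

func (c dirCache) Name() string { return "dir:" + c.path }

type httpCache struct{ base string }

func (c httpCache) Name() string { return "http:" + c.base }

// opener builds a Cache from a parsed URL.
type opener func(u *url.URL) (Cache, error)

// registry maps URL schemes to the implementation that handles them.
var registry = map[string]opener{
	"file":  func(u *url.URL) (Cache, error) { return dirCache{path: u.Path}, nil },
	"http":  func(u *url.URL) (Cache, error) { return httpCache{base: u.String()}, nil },
	"https": func(u *url.URL) (Cache, error) { return httpCache{base: u.String()}, nil },
}

// Open picks the implementation registered for the URL's scheme.
func Open(rawURL string) (Cache, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, err
	}
	open, ok := registry[u.Scheme]
	if !ok {
		return nil, fmt.Errorf("no cache registered for scheme %q", u.Scheme)
	}
	return open(u)
}

func main() {
	// The example URLs are made up; "s3" is deliberately unregistered.
	for _, raw := range []string{"file:///var/cache/please", "https://cache.example.com", "s3://bucket"} {
		c, err := Open(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(c.Name())
	}
}
```

With this shape, supporting a new backend only means registering another scheme; the config would still carry a single URL.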

Tatskaari commented 4 years ago

@robxu9 We have some concerns around adding gocloud to the main please binary. As you said, it's quite a large dependency, and it opens the door to adding many providers for different back-ends. Instead, we'd prefer to add gocloud to //tools/http-cache. Please can be configured to talk to a normal HTTP cache, and this can act as a "gateway" to do OAuth and s3/gcp stuff.

I'm much happier adding specific stuff to this as it's not distributed as part of the main please distribution. How does this approach suit you?

robxu9 commented 4 years ago

I think that sounds like a reasonable approach, though in that case I would also question whether we need this to be part of Please at all - I currently use rclone to do exactly that and it makes sense to offload that functionality to a project that is more dedicated to it. What do you think?

TyBrown commented 4 years ago

I was looking at using bazel-remote with Please, since it maintains a local dir cache and also uploads/downloads objects from an S3-compatible object store. If it has the object in the bazel-remote local dir, it doesn't have to ask the upstream object store for it.

The kicker is that the CAS store on bazel-remote expects the key to be a sha256 hash, and I've recently learned (thanks @Tatskaari for being awesome and answering my questions) that Please cache hashes are actually a hash of the input to the rule that generated that file, not of the file itself.

Having something like gocloud support in the //tools/http-cache project would be really awesome, since a smaller user like myself could take advantage of that really easily.
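
To illustrate the mismatch between a content-addressed key and an input-derived key, here is a rough sketch. Both helper functions are hypothetical, and the use of sha256 for the input key is an assumption made for the example; Please's real key computation is more involved than this:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentKey is what a content-addressed store expects:
// the sha256 of the stored bytes themselves.
func contentKey(output []byte) string {
	sum := sha256.Sum256(output)
	return hex.EncodeToString(sum[:])
}

// inputKey hashes what goes INTO a rule (sources and command).
// Hypothetical simplification, not Please's actual algorithm.
func inputKey(sources [][]byte, command string) string {
	h := sha256.New()
	for _, s := range sources {
		h.Write(s)
	}
	h.Write([]byte(command))
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	sources := [][]byte{[]byte("package main\n")}
	output := []byte("compiled binary bytes")

	in := inputKey(sources, "go build")
	out := contentKey(output)

	// Same rule, two different keys: a store that verifies the key
	// against the content's hash would reject a blob stored under `in`.
	fmt.Println("input key:  ", in)
	fmt.Println("content key:", out)
	fmt.Println("keys match: ", in == out)
}
```

The input key can be computed before the rule runs, which is what makes a cache lookup possible; the content key only exists once the output has been built.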

Tatskaari commented 4 years ago

@TyBrown Nice! I think this could work; however, I don't plan on productionising the http cache. It will never have things like health checking or status reports.

Saying that, we could make it behave like a proxy in front of a production ready cache (s3 or nginx or whatever). The basic idea is that it's a background process that is spun up on your CI worker just before please is invoked and forwards cache requests to your cache(s).

This isn't top of my priority list right now but I will endeavour to get around to it soon (tm)
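
The proxy idea could look roughly like the following sketch, which uses only the Go standard library. `newGateway` is a hypothetical helper, not code from //tools/http-cache, and the in-process test servers stand in for a real upstream cache:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
)

// newGateway returns a handler that forwards every cache request to upstream.
// A real gateway would attach credentials (e.g. an OAuth token) before forwarding.
func newGateway(upstream string) (http.Handler, error) {
	target, err := url.Parse(upstream)
	if err != nil {
		return nil, err
	}
	if target.Scheme == "" || target.Host == "" {
		return nil, fmt.Errorf("upstream %q must be an absolute URL", upstream)
	}
	return httputil.NewSingleHostReverseProxy(target), nil
}

func main() {
	// Stand-in for the production cache (S3, nginx, ...):
	// it just reports the request it received.
	backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "backend saw %s %s", r.Method, r.URL.Path)
	}))
	defer backend.Close()

	gateway, err := newGateway(backend.URL)
	if err != nil {
		panic(err)
	}
	// On a CI worker this would listen on localhost before please is invoked.
	front := httptest.NewServer(gateway)
	defer front.Close()

	resp, err := http.Get(front.URL + "/somehash")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

Please would then be pointed at the gateway's local address as if it were an ordinary HTTP cache.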

peterebden commented 4 years ago

I think we'd need to use the action cache for bazel-remote rather than the CAS. It seems to be very similar to the remote execution API whereas our HTTP cache is a lot simpler - we store a single tarball of the outputs keyed by the input hash. I presume we could make use of their action cache for that but haven't looked at it much.
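
As a rough illustration of "a single tarball of the outputs keyed by the input hash", here is an in-memory sketch. The `store` type and its PUT/GET mapping are assumptions for the example, not the actual HTTP cache implementation or its exact protocol:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// store keeps one opaque blob (the tarball of a rule's outputs) per key.
type store struct {
	mu    sync.RWMutex
	blobs map[string][]byte
}

func newStore() *store { return &store{blobs: map[string][]byte{}} }

func (s *store) Put(key string, tarball []byte) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.blobs[key] = tarball
}

func (s *store) Get(key string) ([]byte, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	b, ok := s.blobs[key]
	return b, ok
}

// ServeHTTP maps PUT /<key> to Put and GET /<key> to Get.
func (s *store) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	key := strings.TrimPrefix(r.URL.Path, "/")
	switch r.Method {
	case http.MethodPut:
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		s.Put(key, body)
		w.WriteHeader(http.StatusNoContent)
	case http.MethodGet:
		blob, ok := s.Get(key)
		if !ok {
			http.NotFound(w, r) // cache miss
			return
		}
		w.Write(blob)
	default:
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
	}
}

func main() {
	srv := httptest.NewServer(newStore())
	defer srv.Close()

	// Store a (fake) tarball under a made-up input hash.
	req, err := http.NewRequest(http.MethodPut, srv.URL+"/inputhash123", strings.NewReader("tarball bytes"))
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	resp.Body.Close()

	// Fetch it back by the same key.
	resp, err = http.Get(srv.URL + "/inputhash123")
	if err != nil {
		panic(err)
	}
	body, _ := io.ReadAll(resp.Body)
	resp.Body.Close()
	fmt.Println(resp.StatusCode, string(body))
}
```

Because the server never inspects the blob, the key can be any string; that is the main difference from a CAS, which requires the key to equal the hash of the content.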

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had any recent activity in the past 90 days. It will be closed if no further activity occurs. If you require additional support, please reply to this message. Thank you for your contributions.

sagikazarmark commented 3 years ago

I created a little something for using an object store as cache storage in CI (like GitHub Actions): https://github.com/sagikazarmark/blob-proxy

towe75 commented 2 years ago

@robxu9 Please check the new command-driven cache from #2234. It's already released and documented. It allows straightforward, simple integration with various blob and non-blob stores.

robxu9 commented 2 years ago

@towe75 This looks awesome! We can close this issue, and I'll go ahead and try that out! Thank you for contributing that!

Tatskaari commented 2 years ago

Yeah agreed. A flexible solution that should prove very useful :D