pantsbuild / binaries

A temporary home for pants static binaries and scripts

repo is exceeding soft size quota #28

Closed jsirois closed 6 years ago

jsirois commented 7 years ago

Observed when merging #27.

See the email screenshot below:

[screenshot: pantsbuild-binaries-disk-full]

This is backed up by https://help.github.com/articles/what-is-my-disk-quota/

Although we no longer rely on gh-pages rendering to create a site for BinaryUtil, it behooves us to switch to LFS, since this is the first-class mechanism GitHub supports for storing large files.

jsirois commented 7 years ago

If LFS is used, we may need alternative solutions like https://github.com/meltingice/git-lfs-s3. That, though, stores blobs by OID, so every binary we already serve from S3 would be stored twice. Ideally we'd have a git-lfs server / custom transfer protocol pair that would store the LFS objects directly at their expected serving path, eliminating the 2x storage cost as well as the need for a script like sync-s3.sh. Either way we'd need to run a git-lfs server on EC2, and if we want to de-dup / direct-upload via git commits, we'd need to implement a custom git-lfs server and transfer agent. So some work here.
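For concreteness, a minimal sketch of the client side of such a migration, assuming a hypothetical custom server at lfs.example.pantsbuild.org and that the large files live under build-support/:

$ git lfs install                          # one-time per machine
$ git config -f .lfsconfig lfs.url https://lfs.example.pantsbuild.org/binaries
$ git lfs track "build-support/**"         # writes patterns to .gitattributes
$ git add .lfsconfig .gitattributes

The custom server / transfer agent that stores objects at their serving path (rather than by OID) would still need to be built; the above only redirects where the client sends LFS objects.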

kwlzn commented 7 years ago

I wonder if some binary pruning + something like https://rtyley.github.io/bfg-repo-cleaner/ or other forms of history pruning might help in the short term?

unclear how much of this ~5G in .git is currently necessary:

[omerta binaries (kwlzn/watchman_statedir)]$ du -sh build-support/
6.6G    build-support/
[omerta binaries (kwlzn/watchman_statedir)]$ du -sh .git
4.9G    .git
benjyw commented 7 years ago

I've used BFG, it's pretty cool, and way faster than other git history rewriters.


jsirois commented 7 years ago

Seems like it's worth a shot, although I'd be surprised if this were effective, since the blobs have nearly all been single-version binaries committed exactly once.

benjyw commented 7 years ago

It can at least show us what the largest blobs are.

kwlzn commented 7 years ago

seems like it'd have to be the combination of pruning some old versions in gh-pages and running bfg to prune objects above ~300k (i.e. not scripts) from the git history. aka the joys of utilizing git as a binary store w/ limited storage.
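A minimal sketch of that prune with BFG, following its documented workflow (run against a fresh mirror clone; the 300K threshold comes from the comment above):

$ git clone --mirror https://github.com/pantsbuild/binaries.git binaries.git
$ java -jar bfg.jar --strip-blobs-bigger-than 300K binaries.git
$ cd binaries.git
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive

Note this rewrites history, so it would require a force-push and fresh clones by anyone with an existing checkout.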

seems to me like go is the biggest space consumer by far:

$ du -sh *
5.5G    go
372M    node
146M    protobuf
 48M    ragel
 29M    stack
404M    thrift
 36M    watchman
 44M    yarnpkg

so maybe we could thin that out without terribly affecting old customers? or maybe look at request stats to see which paths are least used?

kwlzn commented 7 years ago

(and ftr I'm all for the git-lfs approach in the mid/long term - just thinking about topical relief here)

stuhood commented 6 years ago

A few things from recent discussions:

1) if we trust S3, then there is no need to actually commit any of the binaries in git anymore.
2) we need to do something to encourage folks to create their own mirrors of S3, rather than hitting our hosting directly.
3) there is a strong potential alignment with the Snapshots that back remote process execution: we might be able to migrate this storage to that backend, and use that as a way to encourage creating your own writable mirror.

stuhood commented 6 years ago

With regard to items 2 and 3 above: the idea that has been coalescing is to lazily mirror the BinaryUtil entries into a user's remote execution CAS (which will be the medium-term replacement for the buildcache service). The upstream BinaryUtil store would then be a base/default read-only CAS instance, which would naturally end up cloned into a user's read/write CAS, removing load from upstream.

stuhood commented 6 years ago

One additional requirement is the ability to do a complete sync of the store to another location (to support the "our CI environment does not have access to the internet" case).
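Assuming the store stays on (or is mirrored through) S3, a complete sync to another location is already covered by the AWS CLI; bucket and path names here are illustrative:

$ aws s3 sync s3://binaries.pantsbuild.org s3://my-company-binaries-mirror   # S3-to-S3 mirror
$ aws s3 sync s3://binaries.pantsbuild.org /mnt/airgapped/binaries           # local copy for offline CI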

stuhood commented 6 years ago

I'm going to open a PR to alter the suggested workflow to do item 1 from https://github.com/pantsbuild/binaries/issues/28#issuecomment-358387841, i.e. to no longer commit the binaries to the repo.

In order to allow for some disaster recovery here, the PR will also suggest that we turn on versioning for the S3 bucket in which we store the binaries.
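For reference, enabling versioning is a single AWS CLI call (bucket name illustrative):

$ aws s3api put-bucket-versioning \
    --bucket binaries.pantsbuild.org \
    --versioning-configuration Status=Enabled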