
RFC: Shorten the storage lifetime of non-master builds on rust-lang-ci2 #13

Open kennytm opened 6 years ago

kennytm commented 6 years ago

This is about the item "Delete failed deployment faster, probably via object-tagging the successful builds" that I've put on the road-map document; I just want to write down what I mean there. It's not really important though, since storage isn't the most costly item AFAIK.


Summary

Adjust the S3 deletion policy so that non-master builds are actually deleted in 30 days, allowing us to extend the retention of real master builds from 90 days (13 weeks) to 133 days (19 weeks) at a similar storage cost.

Motivation

As announced in https://internals.rust-lang.org/t/updates-on-rusts-ci-uploads/6062, we currently implement deletion policies keyed on key prefix: builds under rustc-builds/ and rustc-builds-alt/ are deleted after 90 days, and a separate rule is supposed to delete try builds after 30 days.

However, all try builds are currently pushed to rustc-builds and rustc-builds-alt as well, so the last rule never matches anything and is effectively useless.

Furthermore, due to test errors (whether legitimate or spurious), a lot of builds are in fact wasted.

As of 2018-03-29, there are 854 builds in the rust-lang-ci2 bucket. Of these, 498 builds are not part of the master branch, and of those, at least 211 should be failed builds.

| Type | Count |
| --- | --- |
| Successful master builds | 356 |
| Try builds | ≤287 |
| Failed master builds | ≥211 |

Script to extract the information:

1. Download the list:

   ```sh
   curl -o 1.xml 'http://s3-us-west-1.amazonaws.com/rust-lang-ci2/?prefix=rustc-builds/&delimiter=/'
   ```

2. Analyze the list (assumes a checkout of rust-lang/rust in `./rust`):

   ```python
   from xml.etree.ElementTree import fromstring
   import subprocess

   x = fromstring(open('1.xml').read())
   # Each <Prefix> looks like "rustc-builds/<40-char sha1>/"; slice out the sha1.
   sha1 = [y.text[-41:-1] for y in x.iterfind('.//{http://s3.amazonaws.com/doc/2006-03-01/}Prefix')]
   valids = []
   invalids = []
   for i, s in enumerate(sha1):
       # A sha1 that cannot be resolved in the local rust checkout was never
       # merged into master, i.e. it belongs to a try or failed build.
       if subprocess.run(['git', 'rev-parse', '--quiet', '--verify', s + '^{commit}'],
                         cwd='rust').returncode != 0:
           invalids.append(s)
       else:
           valids.append(s)
       print(i, '/', len(sha1), '=', s, '->', len(invalids), 'vs', len(valids))
   ```

If we could apply the 30-day policy to all of these 498 builds, we could free up about 40% of the space, and use that freed space to store the legitimate builds longer, up to 133 days.
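
To make the arithmetic explicit, here is a rough sketch of where the 40% and the 133 days come from. It assumes the average artifact sizes from the Cost section below (~12 GiB per normal build, ~1.8 GiB per try build), and that failed master builds upload full-size artifact sets:

```python
# Build counts from the 2018-03-29 snapshot; sizes in GiB (assumed averages).
master, failed, tries = 356, 211, 287
master_sz, try_sz = 12.0, 1.8

non_master = failed * master_sz + tries * try_sz   # ~3,049 GiB
total = master * master_sz + non_master            # ~7,321 GiB

# Expiring non-master builds after 30 instead of 90 days removes 2/3 of
# them from the bucket at any given time: ~28% of the bytes, or ~40% when
# counting builds (498 * 2/3 out of 854).
saved = non_master * (1 - 30 / 90)

# Spend the savings on master builds instead: solve
#   master_bytes * d/90 + non_master_bytes * 30/90 = total_bytes
d = (total - non_master * 30 / 90) * 90 / (master * master_sz)
print(round(d))  # -> 133
```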

Extending the 90-day limit is useful because it enables bisecting to older versions.

Explanation

AWS S3's Object Lifecycle Management supports expiring objects filtered by key prefix (which is what we do now) or by object tags. We can tag the master builds to get a longer expiration period (133 days), and then the default policy can be shortened to 30 days.

Lifecycle configuration

Ideal configuration:

```xml
<LifecycleConfiguration>
  <Rule>
    <ID>ExpireTryBuilds</ID>
    <Filter>
      <Prefix>rustc-builds/</Prefix>
    </Filter>
    <Status>Enabled</Status>
    <Expiration><Days>30</Days></Expiration>
  </Rule>
  <Rule>
    <ID>ExpireAltTryBuilds</ID>
    <Filter>
      <Prefix>rustc-builds-alt/</Prefix>
    </Filter>
    <Status>Enabled</Status>
    <Expiration><Days>30</Days></Expiration>
  </Rule>
  <Rule>
    <ID>ExpireMasterBuilds</ID>
    <Filter>
      <Tag>
        <Key>branch</Key>
        <Value>master</Value>
      </Tag>
    </Filter>
    <Status>Enabled</Status>
    <Expiration><Days>133</Days></Expiration>
  </Rule>
</LifecycleConfiguration>
```

This assumes conflicting filters will be resolved in the way we desire. If not, we may need to ensure they are non-overlapping by tagging everything. The S3 documentation is unclear about this.

Non-conflicting configuration:

```xml
<LifecycleConfiguration>
  <Rule>
    <ID>ExpireNonMasterBuilds</ID>
    <Filter>
      <Tag>
        <Key>branch</Key>
        <Value>non-master</Value>
      </Tag>
    </Filter>
    <Status>Enabled</Status>
    <Expiration><Days>30</Days></Expiration>
  </Rule>
  <Rule>
    <ID>ExpireMasterBuilds</ID>
    <Filter>
      <Tag>
        <Key>branch</Key>
        <Value>master</Value>
      </Tag>
    </Filter>
    <Status>Enabled</Status>
    <Expiration><Days>133</Days></Expiration>
  </Rule>
</LifecycleConfiguration>
```

Object tagging

```
PUT /rustc-builds/fff31afac8f535a5e281219e807f8bc7290b3536/cargo-nightly-x86_64-apple-darwin.tar.gz?tagging

<Tagging>
  <TagSet>
    <Tag>
      <Key>branch</Key>
      <Value>non-master</Value>
    </Tag>
  </TagSet>
</Tagging>
```

Object tags can be assigned when uploading the object using the X-Amz-Tagging header, but this does not seem to be supported by Travis's or AppVeyor's S3 deployment configuration. Nevertheless, we could update the tags from the after_deploy: or deployment_success: scripts.

Object tagging does not appear to support tagging all objects under a prefix in one request, meaning we need to list all objects and tag each one individually.
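
As a sketch of what that loop could look like, using boto3 (the tag_build helper is hypothetical; the bucket name and tag values are the ones proposed above):

```python
import boto3

s3 = boto3.client('s3')
BUCKET = 'rust-lang-ci2'

def tag_build(sha, branch):
    """Tag every object under rustc-builds/<sha>/ with branch=<branch>."""
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=BUCKET, Prefix='rustc-builds/' + sha + '/'):
        for obj in page.get('Contents', []):
            s3.put_object_tagging(
                Bucket=BUCKET,
                Key=obj['Key'],
                Tagging={'TagSet': [{'Key': 'branch', 'Value': branch}]},
            )

# From after_deploy:        tag_build(commit_sha, 'non-master')
# After a merge to master:  tag_build(commit_sha, 'master')
```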

Tagging to branch=master should be performed after a commit is merged into master. This can be done by running a Travis CI job alongside tool-state-update.

Cost

Object tagging is not free, but should be significantly cheaper than the actual cost of uploading the artifacts.

Currently, we generate 511 artifacts for every normal build, and 44 for every try build. Object tagging costs 1¢ per 10,000 tags, so every build will cost 0.05¢ on failure (tagged once, as non-master) and 0.10¢ on success (tagged again as master after the merge).

An alternative is to move the successful objects to another directory (aws s3 mv s3://... s3://...). A move is implemented as a COPY followed by a DELETE, which cost 0.55¢ per 1,000 requests and nothing, respectively. So every build will be free on failure and cost 0.28¢ on success.

Given we have a 40% "failure rate", the average additional cost per build would be 0.08¢ with object tagging and 0.17¢ with moving.

Compare this to the storage cost. Every normal build generates ~12 GiB of artifacts, and every try build ~1.8 GiB. Storage costs about 2.5¢/GB per month, so every normal build costs around 30¢ per month to keep. The tagging/moving cost is basically nothing.
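
For reference, the arithmetic behind these per-build figures (a sketch using the prices quoted above, and treating GiB ≈ GB):

```python
# Per-build request costs, in cents.
artifacts = 511                 # artifacts per normal build
tag_cost = 1 / 10_000           # ¢ per tag
copy_cost = 0.55 / 1_000        # ¢ per COPY request; DELETE is free

tag_fail = artifacts * tag_cost        # ~0.05¢: tagged once (non-master)
tag_ok   = 2 * artifacts * tag_cost    # ~0.10¢: re-tagged as master
move_ok  = artifacts * copy_cost       # ~0.28¢: one COPY per artifact

fail_rate = 0.4
print(fail_rate * tag_fail + (1 - fail_rate) * tag_ok)  # ~0.08¢ (tagging)
print((1 - fail_rate) * move_ok)                        # ~0.17¢ (moving)

# Versus storage: ~12 GiB per normal build at ~2.5¢/GB/month.
print(12 * 2.5)                                         # ~30¢ per month
```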

Compared to moving, object tagging does not change the artifact URLs, so it is preferable for keeping existing tools working.

Drawbacks

The engineering involved is significantly more complicated.

Alternatives

Do nothing and keep the 90-day limit. This is too short for bisection though: we release a new version every 42 days, so 90 days is not enough to fully cover 3 versions. If a regression cannot be tracked down within 6 days after a stable version is released, we will never be able to get a precise regression.

Just extend the limit to 168 days. Assuming we produce 12 builds per day, this means an additional $281/month in storage cost (see the sketch at the end of this section for the arithmetic).

Transition builds to Standard Infrequent Access after 30 days, and delete after 109 days. Currently all objects are stored using the STANDARD storage class; STANDARD_IA is cheaper to store (1.9¢/GB per month). However, it is not really that much cheaper, so the number of extra days it buys us is not impressive.

Expire failed builds after 7 days, try builds after 30 days, and normal builds after 147 days. We could further differentiate failed builds and successful try builds, which could extend the normal expiry duration even more.
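
A rough sketch of the numbers behind these alternatives (assuming GiB ≈ GB and the 2.5¢/GB/month STANDARD price used above):

```python
# "Just extend to 168 days": extra steady-state storage cost.
builds_per_day = 12
extra_days = 168 - 90
gb_per_build = 12
standard = 0.025                     # $/GB/month, STANDARD class
print(builds_per_day * extra_days * gb_per_build * standard)  # ~$281/month

# STANDARD_IA after 30 days: how long can a build live for the same money
# as 90 days of STANDARD?
ia = 0.019                           # $/GB/month, STANDARD_IA class
ia_days = (90 - 30) * standard / ia  # ~79 days in IA
print(30 + ia_days)                  # ~109 days total
```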

Unresolved questions

None yet

alexcrichton commented 6 years ago

Thanks for all the analysis here @kennytm! Right now the numbers are chosen fairly arbitrarily without much attention paid to cost, so I think we could probably just pick the ideal time limit and go ahead and set that. Along those lines, what do you think the ideal limit for these builds expiring would be?

kennytm commented 6 years ago

@alexcrichton I prefer 168 days (24 weeks), which covers 4 cycles and thus ensures we can always get a precise regression across an entire stable version.

alexcrichton commented 6 years ago

Ok! I've configured 168 days on the rust-lang-ci2 bucket, and if cost becomes an issue we can certainly revisit!