kennytm opened this issue 6 years ago (status: open)
Thanks for all the analysis here @kennytm! Right now the numbers are chosen fairly arbitrarily without much attention paid to cost, so I think we could probably just pick the ideal time limit and go ahead and set that. Along those lines, what do you think the ideal limit for these builds expiring would be?
@alexcrichton I prefer 168 days (24 weeks), which covers 4 release cycles and thus ensures we can precisely bisect a regression across an entire stable version.
Ok! I've configured 168 days on the rust-lang-ci2 bucket, and if cost becomes an issue we can certainly revisit!
This is about the item "Delete failed deployment faster, probably via object-tagging the successful builds" that I've put on the roadmap document; I just want to write down what I mean there. It's not really important though, since storage isn't the most costly item AFAIK.
Summary
Adjust the S3 deletion policy so that non-master builds are actually deleted after 30 days, allowing us to lengthen the retention of real master builds from 90 days (≈13 weeks) to 133 days (19 weeks) at a similar storage cost.
Motivation
As announced in https://internals.rust-lang.org/t/updates-on-rusts-ci-uploads/6062, we currently implement the following deletion policies:
- `rustc-builds` and `rustc-builds-alt`: delete after 90 days
- `rustc-builds-try`: delete after 30 days

However, all try builds are currently pushed to `rustc-builds` and `rustc-builds-alt` as well, meaning the last rule is useless. Furthermore, due to test errors (whether legitimate or spurious), a lot of builds are in fact wasted.
As of 2018-03-29, there are 854 builds on the `rust-lang-ci2` bucket. Of these, 498 builds are not part of the `master` branch, and of those, ≥211 should be failed builds.

Script to extract the information:

1. Download the list of builds:

   ```sh
   curl -o 1.xml 'http://s3-us-west-1.amazonaws.com/rust-lang-ci2/?prefix=rustc-builds/&delimiter=/'
   ```

2. Analyze the list (assumes a local clone of rust-lang/rust in `./rust`):

   ```python
   from xml.etree.ElementTree import fromstring
   import subprocess

   x = fromstring(open('1.xml').read())
   # Each <Prefix> entry looks like "rustc-builds/<40-char-sha>/"; extract the SHA-1.
   sha1 = [y.text[-41:-1] for y in x.iterfind('.//{http://s3.amazonaws.com/doc/2006-03-01/}Prefix')]
   valids = []
   invalids = []
   for i, s in enumerate(sha1):
       # Commits that cannot be resolved in the local checkout never landed on a
       # fetched branch, i.e. they are not part of master.
       if subprocess.run(['git', 'rev-parse', '--quiet', '--verify', s + '^{commit}'],
                         cwd='rust', stdout=subprocess.DEVNULL).returncode != 0:
           invalids.append(s)
       else:
           valids.append(s)
       print(i, '/', len(sha1), '=', s, '->', len(invalids), 'vs', len(valids))
   ```

If we could apply the 30-day policy to all these 498 builds, we would free up about 40% of the space, which means we could use that space to store the legitimate builds longer, up to 133 days.
Extending the 90-day limit is useful because it enables bisection to older versions.
Explanation
AWS S3's Object Lifecycle Management supports expiring objects filtered by key prefix (which is what we do now) or by object tags. We can tag the master builds so that they get a longer expiration period (133 days), and then the default policy can be shortened to 30 days.
Lifecycle configuration
Ideal configuration
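The original XML for this configuration is not preserved in this copy; below is a reconstructed sketch, assuming master builds carry a `branch=master` tag (the tag key, rule IDs, and the single `rustc-builds/` prefix are illustrative assumptions; `rustc-builds-alt/` would need analogous rules):

```xml
<LifecycleConfiguration>
  <!-- Default rule: everything under rustc-builds/ expires after 30 days. -->
  <Rule>
    <ID>expire-all-builds</ID>
    <Filter><Prefix>rustc-builds/</Prefix></Filter>
    <Status>Enabled</Status>
    <Expiration><Days>30</Days></Expiration>
  </Rule>
  <!-- Builds tagged branch=master are kept for 133 days; this relies on the
       tag-based rule winning over the overlapping 30-day prefix rule. -->
  <Rule>
    <ID>keep-master-builds</ID>
    <Filter>
      <And>
        <Prefix>rustc-builds/</Prefix>
        <Tag><Key>branch</Key><Value>master</Value></Tag>
      </And>
    </Filter>
    <Status>Enabled</Status>
    <Expiration><Days>133</Days></Expiration>
  </Rule>
</LifecycleConfiguration>
```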
This assumes conflicting filters are resolved in the way we desire, i.e. that the longer, tag-based expiration takes precedence for master builds. If not, we may need to ensure the filters are non-overlapping by tagging everything. The S3 documentation is unclear about this.
Non-conflicting configuration
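Again, the original XML is not preserved here; a sketch of a non-conflicting variant, assuming every object is tagged at upload (e.g. `branch=other` for everything that is not a master build, a made-up value) so the two filters never overlap:

```xml
<LifecycleConfiguration>
  <!-- Non-master builds (tagged branch=other) expire after 30 days. -->
  <Rule>
    <ID>expire-non-master-builds</ID>
    <Filter>
      <And>
        <Prefix>rustc-builds/</Prefix>
        <Tag><Key>branch</Key><Value>other</Value></Tag>
      </And>
    </Filter>
    <Status>Enabled</Status>
    <Expiration><Days>30</Days></Expiration>
  </Rule>
  <!-- Master builds (tagged branch=master) are kept for 133 days. -->
  <Rule>
    <ID>keep-master-builds</ID>
    <Filter>
      <And>
        <Prefix>rustc-builds/</Prefix>
        <Tag><Key>branch</Key><Value>master</Value></Tag>
      </And>
    </Filter>
    <Status>Enabled</Status>
    <Expiration><Days>133</Days></Expiration>
  </Rule>
</LifecycleConfiguration>
```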
Object tagging
Object tags can be assigned when uploading the object using the `X-Amz-Tagging` header, but this does not seem to be supported by Travis's or AppVeyor's S3 deployment rules. Nevertheless, we could update the tags using the `after_deploy:` or `deployment_success:` scripts.

It seems object tagging cannot be applied to all objects under a prefix in a single request, meaning we need to list all objects and tag each one individually (see the sketch below).
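A minimal sketch of that tagging step using the AWS CLI, assuming the `branch=master` tag from above and the existing `rustc-builds/<sha>/` layout (the script name and invocation are made up for illustration; `rustc-builds-alt/` would need the same loop):

```sh
#!/bin/sh
# Usage (hypothetical): ./tag-master-build.sh <merged-commit-sha>
# Tags every artifact of one commit as branch=master, one request per object.
BUCKET=rust-lang-ci2
COMMIT="$1"

aws s3api list-objects-v2 \
    --bucket "$BUCKET" \
    --prefix "rustc-builds/$COMMIT/" \
    --query 'Contents[].Key' --output text |
tr '\t' '\n' |
while read -r key; do
    aws s3api put-object-tagging \
        --bucket "$BUCKET" --key "$key" \
        --tagging 'TagSet=[{Key=branch,Value=master}]'
done
```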
Tagging with `branch=master` should be performed after a commit is merged into master. This can be done by running a Travis CI job alongside the tool-state-update job.

Cost
Object tagging is not free, but should be significantly cheaper than the actual cost of uploading the artifacts.
Currently, we generate 511 artifacts for every normal build and 44 for every try build. Object tagging costs 1¢ per 10,000 tags, so every build will cost 0.05¢ on failure and 0.10¢ on success.
An alternative is to move the successful objects to another directory (`aws s3 mv s3://... s3://...`). Moving is implemented as a COPY followed by a DELETE, which cost 0.55¢ per 1000 requests and nothing, respectively. So every build will be free on failure and cost 0.28¢ on success.

Given that we have a ~40% "failure rate", the average additional cost per build would be 0.08¢ for object tagging and 0.17¢ for moving.
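A quick back-of-the-envelope check of those averages, using only the per-build figures quoted above (a sketch, not a billing calculation):

```python
# Rough check of the per-build averages above, in cents.
failure_rate = 0.40                      # ~40% of builds never reach master

# Object tagging: 0.05¢ on a failed build, 0.10¢ on a successful one.
tagging = failure_rate * 0.05 + (1 - failure_rate) * 0.10               # ≈ 0.08¢

# Moving: free on failure; 511 COPY requests at 0.55¢ per 1000 on success.
moving = failure_rate * 0.0 + (1 - failure_rate) * (511 * 0.55 / 1000)  # ≈ 0.17¢

print(f'tagging ≈ {tagging:.2f}¢/build, moving ≈ {moving:.2f}¢/build')
```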
Compare this with the cost of storing the uploads. Every normal build generates ~12 GiB of artifacts, and every try build ~1.8 GiB. Storage costs about 2.5¢ per GB per month, so every normal build costs about 30¢ per month to keep. The tagging/moving cost is basically nothing in comparison.
Compared with moving, object tagging does not change the artifact URLs, so it is preferable for keeping existing tools working.
Drawbacks
The engineering involved is significantly more complicated.
Alternatives
- Do nothing and keep the 90-day limit. This is too short for bisection, though: we release a new version every 42 days, so 90 days is not enough to cover 3 versions entirely. If a regression cannot be found within 6 days after a stable version is released, we will never be able to pin it down to a precise commit.
- Just extend the limit to 168 days. Assuming we produce 12 builds per day, this means roughly an additional $281/month in storage cost (78 extra days × 12 builds/day × ~12 GiB × 2.5¢ per GB-month).
- Transition builds to Standard-Infrequent Access after 30 days, and delete them after 109 days. Currently all objects use the `STANDARD` storage class; `STANDARD_IA` is cheaper to store (1.9¢/GB). However, it is not that much cheaper, so the number of extra days it buys is not impressive.
- Expire failed builds after 7 days, try builds after 30 days, and normal builds after 147 days. We could further differentiate failed builds from successful try builds, which could extend the normal expiry duration even more.
Unresolved questions
None yet