iclanton opened this issue 3 years ago
I love this, because I've tried to implement this twice myself, via docker registry images and also npm packages. In the end, it would fit much better within rush.
Maybe it would be a good idea to provide some kind of hooks in a js file so that people can upload/download the cache files in whatever way they choose? The connection string is a little hard to apply for more than a few use cases, unless I'm missing something.
@wbern

> Maybe it would be a good idea to provide some kind of hooks in a js file so that people can upload/download the cache files in whatever way they choose?
I've thought about that. We've put together a pretty flexible plugin system in Heft, but so far I haven't been able to come up with a way that would cleanly work in Rush. Maybe a `common/config/rush/plugins.json` file that lists package names + versions, installs them, and calls an `apply` function when Rush boots up? That'd mean that plugins must be published to a package feed, which is unfortunate. It also potentially gets confusing when someone is using Rush's API. I don't love just plain JS hooks, because this is the kind of thing that should probably be versioned, and plugin authors should have a typesafe interface to code against. What are your thoughts here?
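As a rough sketch of what I mean (the file shape and names here are hypothetical, just for discussion):

```
/* common/config/rush/plugins.json (hypothetical) */
{
  "plugins": [
    {
      /**
       * An NPM package that Rush would install and whose "apply" function
       * would be called when Rush boots up.
       */
      "packageName": "@example/rush-custom-cache-plugin",
      "version": "1.0.0"
    }
  ]
}
```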
> The connection string is a little hard to apply for more than a few use cases, unless I'm missing something.
Azure Storage's connection strings are a simple way to specify a storage endpoint (i.e. an account name) and an access token. I think we'll want to support some sort of interactive and AAD authentication, and probably some way to pass credentials in via an environment variable. I think those will be features that we'll need to figure out after some community feedback, because what works for my team at Microsoft is almost definitely not going to work for everyone else. What are your concerns here? It'd be good to try to be aware of other scenarios as early as possible.
> I don't love just plain JS hooks because this is the kind of thing that should probably be versioned and plugin authors should have a typesafe interface to code against. What are your thoughts here?
I don't know if this is a bad idea, or too much work, but what about this: each project defines package.json scripts that Rush invokes with parameters (e.g. `npm run cache-script -- --action="download"`), or via some environment variables, to perform the upload/download. This is the trade-off for escaping the limitation of plain .js files (the projects are now handling the runtime instead). You could also have a separate package.json script for each action/hook type, to avoid the parameter/env passing.
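For example, the per-action variant could look something like this (the script and file names are just illustrative):

```
/* package.json of a project */
{
  "scripts": {
    "cache-upload": "node ./scripts/cache.js --action upload",
    "cache-download": "node ./scripts/cache.js --action download"
  }
}
```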
> I think those will be features that we'll need to figure out after some community feedback because what works for my team at Microsoft is almost definitely not going to work for everyone else. What are your concerns here? It'd be good to try to be aware of other scenarios as early as possible.
My org uses Amazon S3 to store artifacts currently, and we intend to use the S3 buckets to serve the files through a CDN (Cloudfront), so that is one thing to consider.
We also have a widely established method of uploading complete .tar.gz files of each monorepo project to Artifactory, which we use to later serve the files through a CMS (unpacked). I'm not very familiar with the details you mention (like AAD authentication), so I'm not sure how these components would fit your proposal and ideas.
Launching a separate process will probably impact perf for a 100% cached repo quite a bit. I also don't think the cache provider should be configurable per-project.
> My org uses Amazon S3 to store artifacts currently, and we intend to use the S3 buckets to serve the files through a CDN (Cloudfront), so that is one thing to consider.
Oh yeah I absolutely want to also support S3 in addition to Azure Storage. I want to get that in soon after the initial implementation is checked in. The official AWS package is pretty enormous, so we'll probably want to call its APIs via REST, or use a trimmed down package.
> We also have a widely established method of uploading complete .tar.gz files of each monorepo project to Artifactory, which we use to later serve the files through a CMS (unpacked). I'm not very familiar with the details you mention (like AAD authentication), so I'm not sure how these components would fit your proposal and ideas.
Would you want to use Artifactory as your cache store? That sounds more like a post-build `rush publish` than it does like a build cache.
> Launching a separate process will probably impact perf for a 100% cached repo quite a bit. I also don't think the cache provider should be configurable per-project.
That's a very good point!
Ok, if S3 is getting first-class support, then that's great! Actually, that's what a colleague of mine mentioned; I wasn't sure if you'd be up for that, so that's great to hear.
Regarding Artifactory: yes, at first I thought it could be one and the same, but build caches probably deserve their own storage solution.
Here's my initial feedback:

> The store can be in the `common/config/temp` folder, a remote object store like Azure Storage, some machine-wide shared store, or a combination of those

- How does the "combination" of caches work? When installing, would Rush consult the caches in a specific order? In the proposed config file, `"cacheProvider": "azure-storage"` is a single setting, not an array.

`common/config/rush/build-cache.json`

```
{
  "cacheProvider": "filesystem",
  /**
   * A list of folder names under each project root that should be cached. These folders should not be tracked by git.
   */
  "projectOutputFolderNames": ["lib", "dist"]
}
```

- The `"projectOutputFolderNames"` seem like an inherent property of the toolchain, not the monorepo. It seems like they should be specified in each project folder, via `rig.json` typically. It may seem convenient for `common/config/rush/build-cache.json` to specify "defaults", but that would mean that a project's configuration cannot be correctly interpreted without considering the monorepo config files, which is confusing, and incompatible with the idea of a rig being an NPM package. (This is the reason why we eliminated Heft's global config files.)

```
/**
 * A connection string for accessing the Azure storage account.
 */
"connectionString": "BlobEndpoint=https://example-account.blob.core.windows.net/;SharedAccessSignature=<...SAS token...>",
```

- Is `<...SAS token...>` a security credential? If so, it seems odd to be storing such credentials in the Git repo. If all the developers will reuse the same token to access a shared container, maybe they should put it in a non-Git place like `~/.rush/tokens.json`. The ideal obviously would be to add users to a security group that has write permission to the container, so that each user has their own token. For example, if someone leaves the company, this way you could revoke their access without forcing everyone else to update their token. Does Azure support that? AWS does.

> The connection string is a little hard to apply for more than a few use cases, unless I'm missing something.

- I agree with @wbern. It would be better for our JSON schema to have separate config blocks for each supported provider, so that you get nice IntelliSense and doc comments for `"azureServerUrl"` and `"azureOtherThing"` versus `"awsContainerName"` etc. Munging everything into a text string makes it harder to understand what the options are. Also the validation and error reporting for malformed "connection strings" is typically not very good.

- Symlinks cannot be stored easily in a tarball. Rush should probably report an error if a symlink is encountered while archiving a file.
@octogonz - here are some answers to your questions:
> How would this interact with the multiphase model that you proposed in #2300? Would we cache each phase as a separate tarball? Or would only certain compositions of phases be cached? We don't need to implement that in the MVP of course, but it would be helpful for our design to consider it, so we don't need to overhaul the build cache config files when the multiphase feature is introduced.
I've thought about this. I think in the MVP of the multiphase feature, we'll just cache the output of all of each project's phases, and consider all of the project's phases complete in the event of a cache hit. Because both of these features will be released as opt-in "experiments," we can tweak this design as we get more experience with it.
> How does the "combination" of caches work? When installing, would Rush consult the caches in a specific order? In the proposed config file, `"cacheProvider": "azure-storage"` is a single setting, not an array.
Right now the MVP supports only a single cache provider. It'd be pretty easy to change the "filesystem" provider to be an independent option (something like `"enableLocalRepoCache": true`) that is queried before the other configured provider. I'd like to introduce a per-user global settings file that would allow someone to point their local cache at a system-wide location, so multiple clones of the same repo may share a cache, but I think that's out of scope for the MVP.
> The `"projectOutputFolderNames"` seem like an inherent property of the toolchain, not the monorepo. It seems like they should be specified in each project folder, via `rig.json` typically. It may seem convenient for `common/config/rush/build-cache.json` to specify "defaults", but that would mean that a project's configuration cannot be correctly interpreted without considering the monorepo config files, which is confusing, and incompatible with the idea of a rig being an NPM package. (This is the reason why we eliminated Heft's global config files.)
I think I agree here: it probably should be an option in the per-project config, and those should have rig support. I think I'd like to make that change in a follow-up PR after the initial check-in. It should be a simple change, but it'll be easier to reason about as a standalone change.
> Is `<...SAS token...>` a security credential? If so, it seems odd to be storing such credentials in the Git repo. If all the developers will reuse the same token to access a shared container, maybe they should put it in a non-Git place like `~/.rush/tokens.json`.
Yes, it is a security credential. I've removed this property and, instead, created a file called `~/.rush/build-cache-credentials-cache.json` and a `rush update-build-cache-credentials` command that allows users to explicitly specify credentials or to go through an interactive login flow.
> The ideal obviously would be to add users to a security group that has write permission to the container, so that each user has their own token. For example, if someone leaves the company, this way you could revoke their access without forcing everyone else to update their token. Does Azure support that? AWS does.
Yes, Azure supports that.
> I agree with @wbern. It would be better for our JSON schema to have separate config blocks for each supported provider, so that you get nice IntelliSense and doc comments for `"azureServerUrl"` and `"azureOtherThing"` versus `"awsContainerName"` etc. Munging everything into a text string makes it harder to understand what the options are. Also the validation and error reporting for malformed "connection strings" is typically not very good.
I've updated the design to stick the Azure-specific stuff in an `azureBlobStorageConfiguration` property.
I've made some updates to the design after reviewing feedback.
common/config/rush/build-cache.json
I've updated the schema of this file to drop the connection string property and move all of the provider-specific settings into a provider-specific property. The filesystem version of the file looks the same:
```
{
  "cacheProvider": "filesystem",
  /**
   * A list of folder names under each project root that should be cached. These folders should not be tracked by git.
   */
  "projectOutputFolderNames": ["lib", "dist"]
}
```
The Azure Storage provider of the file looks like this now:
```
{
  "cacheProvider": "azure-blob-storage",
  /**
   * A list of folder names under each project root that should be cached. These folders should not be tracked by git.
   */
  "projectOutputFolderNames": ["lib", "dist"],
  "azureBlobStorageConfiguration": {
    /**
     * The name of the Azure storage account to use for build cache.
     */
    "storageAccountName": "example-account",
    /**
     * The name of the container in the Azure storage account to use for build cache.
     */
    "storageContainerName": "files",
    /**
     * (OPTIONAL) The Azure environment the storage account exists in. Defaults to AzureCloud.
     */
    "azureEnvironment": "AzureCloud", // Can also be "AzureChinaCloud", "AzureUSGovernment", or "AzureGermanCloud"
    /**
     * An optional prefix for cache item blob names.
     */
    "blobPrefix": "example",
    /**
     * If set to true, allow writing to the cache. Defaults to false.
     */
    "isCacheWriteAllowed": true
  }
}
```
`<project root>/.rush/build-cache.json` ➔ `<project root>/config/rush/build-cache.json`
This file has been moved into the project's `config/rush/` folder, with the same schema.
I've introduced a `~/.rush/build-cache-credentials-cache.json` file with the following format (for an `"azure-blob-storage"` cache provider as an example):
```
{
  "cacheEntries": {
    "azure-blob-storage|AzureCloud|example-account|files|cacheWriteAllowed": {
      "expires": 1608948745090,
      "credential": "BlobEndpoint=https://example-account.blob.core.windows.net/;SharedAccessSignature=<...SAS token...>"
    }
  }
}
```
`rush update-build-cache-credentials`
I've also introduced a `rush update-build-cache-credentials` command with several (mutually exclusive) flags:
- `--credential CREDENTIAL_STRING`: This parameter is used to explicitly specify a credential.
- `--interactive` / `-i`: This flag executes an interactive login flow. For Azure Blob Storage, that looks like this:
```
Rush Multi-Project Build Tool 5.35.2 (unmanaged) - https://rushjs.io
Node.js version is 10.22.1 (LTS)

Starting "rush update-build-cache-credentials"

╔══════════════════════════════════════════════════════════════════════════════════╗
║ To sign in, use a web browser to open the page                                    ║
║ https://microsoft.com/devicelogin and enter the code ABC123XYZ to                 ║
║ authenticate.                                                                     ║
╚══════════════════════════════════════════════════════════════════════════════════╝
```
The user opens a browser, navigates to https://microsoft.com/devicelogin, provides the code, and then goes through a normal AAD login flow. When Rush hears a response from Azure, the credential is dropped into the credential cache file in the user's HOME folder.
`RUSH_BUILD_CACHE_CONNECTION_STRING` environment variable
I foresee larger repos having CI machines populate a cloud-based cache and having developers only read from the cache. To facilitate that, the connection string can be set in an environment variable. We'll need to add support for turning on the `"isCacheWriteAllowed"` option via environment as well.
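For example, a CI pipeline could do something like this (the write-enable variable name below is hypothetical; only the connection string variable is proposed here):

```
# CI machine: holds the credential and is allowed to write to the cache.
# Developers omit these variables and only read from the cache.
export RUSH_BUILD_CACHE_CONNECTION_STRING="BlobEndpoint=https://example-account.blob.core.windows.net/;SharedAccessSignature=<...SAS token...>"
export RUSH_BUILD_CACHE_WRITE_ALLOWED=1  # hypothetical name for the "isCacheWriteAllowed" override
rush build
```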
Some of this should be done in follow-up PRs:
- Make the `"filesystem"` cache provider an independent option (https://github.com/microsoft/rushstack/pull/2404)
- Generate a `build-cache.json` file when `rush init` is run
- Replace the `"projectOutputFolderNames"` property with options in project-specific config files (with extends+rig support) (https://github.com/microsoft/rushstack/pull/2403)

Really impressed with the dedication here. My org will be testing the S3 provider as soon as something is ready to test. 🙂
Merged in the initial implementation PR. Summary of changes since my last post:
- `~/.rush/build-cache-credentials-cache.json` has been renamed to `~/.rush/credentials.json`
- `rush update-build-cache-credentials` has been renamed to `rush update-cloud-credentials`
- The `RUSH_BUILD_CACHE_CONNECTION_STRING` environment variable has been renamed to `RUSH_BUILD_CACHE_CREDENTIAL`

This has been published in Rush version 5.36.1. It's behind the `"buildCache"` experiment.
I hacked together an S3 implementation (https://github.com/raihle/rushstack/tree/aws-build-cache), but there are a lot of issues with it:
- Added `<void>` type parameters to a handful of `Promise`s which resolve with no result
- Removed `async` from an `abstract` method, which can't `await` anything anyway since there is no method body
- Enabled `skipLibCheck` in tsconfig to work around some typing issues in the AWS SDK
Should I open a PR to discuss, or do we keep it here?
Finally had the chance to test out the build cache experiment.
I was hoping that, because the new mechanism includes the command being run, Rush could handle the following scenario in a smart way:
Projects a and b both depend on c. I have an additional parameter `--test` for `build` (and `rebuild`) in Rush's command-line.json. Depending on `--test`, the build output is emitted to a different subdirectory of the dist directory. When I start with a clean repo, the following happens:
```
rush build -t a --test one
rush build -t a --test one
rush build -t b --test one
rush build -t a --test two
rush build -t a --test two
rush build -t b --test one
```
The last command could be skipped, because b and c were already built with this build state, including the additional parameter `--test one`.
Update:
I've just realized that the build cache feature only kicks in if there is a rush-project.json (locally or via rig) with at least one "projectOutputFolder" (as indicated in the Rush log output...). Now the scenario from above is working, and I love it! 😍
Currently I don't see any way to use just the skip feature without actually caching the project outputs. My use case: I'm building several packages with emscripten. Based on the command-line.json parameters `--build-type` (which affects the optimization) and `--threading-model` (which includes pthread/offscreen support), a dedicated build workspace (`./workspace/${buildType}-${threadingModel}/`) and dist (`./dist/${buildType}-${threadingModel}/`) folder is used by the build script.
All dependents will include the .wasm file based on the dist folder derived from their parameters. With this structure, all output resides in the project's `workspace` and `dist` folders, and therefore there is no need for an actual cache.
I'm thinking about a "NullCache" provider which would store the state hash in an empty file. Skipping would still work, but without any copying/restoring of the build output.
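A rough sketch of the idea, assuming a provider contract shaped roughly like the cloud providers' (the interface and method names below are guesses, not Rush's actual API):

```ts
import * as fs from 'fs';
import * as path from 'path';

// Hypothetical provider contract, modeled loosely on the cloud cache providers.
interface IBuildCacheProvider {
  tryGetCacheEntryBufferByIdAsync(cacheId: string): Promise<Buffer | undefined>;
  trySetCacheEntryBufferAsync(cacheId: string, entryBuffer: Buffer): Promise<boolean>;
}

// A "NullCache" provider: records only that a cache key was built (an empty
// marker file), so skipping works without copying/restoring any build output.
export class NullBuildCacheProvider implements IBuildCacheProvider {
  public constructor(private readonly _markerFolder: string) {}

  public async tryGetCacheEntryBufferByIdAsync(cacheId: string): Promise<Buffer | undefined> {
    const markerPath: string = path.join(this._markerFolder, cacheId);
    // A hit returns an empty archive, so Rush skips the build but restores nothing.
    // (A real implementation would need this to be a valid empty tarball.)
    return fs.existsSync(markerPath) ? Buffer.alloc(0) : undefined;
  }

  public async trySetCacheEntryBufferAsync(cacheId: string, _entryBuffer: Buffer): Promise<boolean> {
    fs.mkdirSync(this._markerFolder, { recursive: true });
    // Store only the marker; the actual output tarball is discarded.
    fs.writeFileSync(path.join(this._markerFolder, cacheId), '');
    return true;
  }
}
```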
Amazon S3 support has been released in Rush 5.42.0. Many thanks to @raihle for putting this together!
@iclanton with `@microsoft/rush-lib` 5.42.0, pnpm with `strictPeerDependencies` reports the following:

```
ERROR: @microsoft/rush-lib > @aws-sdk/client-s3 > @aws-sdk/middleware-retry: react-native-get-random-values@1.6.0 requires a peer of react-native@>=0.56 but none was installed
```

I agree with @octogonz's comment; the S3 and Azure dependencies should be excluded from Rush?!
Wouldn't it be nice to have a way to provide external/custom cache implementations? The S3/Azure implementations would go to a dedicated rushstack package, and I could provide an exotic implementation (like the null cache proposed above) as a package inside the repo.
> with `@microsoft/rush-lib` 5.42.0, pnpm with `strictPeerDependencies` reports the following:
> `ERROR: @microsoft/rush-lib > @aws-sdk/client-s3 > @aws-sdk/middleware-retry: react-native-get-random-values@1.6.0 requires a peer of react-native@>=0.56 but none was installed`
@boardend This is a bug. I've opened https://github.com/microsoft/rushstack/issues/2547 to get it fixed.
@iclanton did the schema update? I get the following error using the `build-cache.json` in this PR's description:
```
$ rush build

Rush Multi-Project Build Tool 5.42.3 - https://rushjs.io
Node.js version is 12.18.0 (LTS)

Starting "rush build"

ERROR: JSON validation failed:

...\common\config\rush\build-cache.json

Error: #/
Data does not match any schemas from 'oneOf'
Error: #/
Additional properties not allowed: projectOutputFolderNames
Error: #/
Missing required property: azureBlobStorageConfiguration
Error: #/
Missing required property: amazonS3Configuration
Error: #/cacheProvider
No enum match for: filesystem
```
From searching GitHub, it looks like we should be using `local-only` instead of `filesystem` now? And from reading the comments, maybe `projectOutputFolderNames` has moved to individual projects? It would be great if there was a `.md` file somewhere with the latest instructions on how to start using this. I'm also not sure what I should be looking out for, or whether we should be adding rush-project.json to each subproject, etc. The feature sounds great, I'm just finding it a little hard to figure out what the state of it is!
Yeah, it's `local-only`. @octogonz mentioned he was going to set this up for his team and put together docs while he does that, but I'm not sure if he's had time to do that yet.
I'm currently playing with the S3 bucket support. One thing I haven't quite figured out is how Rush's credential cache is supposed to interact with typical S3 user and CI flows.
For the CI environment, I would expect it's quite unusual to have a user (with AWS Key/Token credentials). Normally in enterprise you would assign a role to the machine that is running the build, and then the AWS SDK automatically discovers that role. You're avoiding use of the SDK, which is OK: you could recommend that users instead make an AssumeRole call early in the build and save temporary credentials. However, these credentials would typically live in `~/.aws/credentials` on the build machine.
For individual users, in some cases you'll have actual AWS users with credentials, but you'll often have federated access. In this case the users don't have cacheable credentials: they've obtained credentials (usually lasting 8-12 hours) through a process involving MFA or perhaps Okta login, and those temporary credentials, again, go in `~/.aws/credentials` by default.
(What I'm trying to figure out is the easiest way to make rush build work in both local and CI situations, without having to write lots of little helpers that copy credentials to/from Rush's credential cache and the normal AWS credentials file.)
EDIT: Minutes after posting this I discovered `RUSH_BUILD_CACHE_WRITE_CREDENTIAL`, which is perfect. The only issue is that if you are authenticating to AWS via roles, you actually need three pieces of information, not two (an access key ID, a secret access key, and a session token). I'll see if I can put up a PR for that.
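For context, this is where the three values come from in the AssumeRole flow (the role ARN below is a placeholder):

```
# Temporary credentials from an assumed role consist of THREE values:
# an access key ID, a secret access key, and a session token.
CREDS=$(aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/build-cache-writer \
  --role-session-name rush-build-cache \
  --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
  --output text)
export AWS_ACCESS_KEY_ID=$(echo "$CREDS" | cut -f1)
export AWS_SECRET_ACCESS_KEY=$(echo "$CREDS" | cut -f2)
export AWS_SESSION_TOKEN=$(echo "$CREDS" | cut -f3)
```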
About the Amazon provider: I see from the discussion that it was implemented over HTTPS because the `aws-sdk` package is a big boi; however, these days a better one is available:
https://www.npmjs.com/package/@aws-sdk/client-s3
It only knows how to talk to S3 and is tree-shakable, AFAIK. Not sure if it's worth a rewrite, but I wanted to make you aware.
One definite gain would be to easily support "normal" ways to authenticate with AWS.
I am testing the functionality with AWS S3, and I was wondering if it would be possible to override the `cacheProvider` through an environment variable.
The behaviour we would like is to use an S3 bucket in our CI environment while developers continue using a local build cache (`local-only`) on their machines.
Is there any interest in adding support for customizing the `cacheEntryNamePattern` further?
I'm imagining support for some kind of environment variable token, which would be interpolated when Rush generates the cache entry key. Something like so:
```
/* build-cache.json */
{
  "cacheEntryNamePattern": "[projectName:normalized]-[hash]-[envVar:NODE_VERSION]"
}
```
And then the parsePattern function would need to support looking for an `[envVar:value]` token. We could hash the value of `process.env[tokenAttribute]` to strip out any invalid characters.
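A quick sketch of what that interpolation could look like (this is not Rush's actual parsePattern implementation, just an illustration of the hashing idea):

```ts
import { createHash } from 'crypto';

// Expand "[envVar:NAME]" tokens in a cacheEntryNamePattern, hashing each
// variable's value so arbitrary characters can't produce an invalid entry name.
function expandEnvVarTokens(pattern: string, env: NodeJS.ProcessEnv = process.env): string {
  return pattern.replace(/\[envVar:([A-Za-z_][A-Za-z0-9_]*)\]/g, (_match, name: string) => {
    const value: string = env[name] ?? '';
    // A short hash keeps the cache key compact and filesystem/URL-safe.
    return createHash('sha256').update(value).digest('hex').slice(0, 8);
  });
}

// Example: the "[hash]" token is left for Rush to fill in; only the
// environment token is expanded here.
console.log(expandEnvVarTokens('my-project-[hash]-[envVar:NODE_VERSION]', { NODE_VERSION: '16.13.0' }));
```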
This would help cover situations like the one I ran into, where the project hasn't changed but the context around it in the CI environment has. For us, the build hadn't changed, but a change to the Node version used to build it led to new errors thrown in a Rush command used in the build. That command used the incremental build feature, so because the cache key hadn't changed between the two builds compiled with different Node versions, it didn't run that incremental command for all of the Rush packages in the repo. It wasn't until the repo saw more changes later in the day that the incremental command ran on all the packages and surfaced a new failure.
Being able to factor the Node version (or any other arbitrary environment variable) into the cache key would have helped avoid that mistake.
Doubling the request of @robmosca-travelperk; I had similar thoughts.
I just set up S3 caching, and it works wonders; on CI it speeds up everything A LOT. Thanks! Although there are a few issues with dangerous cache hits; more below :)
An issue that I encountered, which seems similar to @istateside's, is that the output of certain commands can change with environment variables. So the source code is the same, and everything is exactly the same, but the global environment variables that are going to be injected into the bundle are different. Because to the current algorithm everything looks exactly the same, we get a cache hit and the wrong bundle fetched from the cache as a result. Maybe we can make certain commands per project sensitive to certain env variables, so they will be included in the cache name pattern somehow?
So, let's say, for my libraries it will be absent, as libs neither depend on nor inject any envs into the resulting `dist`, but for web apps it will be something like this in the `rush-project.json` of the app:
```
{
  "operationName": "build",
  "outputFolderNames": ["dist"],
  "breakingEnvVars": ["MODE"]
}
```
So if it runs with `MODE=development` it produces one cache entry, but if it runs with `MODE=production` it produces another cache entry, so no clash happens and we get reliable caching of both.
I tried to get around that with some `command-line.json` parameters for `build`, but `tsc`, `tsup`, and `next` error out if an unknown parameter like `--mode production` is passed to them (only Vite.js expects it). Surely I can create a script to work around this issue too, but requiring these hacks will prevent me, and newcomers too, from gradually adopting caching.
Also, disabling the cache with an env var or parameter would be nice. So if you have a situation like the above, you can just run `RUSH_BUILD_CACHE_DISABLE=true rush build --only my-app`.
Here is how Turborepo handles that: https://turborepo.org/docs/features/caching#alter-caching-based-on-environment-variables-and-files. Overall I think their phased-build DSL and caching configs are pretty good. Nice to have as an inspiration.
Unsure about how global it is (I like the colocation of `rush-project.json`), but overall it's kinda good syntax. I hope the phased-build configs of Rush are as good or even better :)
I think in Rush, rather than modifying the cache key, we'd want to model a dependency on environment variables as a dependency proper: it would change the hash, and the hashes of all other dependent projects in the monorepo.
For example, perhaps `rush-project.json` could have a section dedicated to `dependsOnEnvironment` which lists the names of environment variables, and those values are included in the hash value for that project.
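For illustration, assuming a hypothetical `dependsOnEnvironment` property alongside the existing per-project settings:

```
/* <project root>/config/rush-project.json (hypothetical property) */
{
  "operationName": "build",
  "outputFolderNames": ["dist"],
  /**
   * Hypothetical: the values of these variables are folded into this project's
   * hash, so changing MODE invalidates this project and all of its dependents.
   */
  "dependsOnEnvironment": ["MODE"]
}
```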
> I think in Rush, rather than modifying the cache key, we'd want to model a dependency on environment variables as a dependency proper: it would change the hash, and the hashes of all other dependent projects in the monorepo.
>
> For example, perhaps `rush-project.json` could have a section dedicated to `dependsOnEnvironment` which lists the names of environment variables, and those values are included in the hash value for that project.
Just wanted to +1 the need for environment variables affecting cache. Not having this is pretty sad for frontend projects that bake environment into output.
Just want to emphasize that it should be per command, and not a global thing per project/package, as only one command may be sensitive to envs; the rest, maybe, are fine with any envs, so their cache can be reused across different environments, e.g. local development, CI/CD, etc.
What is the easiest way to get the current/most-recent build-cache key hits after a successful run?
I would like to be able to purge tarballs that are no longer relevant. Our build-cache dir can get quite big...
> What is the easiest way to get the current/most-recent build-cache key hits after a successful run? I would like to be able to purge tarballs that are no longer relevant. Our build-cache dir can get quite big...
If we're talking about buckets in the cloud, you can set a policy to remove files after they haven't been accessed for some time.
If we're talking locally, I think you can write a script that goes through the files, reads their created date, and deletes accordingly.
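A minimal sketch of such a script, assuming the local cache lives under `common/temp/build-cache` (adjust the path and age threshold as needed):

```ts
// prune-build-cache.ts: delete local cache tarballs not touched in 30 days.
import * as fs from 'fs';
import * as path from 'path';

const cacheFolder: string = path.resolve(__dirname, '..', 'common', 'temp', 'build-cache');
const maxAgeMs: number = 30 * 24 * 60 * 60 * 1000;
const now: number = Date.now();

for (const fileName of fs.readdirSync(cacheFolder)) {
  const filePath: string = path.join(cacheFolder, fileName);
  const stats: fs.Stats = fs.statSync(filePath);
  // atime tracks the last read (a cache hit); fall back to mtime if atime updates are disabled.
  const lastUsed: number = Math.max(stats.atimeMs, stats.mtimeMs);
  if (now - lastUsed > maxAgeMs) {
    fs.unlinkSync(filePath);
    console.log(`Deleted stale cache entry: ${fileName}`);
  }
}
```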
@RIP21 I totally could, but I also think it would be a finicky script; I would personally opt for grepping around the rush-lib/rushstack repo to find the bits and pieces I need to calculate the build-cache keys post-run. It seems like it would be relatively easy to have Rush dump the cache-hit keys after a build run.
I found the S3 build-caching to be OK (tho I tried it quite a while ago).
We/me are/is using composite GitHub workflows w/ github-actions-cache to push and pull the build-cache.
Summary
Rush has an incremental build feature, where a project isn't rebuilt if its non-gitignored files and external dependencies haven't changed since the last build. This feature has several limitations: it doesn't verify that the output files haven't changed, only the immediately previous build is accounted for, and incremental builds are confined to a single copy of the repo.
I propose a more robust build caching feature with the following structure:
- The store can be in the `common/config/temp` folder, a remote object store like Azure Storage, some machine-wide shared store, or a combination of those

Feedback is welcome.
Proposed Implementation
I've put together an initial implementation in this pull request: https://github.com/microsoft/rushstack/pull/2390
I propose the following configuration files:
common/config/rush/build-cache.json
This file describes the cache store and the default project folders that are cached. In this initial implementation, I've decided upon two cache stores: a per-repo filesystem location and Azure Storage.
Here is an example per-repo filesystem `common/config/rush/build-cache.json` file that stores the cache in `common/temp/build-cache`:
Here is an example Azure Storage `common/config/rush/build-cache.json` file that stores the cache in an Azure Storage account called `example-account`, in a container called `files`, and with an `example/` prefix for each cache entry:
`<project root>/.rush/build-cache.json`
This file describes per-project overrides for build cache configuration.
Here is an example `<project root>/.rush/build-cache.json` file that adds a `lib-commonjs` cache folder:
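Something like this (the additive property name below is illustrative):

```
{
  /**
   * (Illustrative name) Folder names to cache for this project, in addition
   * to the "projectOutputFolderNames" in common/config/rush/build-cache.json.
   */
  "additionalProjectOutputFolderNames": ["lib-commonjs"]
}
```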
Here is an example `<project root>/.rush/build-cache.json` file that caches a `lib-commonjs` folder and ignores the `projectOutputFolderNames` property in `common/config/rush/build-cache.json`:
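```
{
  /**
   * Overrides (rather than extends) the "projectOutputFolderNames" value
   * from common/config/rush/build-cache.json.
   */
  "projectOutputFolderNames": ["lib-commonjs"]
}
```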