iclanton opened this issue 3 years ago
I love this, because I've tried to implement this twice myself, via docker registry images and also npm packages. In the end, it would fit much better within rush.
Maybe it would be a good idea to provide some kind of hooks in a js file so that people can upload/download the cache files in whatever way they choose? The connection string is a little hard to apply for more than a few use cases, unless I'm missing something.
@wbern

> Maybe it would be a good idea to provide some kind of hooks in a js file so that people can upload/download the cache files in whatever way they choose?
I've thought about that. We've put together a pretty flexible plugin system in Heft, but so far I haven't been able to come up with a way that would cleanly work in Rush. Maybe a `common/config/rush/plugins.json` file that lists package names + versions, installs them, and calls an `apply` function when Rush boots up? That'd mean that plugins must be published to a package feed, which is unfortunate. It also potentially gets confusing when someone is using Rush's API. I don't love just plain JS hooks, because this is the kind of thing that should probably be versioned, and plugin authors should have a typesafe interface to code against. What are your thoughts here?
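As a rough sketch of what I mean (the file shape and names here are hypothetical, just for discussion):

```
/* common/config/rush/plugins.json (hypothetical) */
{
  "plugins": [
    {
      /**
       * An NPM package that Rush would install and whose "apply" function
       * would be called when Rush boots up.
       */
      "packageName": "@example/rush-custom-cache-plugin",
      "version": "1.0.0"
    }
  ]
}
```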
> The connection string is a little hard to apply for more than a few use cases, unless I'm missing something.
Azure Storage's connection strings are a simple way to specify a storage endpoint (i.e. an account name) and an access token. I think we'll want to support some sort of interactive and AAD authentication, and probably some way to pass credentials in via an environment variable. I think those will be features that we'll need to figure out after some community feedback, because what works for my team at Microsoft is almost definitely not going to work for everyone else. What are your concerns here? It'd be good to try to be aware of other scenarios as early as possible.
> I don't love just plain JS hooks because this is the kind of thing that should probably be versioned and plugin authors should have a typesafe interface to code against. What are your thoughts here?
I don't know if this is a bad idea, or too much work, but what about this: each project defines package.json scripts that Rush invokes with parameters (e.g. `npm run cache-script -- --action="download"`), or via some environment variables, to perform the upload/download. This is the trade-off for escaping the limitation of plain .js files (the projects are now handling the runtime instead). You could also have a separate package.json script for each action/hook type, to avoid the parameter/env passing.
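For example, the per-action variant could look something like this (the script and file names are just illustrative):

```
/* package.json of a project */
{
  "scripts": {
    "cache-upload": "node ./scripts/cache.js --action upload",
    "cache-download": "node ./scripts/cache.js --action download"
  }
}
```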
> I think those will be features that we'll need to figure out after some community feedback because what works for my team at Microsoft is almost definitely not going to work for everyone else. What are your concerns here? It'd be good to try to be aware of other scenarios as early as possible.
My org uses Amazon S3 to store artifacts currently, and we intend to use the S3 buckets to serve the files through a CDN (Cloudfront), so that is one thing to consider.
We also have a widely established method of uploading complete .tar.gz files of each monorepo project to Artifactory, which we use to later serve the files through a CMS (unpacked). I'm not very familiar with the details you mention (like AAD authentication), so I'm not sure how these components would fit your proposal and ideas.
Launching a separate process will probably impact perf for a 100% cached repo quite a bit. I also don't think the cache provider should be configurable per-project.
> My org uses Amazon S3 to store artifacts currently, and we intend to use the S3 buckets to serve the files through a CDN (Cloudfront), so that is one thing to consider.
Oh yeah I absolutely want to also support S3 in addition to Azure Storage. I want to get that in soon after the initial implementation is checked in. The official AWS package is pretty enormous, so we'll probably want to call its APIs via REST, or use a trimmed down package.
> We also have a widely established method of uploading complete .tar.gz files of each monorepo project to Artifactory, which we use to later serve the files through a CMS (unpacked). I'm not very familiar with the details you mention (like AAD authentication), so I'm not sure how these components would fit your proposal and ideas.
Would you want to use Artifactory as your cache store? That sounds more like a post-build `rush publish` than it does like a build cache.
> Launching a separate process will probably impact perf for a 100% cached repo quite a bit. I also don't think the cache provider should be configurable per-project.
That's a very good point!
Ok, if S3 is getting first-class support, then that's great! Actually, that's what a colleague of mine mentioned; I wasn't sure if you'd be up for that, so that's great to hear.
Regarding Artifactory: yes, at first I thought it could be one and the same, but build caches probably deserve their own storage solution.
Here's my initial feedback:

> The store can be in the `common/config/temp` folder, a remote object store like Azure Storage, some machine-wide shared store, or a combination of those

- How does the "combination" of caches work? When installing, would Rush consult the caches in a specific order? In the proposed config file, `"cacheProvider": "azure-storage"` is a single setting, not an array.

`common/config/rush/build-cache.json`

```
{
  "cacheProvider": "filesystem",
  /**
   * A list of folder names under each project root that should be cached. These folders should not be tracked by git.
   */
  "projectOutputFolderNames": ["lib", "dist"]
}
```

- The `"projectOutputFolderNames"` seem like an inherent property of the toolchain, not the monorepo. It seems like they should be specified in each project folder, via `rig.json` typically. It may seem convenient for `common/config/rush/build-cache.json` to specify "defaults", but that would mean that a project's configuration cannot be correctly interpreted without considering the monorepo config files, which is confusing, and incompatible with the idea of a rig being an NPM package. (This is the reason why we eliminated Heft's global config files.)

```
/**
 * A connection string for accessing the Azure storage account.
 */
"connectionString": "BlobEndpoint=https://example-account.blob.core.windows.net/;SharedAccessSignature=<...SAS token...>",
```

- Is `<...SAS token...>` a security credential? If so, it seems odd to be storing such credentials in the Git repo. If all the developers will reuse the same token to access a shared container, maybe they should put it in a non-Git place like `~/.rush/tokens.json`. The ideal obviously would be to add users to a security group that has write permission to the container, so that each user has their own token. For example, if someone leaves the company, this way you could revoke their access without forcing everyone else to update their token. Does Azure support that? AWS does.

> The connection string is a little hard to apply for more than a few use cases, unless I'm missing something.

- I agree with @wbern. It would be better for our JSON schema to have separate config blocks for each supported provider, so that you get nice IntelliSense and doc comments for `"azureServerUrl"` and `"azureOtherThing"` versus `"awsContainerName"` etc. Munging everything into a text string makes it harder to understand what the options are. Also the validation and error reporting for malformed "connection strings" is typically not very good.

- Symlinks cannot be stored easily in a tarball. Rush should probably report an error if a symlink is encountered while archiving a file.
@octogonz - here are some answers to your questions:
> How would this interact with the multiphase model that you proposed in #2300? Would we cache each phase as a separate tarball? Or would only certain compositions of phases be cached? We don't need to implement that in the MVP of course, but it would be helpful for our design to consider it, so we don't need to overhaul the build cache config files when the multiphase feature is introduced.
I've thought about this. I think in the MVP of the multiphase feature, we'll just cache the output of all of each project's phases, and consider all of the project's phases complete in the event of a cache hit. Because both of these features will be released as opt-in "experiments," we can tweak this design as we get more experience with it.
> How does the "combination" of caches work? When installing, would Rush consult the caches in a specific order? In the proposed config file, `"cacheProvider": "azure-storage"` is a single setting, not an array.
Right now the MVP supports only a single cache provider. It'd be pretty easy to change the "filesystem" provider to be an independent option (something like `"enableLocalRepoCache": true`) that is queried before the other configured provider. I'd like to introduce a per-user global settings file that would allow someone to point their local cache at a system-wide location, so multiple clones of the same repo may share a cache, but I think that's out of scope for the MVP.
> The `"projectOutputFolderNames"` seem like an inherent property of the toolchain, not the monorepo. It seems like they should be specified in each project folder, via `rig.json` typically. It may seem convenient for `common/config/rush/build-cache.json` to specify "defaults", but that would mean that a project's configuration cannot be correctly interpreted without considering the monorepo config files, which is confusing, and incompatible with the idea of a rig being an NPM package. (This is the reason why we eliminated Heft's global config files.)
I think I agree here: it probably should be an option in the per-project config, and those should have rig support. I think I'd like to make that change in a follow-up PR after the initial check-in. It should be a simple change, but it'll be easier to reason about as a standalone change.
> Is `<...SAS token...>` a security credential? If so, it seems odd to be storing such credentials in the Git repo. If all the developers will reuse the same token to access a shared container, maybe they should put it in a non-Git place like `~/.rush/tokens.json`.
Yes, it is a security credential. I've removed this property and, instead, created a file called `~/.rush/build-cache-credentials-cache.json` and a `rush update-build-cache-credentials` command that allows users to explicitly specify credentials or to go through an interactive login flow.
> The ideal obviously would be to add users to a security group that has write permission to the container, so that each user has their own token. For example, if someone leaves the company, this way you could revoke their access without forcing everyone else to update their token. Does Azure support that? AWS does.
Yes, Azure supports that.
> I agree with @wbern. It would be better for our JSON schema to have separate config blocks for each supported provider, so that you get nice IntelliSense and doc comments for `"azureServerUrl"` and `"azureOtherThing"` versus `"awsContainerName"` etc. Munging everything into a text string makes it harder to understand what the options are. Also the validation and error reporting for malformed "connection strings" is typically not very good.
I've updated the design to stick the Azure-specific stuff in an `azureBlobStorageConfiguration` property.
I've made some updates to the design after reviewing feedback.
common/config/rush/build-cache.json
I've updated the schema of this file to drop the connection string property and move all of the provider-specific settings into a provider-specific property. The filesystem version of the file looks the same:
```
{
  "cacheProvider": "filesystem",
  /**
   * A list of folder names under each project root that should be cached. These folders should not be tracked by git.
   */
  "projectOutputFolderNames": ["lib", "dist"]
}
```
The Azure Storage provider of the file looks like this now:
```
{
  "cacheProvider": "azure-blob-storage",
  /**
   * A list of folder names under each project root that should be cached. These folders should not be tracked by git.
   */
  "projectOutputFolderNames": ["lib", "dist"],
  "azureBlobStorageConfiguration": {
    /**
     * The name of the Azure storage account to use for build cache.
     */
    "storageAccountName": "example-account",
    /**
     * The name of the container in the Azure storage account to use for build cache.
     */
    "storageContainerName": "files",
    /**
     * (OPTIONAL) The Azure environment the storage account exists in. Defaults to AzureCloud.
     */
    "azureEnvironment": "AzureCloud", // Can also be "AzureChinaCloud", "AzureUSGovernment", or "AzureGermanCloud"
    /**
     * An optional prefix for cache item blob names.
     */
    "blobPrefix": "example",
    /**
     * If set to true, allow writing to the cache. Defaults to false.
     */
    "isCacheWriteAllowed": true
  }
}
```
`<project root>/.rush/build-cache.json` ➔ `<project root>/config/rush/build-cache.json`
This file has been moved into the project's `config/rush/` folder, with the same schema.
I've introduced a `~/.rush/build-cache-credentials-cache.json` file with the following format (for an `"azure-blob-storage"` cache provider as an example):
```
{
  "cacheEntries": {
    "azure-blob-storage|AzureCloud|example-account|files|cacheWriteAllowed": {
      "expires": 1608948745090,
      "credential": "BlobEndpoint=https://example-account.blob.core.windows.net/;SharedAccessSignature=<...SAS token...>"
    }
  }
}
```
`rush update-build-cache-credentials`
I've also introduced a `rush update-build-cache-credentials` command with several (mutually exclusive) flags:
- `--credential CREDENTIAL_STRING`: This parameter is used to explicitly specify a credential.
- `--interactive` / `-i`: This flag executes an interactive login flow. For Azure Blob Storage, that looks like this:
```
Rush Multi-Project Build Tool 5.35.2 (unmanaged) - https://rushjs.io
Node.js version is 10.22.1 (LTS)

Starting "rush update-build-cache-credentials"

╔══════════════════════════════════════════════════════════════════════════════════╗
║ To sign in, use a web browser to open the page                                    ║
║ https://microsoft.com/devicelogin and enter the code ABC123XYZ to                 ║
║ authenticate.                                                                     ║
╚══════════════════════════════════════════════════════════════════════════════════╝
```
The user opens a browser, navigates to https://microsoft.com/devicelogin, provides the code, and then goes through a normal AAD login flow. When Rush hears a response from Azure, the credential is dropped into the credential cache file in the user's HOME folder.
`RUSH_BUILD_CACHE_CONNECTION_STRING` environment variable
I foresee larger repos having CI machines populate a cloud-based cache and having developers only read from the cache. To facilitate that, the connection string can be set in an environment variable. We'll need to add support for turning on the `"isCacheWriteAllowed"` option via environment as well.
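For example, a CI pipeline could do something like this (the write-enable variable name below is hypothetical; only the connection string variable is proposed here):

```
# CI machine: holds the credential and is allowed to write to the cache.
# Developers omit these variables and only read from the cache.
export RUSH_BUILD_CACHE_CONNECTION_STRING="BlobEndpoint=https://example-account.blob.core.windows.net/;SharedAccessSignature=<...SAS token...>"
export RUSH_BUILD_CACHE_WRITE_ALLOWED=1  # hypothetical name for the "isCacheWriteAllowed" override
rush build
```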
Some of this should be done in follow-up PRs:
- Make the `"filesystem"` cache provider an independent option (https://github.com/microsoft/rushstack/pull/2404)
- Generate a `build-cache.json` file when `rush init` is run
- Replace the `"projectOutputFolderNames"` property with options in project-specific config files (with extends+rig support) (https://github.com/microsoft/rushstack/pull/2403)

Really impressed with the dedication here. My org will be testing the S3 provider as soon as something is ready to test. 🙂
Merged in the initial implementation PR. Summary of changes since my last post:
- `~/.rush/build-cache-credentials-cache.json` has been renamed to `~/.rush/credentials.json`
- `rush update-build-cache-credentials` has been renamed to `rush update-cloud-credentials`
- The `RUSH_BUILD_CACHE_CONNECTION_STRING` environment variable has been renamed to `RUSH_BUILD_CACHE_CREDENTIAL`

This has been published in Rush version 5.36.1. It's behind the `"buildCache"` experiment.
I hacked together an S3 implementation (https://github.com/raihle/rushstack/tree/aws-build-cache), but there are a lot of issues with it:
- Added `<void>` type parameters to a handful of `Promise`s which resolve with no result
- Removed `async` from an `abstract` method, which can't `await` anything anyway since there is no method body
- Enabled `skipLibCheck` in tsconfig to work around some typing issues in the AWS SDK
Should I open a PR to discuss, or do we keep it here?
Finally had the chance to test out the build cache experiment.
I was hoping that, because the new mechanism includes the command being run, Rush could handle the following scenario in a smart way:
Projects a and b both depend on c. I have an additional parameter `--test` for `build` (and `rebuild`) in Rush's command-line.json. Depending on `--test`, the build output is emitted to a different subdirectory of the dist directory. When I start with a clean repo, the following happens:
```
rush build -t a --test one
rush build -t a --test one
rush build -t b --test one
rush build -t a --test two
rush build -t a --test two
rush build -t b --test one
```
The last command could be skipped, because b and c were already built with this build state, including the additional parameter `--test one`.
Update:
I've just realized that the build cache feature only kicks in if there is a rush-project.json (locally or via rig) with at least one "projectOutputFolder" (as indicated in the Rush log output...). Now the scenario from above is working, and I love it! 😍
Currently I don't see any way to use just the skip feature without actually caching the project outputs. My use case: I'm building several packages with emscripten. Based on the command-line.json parameters `--build-type` (which affects the optimization) and `--threading-model` (which includes pthread/offscreen support), a dedicated build workspace (`./workspace/${buildType}-${threadingModel}/`) and dist (`./dist/${buildType}-${threadingModel}/`) folder is used by the build script.
All dependents will include the .wasm file based on the dist folder derived from their parameters. With this structure, all output resides in the project's `workspace` and `dist` folders, and therefore there is no need for an actual cache.
I'm thinking about a "NullCache" provider which would store the state hash in an empty file. Skipping would still work, but without any copying/restoring of the build output.
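A rough sketch of the idea, assuming a provider contract shaped roughly like the cloud providers' (the interface and method names below are guesses, not Rush's actual API):

```ts
import * as fs from 'fs';
import * as path from 'path';

// Hypothetical provider contract, modeled loosely on the cloud cache providers.
interface IBuildCacheProvider {
  tryGetCacheEntryBufferByIdAsync(cacheId: string): Promise<Buffer | undefined>;
  trySetCacheEntryBufferAsync(cacheId: string, entryBuffer: Buffer): Promise<boolean>;
}

// A "NullCache" provider: records only that a cache key was built (an empty
// marker file), so skipping works without copying/restoring any build output.
export class NullBuildCacheProvider implements IBuildCacheProvider {
  public constructor(private readonly _markerFolder: string) {}

  public async tryGetCacheEntryBufferByIdAsync(cacheId: string): Promise<Buffer | undefined> {
    const markerPath: string = path.join(this._markerFolder, cacheId);
    // A hit returns an empty archive, so Rush skips the build but restores nothing.
    // (A real implementation would need this to be a valid empty tarball.)
    return fs.existsSync(markerPath) ? Buffer.alloc(0) : undefined;
  }

  public async trySetCacheEntryBufferAsync(cacheId: string, _entryBuffer: Buffer): Promise<boolean> {
    fs.mkdirSync(this._markerFolder, { recursive: true });
    // Store only the marker; the actual output tarball is discarded.
    fs.writeFileSync(path.join(this._markerFolder, cacheId), '');
    return true;
  }
}
```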
Amazon S3 support has been released in Rush 5.42.0. Many thanks to @raihle for putting this together!
@iclanton with `@microsoft/rush-lib` 5.42.0, pnpm with `strictPeerDependencies` reports the following:

```
ERROR: @microsoft/rush-lib > @aws-sdk/client-s3 > @aws-sdk/middleware-retry: react-native-get-random-values@1.6.0 requires a peer of react-native@>=0.56 but none was installed
```

I agree with @octogonz's comment; the S3 and Azure dependencies should be excluded from Rush?!
Wouldn't it be nice to have a way to provide external/custom cache implementations? The S3/Azure implementations would go to a dedicated rushstack package, and I could provide an exotic implementation (like the null cache proposed above) as a package inside the repo.
> with `@microsoft/rush-lib` 5.42.0, pnpm with `strictPeerDependencies` reports the following:
> `ERROR: @microsoft/rush-lib > @aws-sdk/client-s3 > @aws-sdk/middleware-retry: react-native-get-random-values@1.6.0 requires a peer of react-native@>=0.56 but none was installed`
@boardend This is a bug. I've opened https://github.com/microsoft/rushstack/issues/2547 to get it fixed.
@iclanton did the schema update? I get the following error using the `build-cache.json` in this PR's description:
```
$ rush build

Rush Multi-Project Build Tool 5.42.3 - https://rushjs.io
Node.js version is 12.18.0 (LTS)

Starting "rush build"

ERROR: JSON validation failed:

...\common\config\rush\build-cache.json

Error: #/
Data does not match any schemas from 'oneOf'
Error: #/
Additional properties not allowed: projectOutputFolderNames
Error: #/
Missing required property: azureBlobStorageConfiguration
Error: #/
Missing required property: amazonS3Configuration
Error: #/cacheProvider
No enum match for: filesystem
```
From searching GitHub, it looks like we should be using `local-only` instead of `filesystem` now? And from reading the comments, maybe `projectOutputFolderNames` has moved to individual projects? It would be great if there was a `.md` file somewhere with the latest instructions on how to start using this. I'm also not sure what I should be looking out for, or whether we should be adding rush-project.json to each subproject, etc. The feature sounds great, I'm just finding it a little hard to figure out what the state of it is!
Yeah, it's `local-only`. @octogonz mentioned he was going to set this up for his team and put together docs while he does that, but I'm not sure if he's had time to do that yet.
I'm currently playing with the S3 bucket support. One thing I haven't quite figured out is how Rush's credential cache is supposed to interact with typical S3 user and CI flows.
For the CI environment, I would expect it's quite unusual to have a user (with AWS Key/Token credentials). Normally in enterprise you would assign a role to the machine that is running the build, and then the AWS SDK automatically discovers that role. You're avoiding use of the SDK, which is OK: you could recommend that users instead make an AssumeRole call early in the build and save temporary credentials. However, these credentials would typically live in `~/.aws/credentials` on the build machine.
For individual users, in some cases you'll have actual AWS users with credentials, but you'll often have federated access. In this case the users don't have cacheable credentials: they've obtained credentials (usually lasting 8-12 hours) through a process involving MFA or perhaps Okta login, and those temporary credentials, again, go in `~/.aws/credentials` by default.
(What I'm trying to figure out is the easiest way to make rush build work in both local and CI situations, without having to write lots of little helpers that copy credentials to/from Rush's credential cache and the normal AWS credentials file.)
EDIT: Minutes after posting this I discovered `RUSH_BUILD_CACHE_WRITE_CREDENTIAL`, which is perfect. The only issue is that if you are authenticating to AWS via roles, you actually need three pieces of information, not two (an access key ID, a secret access key, and a session token). I'll see if I can put up a PR for that.
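For context, this is where the three values come from in the AssumeRole flow (the role ARN below is a placeholder):

```
# Temporary credentials from an assumed role consist of THREE values:
# an access key ID, a secret access key, and a session token.
CREDS=$(aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/build-cache-writer \
  --role-session-name rush-build-cache \
  --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
  --output text)
export AWS_ACCESS_KEY_ID=$(echo "$CREDS" | cut -f1)
export AWS_SECRET_ACCESS_KEY=$(echo "$CREDS" | cut -f2)
export AWS_SESSION_TOKEN=$(echo "$CREDS" | cut -f3)
```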
About the Amazon provider: I see from the discussion that it was implemented over HTTPS because the `aws-sdk` package is a big boi; however, these days a better one is available:
https://www.npmjs.com/package/@aws-sdk/client-s3
It only knows how to talk to S3 and is tree-shakable, AFAIK. Not sure if it's worth a rewrite, but I wanted to make you aware.
One definite gain would be to easily support "normal" ways to authenticate with AWS.
I am testing the functionality with AWS S3, and I was wondering if it would be possible to override the `cacheProvider` through an environment variable.
The behaviour we would like is to use an S3 bucket in our CI environment while developers continue using a local build cache (`local-only`) on their machines.
Is there any interest in adding support for customizing the `cacheEntryNamePattern` further?
I'm imagining support for some kind of environment variable token, which would be interpolated when Rush generates the cache entry key. Something like so:
```
/* build-cache.json */
{
  "cacheEntryNamePattern": "[projectName:normalized]-[hash]-[envVar:NODE_VERSION]"
}
```
And then the parsePattern function would need to support looking for an `[envVar:value]` token. We could hash the value of `process.env[tokenAttribute]` to strip out any invalid characters.
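A quick sketch of what that interpolation could look like (this is not Rush's actual parsePattern implementation, just an illustration of the hashing idea):

```ts
import { createHash } from 'crypto';

// Expand "[envVar:NAME]" tokens in a cacheEntryNamePattern, hashing each
// variable's value so arbitrary characters can't produce an invalid entry name.
function expandEnvVarTokens(pattern: string, env: NodeJS.ProcessEnv = process.env): string {
  return pattern.replace(/\[envVar:([A-Za-z_][A-Za-z0-9_]*)\]/g, (_match, name: string) => {
    const value: string = env[name] ?? '';
    // A short hash keeps the cache key compact and filesystem/URL-safe.
    return createHash('sha256').update(value).digest('hex').slice(0, 8);
  });
}

// Example: the "[hash]" token is left for Rush to fill in; only the
// environment token is expanded here.
console.log(expandEnvVarTokens('my-project-[hash]-[envVar:NODE_VERSION]', { NODE_VERSION: '16.13.0' }));
```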
This would help cover situations like the one I ran into, where the project hasn't changed but the context around it in the CI environment has. For us, the build hadn't changed, but a change to the Node version used to build it led to new errors thrown in a Rush command used in the build. That command used the incremental build feature, so because the cache key hadn't changed between the two builds compiled with different Node versions, it didn't run that incremental command for all of the Rush packages in the repo. It wasn't until the repo saw more changes later in the day that the incremental command ran on all the packages and surfaced a new failure.
Being able to factor the Node version (or any other arbitrary environment variable) into the cache key would have helped avoid that mistake.
Doubling the request of @robmosca-travelperk; I had similar thoughts.
I just set up S3 caching, and it works wonders; on CI it speeds up everything A LOT. Thanks! Although there are a few issues with dangerous cache hits; more below :)
An issue that I encountered, which seems similar to @istateside's, is that the output of certain commands can change with environment variables. So the source code is the same, and everything is exactly the same, but the global environment variables that are going to be injected into the bundle are different. Because to the current algorithm everything looks exactly the same, we get a cache hit and the wrong bundle fetched from the cache as a result. Maybe we can make certain commands per project sensitive to certain env variables, so they will be included in the cache name pattern somehow?
So, let's say, for my libraries it will be absent, as libs neither depend on nor inject any envs into the resulting `dist`, but for web apps it will be something like this in the `rush-project.json` of the app:
```
{
  "operationName": "build",
  "outputFolderNames": ["dist"],
  "breakingEnvVars": ["MODE"]
}
```
So if it runs with `MODE=development` it produces one cache entry, but if it runs with `MODE=production` it produces another cache entry, so no clash happens and we get reliable caching of both.
I tried to get around that with some `command-line.json` parameters for `build`, but `tsc`, `tsup`, and `next` error out if an unknown parameter like `--mode production` is passed to them (only Vite.js expects it). Surely I can create a script to work around this issue too, but requiring these hacks will prevent me, and newcomers too, from gradually adopting caching.
Also, disabling the cache with an env var or parameter would be nice. So if you have a situation like the above, you can just run `RUSH_BUILD_CACHE_DISABLE=true rush build --only my-app`.
Here is how Turborepo handles that: https://turborepo.org/docs/features/caching#alter-caching-based-on-environment-variables-and-files. Overall I think their phased-build DSL and caching configs are pretty good. Nice to have as an inspiration.
Unsure about how global it is (I like the colocation of `rush-project.json`), but overall it's kinda good syntax. I hope the phased-build configs of Rush are as good or even better :)
I think in Rush, rather than modifying the cache key, we'd want to model a dependency on environment variables as a dependency proper: it would change the hash, and the hashes of all other dependent projects in the monorepo.
For example, perhaps `rush-project.json` could have a section dedicated to `dependsOnEnvironment` which lists the names of environment variables, and those values are included in the hash value for that project.
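For illustration, assuming a hypothetical `dependsOnEnvironment` property alongside the existing per-project settings:

```
/* <project root>/config/rush-project.json (hypothetical property) */
{
  "operationName": "build",
  "outputFolderNames": ["dist"],
  /**
   * Hypothetical: the values of these variables are folded into this project's
   * hash, so changing MODE invalidates this project and all of its dependents.
   */
  "dependsOnEnvironment": ["MODE"]
}
```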
> I think in Rush, rather than modifying the cache key, we'd want to model a dependency on environment variables as a dependency proper: it would change the hash, and the hashes of all other dependent projects in the monorepo.
>
> For example, perhaps `rush-project.json` could have a section dedicated to `dependsOnEnvironment` which lists the names of environment variables, and those values are included in the hash value for that project.
Just wanted to +1 the need for environment variables affecting cache. Not having this is pretty sad for frontend projects that bake environment into output.
Just want to emphasize that it should be per command, and not a global thing per project/package, as only one command may be sensitive to envs; the rest, maybe, are fine with any envs, so their cache can be reused across different environments, e.g. local development, CI/CD, etc.
What is the easiest way to get the current/most-recent build-cache key hits after a successful run?
I would like to be able to purge tarballs that are no longer relevant. Our build-cache dir can get quite big...
> What is the easiest way to get the current/most-recent build-cache key hits after a successful run? I would like to be able to purge tarballs that are no longer relevant. Our build-cache dir can get quite big...
If we're talking about buckets in the cloud, you can set a policy to remove files after they haven't been accessed for some time.
If we're talking locally, I think you can write a script that goes through the files, reads their created date, and deletes accordingly.
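A minimal sketch of such a script, assuming the local cache lives under `common/temp/build-cache` (adjust the path and age threshold as needed):

```ts
// prune-build-cache.ts: delete local cache tarballs not touched in 30 days.
import * as fs from 'fs';
import * as path from 'path';

const cacheFolder: string = path.resolve(__dirname, '..', 'common', 'temp', 'build-cache');
const maxAgeMs: number = 30 * 24 * 60 * 60 * 1000;
const now: number = Date.now();

for (const fileName of fs.readdirSync(cacheFolder)) {
  const filePath: string = path.join(cacheFolder, fileName);
  const stats: fs.Stats = fs.statSync(filePath);
  // atime tracks the last read (a cache hit); fall back to mtime if atime updates are disabled.
  const lastUsed: number = Math.max(stats.atimeMs, stats.mtimeMs);
  if (now - lastUsed > maxAgeMs) {
    fs.unlinkSync(filePath);
    console.log(`Deleted stale cache entry: ${fileName}`);
  }
}
```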
@RIP21 I totally could, but I also think it would be a finicky script; I would personally opt for grepping around the rush-lib/rushstack repo to find the bits and pieces I need to calculate the build-cache keys post-run. It seems like it would be relatively easy to have Rush dump the cache-hit keys after a build run.
I found the S3 build-caching to be OK (tho I tried it quite a while ago).
We/me are/is using composite GitHub workflows w/ github-actions-cache to push and pull the build-cache.
Summary
Rush has an incremental build feature, where a project isn't rebuilt if its non-gitignored files and external dependencies haven't changed since the last build. This feature has several limitations: it doesn't verify that the output files haven't changed, only the immediately previous build is accounted for, and incremental builds are confined to a single copy of the repo.
I propose a more robust build caching feature with the following structure:
- The store can be in the `common/config/temp` folder, a remote object store like Azure Storage, some machine-wide shared store, or a combination of those

Feedback is welcome.
Proposed Implementation
I've put together an initial implementation in this pull request: https://github.com/microsoft/rushstack/pull/2390
I propose the following configuration files:
common/config/rush/build-cache.json
This file describes the cache store and the default project folders that are cached. In this initial implementation, I've decided upon two cache stores: a per-repo filesystem location and Azure Storage.
Here is an example per-repo filesystem `common/config/rush/build-cache.json` file that stores the cache in `common/temp/build-cache`:
Here is an example Azure Storage `common/config/rush/build-cache.json` file that stores the cache in an Azure Storage account called `example-account`, in a container called `files`, and with an `example/` prefix for each cache entry:
`<project root>/.rush/build-cache.json`
This file describes per-project overrides for build cache configuration.
Here is an example `<project root>/.rush/build-cache.json` file that adds a `lib-commonjs` cache folder:
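Something like this (the additive property name below is illustrative):

```
{
  /**
   * (Illustrative name) Folder names to cache for this project, in addition
   * to the "projectOutputFolderNames" in common/config/rush/build-cache.json.
   */
  "additionalProjectOutputFolderNames": ["lib-commonjs"]
}
```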
Here is an example `<project root>/.rush/build-cache.json` file that caches a `lib-commonjs` folder and ignores the `projectOutputFolderNames` property in `common/config/rush/build-cache.json`:
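```
{
  /**
   * Overrides (rather than extends) the "projectOutputFolderNames" value
   * from common/config/rush/build-cache.json.
   */
  "projectOutputFolderNames": ["lib-commonjs"]
}
```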