rules_gcs

Bazel rules for downloading files from Google Cloud Storage (GCS).

Features

Can be used as a drop-in replacement for http_file (gcs_file) and http_archive (gcs_archive)
Can fetch large numbers of objects lazily from a bucket using the gcs_bucket module extension
Supports fetching from private buckets using a credential helper
Uses Bazel's downloader and the repository cache
No dependencies like gsutil need to be installed[^1]

Installation

You can find the latest version of rules_gcs on the Bazel Central Registry. To install it, add a bazel_dep line to MODULE.bazel:

bazel_dep(name = "rules_gcs", version = "1.0.0")

Additionally, you need to vendor the credential helper in your own repository or install it on your $PATH:

mkdir -p tools
wget -O tools/credential-helper https://raw.githubusercontent.com/tweag/rules_gcs/main/tools/credential-helper
chmod +x tools/credential-helper

Bazel needs to be configured to use the credential helper for the Google Cloud Storage domain storage.googleapis.com in .bazelrc:

common --credential_helper=storage.googleapis.com=%workspace%/tools/credential-helper

# recommended optimization
common --experimental_repository_cache_hardlinks

It is important to limit the scope of the credential helper to that domain, since the helper does not yet parse the requested URI and would otherwise return GCS credentials for every request Bazel passes to it.
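To make the role of the credential helper more concrete, here is a minimal sketch of Bazel's credential helper protocol in Python. It is not the vendored tools/credential-helper script. It assumes the token comes from gcloud auth application-default print-access-token, and the names fetch_token, build_response, and main are hypothetical.

```python
import json
import subprocess


def fetch_token() -> str:
    """Ask gcloud for an application-default access token (assumed token source)."""
    result = subprocess.run(
        ["gcloud", "auth", "application-default", "print-access-token"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def build_response(token: str) -> dict:
    """Shape the token as the header map Bazel expects back from a helper."""
    return {"headers": {"Authorization": [f"Bearer {token}"]}}


def main(argv, stdin, stdout, token_source=fetch_token) -> int:
    # Bazel invokes the helper as `<helper> get` and writes a JSON request
    # such as {"uri": "https://..."} to the helper's stdin.
    if len(argv) != 2 or argv[1] != "get":
        return 1
    json.load(stdin)  # The URI is read but not inspected in this sketch.
    json.dump(build_response(token_source()), stdout)
    return 0

# When used as a script, the entry point would be:
#   sys.exit(main(sys.argv, sys.stdin, sys.stdout))
```

Because the sketch returns the same Authorization header for any URI, it shows why the --credential_helper flag should be scoped to storage.googleapis.com.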

Usage

rules_gcs offers two repository rules, gcs_file and gcs_archive, for fetching individual objects. If you need to download multiple objects from a bucket, use the gcs_bucket module extension instead.

To see how it all comes together, take a look at the full example.

gcs_archive

load("@rules_gcs//gcs:repo_rules.bzl", "gcs_archive")

gcs_archive(name, build_file, build_file_content, canonical_id, integrity, patch_strip, patches,
            rename_files, repo_mapping, sha256, strip_prefix, type, url)

Downloads a Bazel repository as a compressed archive file from a GCS bucket, decompresses it, and makes its targets available for binding.

It supports the following file extensions: "zip", "jar", "war", "aar", "tar", "tar.gz", "tgz", "tar.xz", "txz", "tar.zst", "tzst", "tar.bz2", "ar", or "deb".

Examples: Suppose your code depends on a private library packaged as a .tar.gz which is available from gs://my_org_code/libmagic.tar.gz. This .tar.gz file contains the following directory structure:

  MODULE.bazel
  src/
    magic.cc
    magic.h

In the local repository, the user creates a magic.BUILD file which contains the following target definition:

  cc_library(
      name = "lib",
      srcs = ["src/magic.cc"],
      hdrs = ["src/magic.h"],
  )

Targets in the main repository can depend on this target if the following lines are added to MODULE.bazel:

  gcs_archive = use_repo_rule("@rules_gcs//gcs:repo_rules.bzl", "gcs_archive")

  gcs_archive(
      name = "magic",
      url = "gs://my_org_code/libmagic.tar.gz",
      sha256 = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
      build_file = "@//:magic.BUILD",
  )

Then targets would specify @magic//:lib as a dependency.

ATTRIBUTES

Name	Description	Type	Mandatory	Default
name	A unique name for this repository.	Name	required
build_file	The file to use as the BUILD file for this repository. This attribute is an absolute label (use '@//' for the main repo). The file does not need to be named BUILD, but can be (something like BUILD.new-repo-name may work well for distinguishing it from the repository's actual BUILD files). Either build_file or build_file_content can be specified, but not both.	Label	optional	`None`
build_file_content	The content for the BUILD file for this repository. Either build_file or build_file_content can be specified, but not both.	String	optional	`""`
canonical_id	A canonical ID of the file downloaded. If specified and non-empty, Bazel will not take the file from cache, unless it was added to the cache by a request with the same canonical ID. If unspecified or empty, Bazel by default uses the URLs of the file as the canonical ID. This helps catch the common mistake of updating the URLs without also updating the hash, resulting in builds that succeed locally but fail on machines without the file in the cache.	String	optional	`""`
integrity	Expected checksum in Subresource Integrity format of the file downloaded. This must match the checksum of the file downloaded. It is a security risk to omit the checksum as remote files can change. At best omitting this field will make your build non-hermetic. It is optional to make development easier but either this attribute or `sha256` should be set before shipping.	String	optional	`""`
patch_strip	Strip the specified number of leading components from file names.	Integer	optional	`0`
patches	A list of files that are to be applied as patches after extracting the archive. It uses the Bazel-native patch implementation, which does not support fuzz matching or binary patches.	List of labels	optional	`[]`
rename_files	An optional dict specifying files to rename during the extraction. Archive entries with names exactly matching a key will be renamed to the value, prior to any directory prefix adjustment. This can be used to extract archives that contain non-Unicode filenames, or which have files that would extract to the same path on case-insensitive filesystems.	Dictionary: String -> String	optional	`{}`
repo_mapping	In `WORKSPACE` context only: a dictionary from local repository name to global repository name. This allows control over workspace dependency resolution for dependencies of this repository. For example, an entry `"@foo": "@bar"` declares that, any time this repository depends on `@foo` (such as a dependency on `@foo//some:target`), it should actually resolve that dependency within globally-declared `@bar` (`@bar//some:target`). This attribute is not supported in `MODULE.bazel` context (when invoking a repository rule inside a module extension's implementation function).	Dictionary: String -> String	optional
sha256	The expected SHA-256 of the file downloaded. This must match the SHA-256 of the file downloaded. It is a security risk to omit the SHA-256 as remote files can change. At best omitting this field will make your build non-hermetic. It is optional to make development easier but either this attribute or `integrity` should be set before shipping.	String	optional	`""`
strip_prefix	A directory prefix to strip from the extracted files. Many archives contain a top-level directory that contains all of the useful files in the archive. Instead of needing to specify this prefix over and over in the `build_file`, this field can be used to strip it from all of the extracted files. For example, suppose you are using `foo-lib-latest.zip`, which contains the directory `foo-lib-1.2.3/` under which there is a `WORKSPACE` file and there are `src/`, `lib/`, and `test/` directories that contain the actual code you wish to build. Specify `strip_prefix = "foo-lib-1.2.3"` to use the `foo-lib-1.2.3` directory as your top-level directory. Note that if there are files outside of this directory, they will be discarded and inaccessible (e.g., a top-level license file). This includes files/directories that start with the prefix but are not in the directory (e.g., `foo-lib-1.2.3.release-notes`). If the specified prefix does not match a directory in the archive, Bazel will return an error.	String	optional	`""`
type	The archive type of the downloaded file. By default, the archive type is determined from the file extension of the URL. If the file has no extension, you can explicitly specify one of the following: `"zip"`, `"jar"`, `"war"`, `"aar"`, `"tar"`, `"tar.gz"`, `"tgz"`, `"tar.xz"`, `"txz"`, `"tar.zst"`, `"tzst"`, `"tar.bz2"`, `"ar"`, or `"deb"`.	String	optional	`""`
url	A URL to a file that will be made available to Bazel. This must be a 'gs://' URL.	String	required
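The sha256 and integrity attributes describe the same kind of checksum in two encodings: a hex digest and a Subresource Integrity (SRI) string. As an illustration, the following Python sketch converts one into the other. The function name sha256_hex_to_sri is hypothetical and not part of rules_gcs.

```python
import base64
import hashlib


def sha256_hex_to_sri(hex_digest: str) -> str:
    """Convert a SHA-256 hex digest into an SRI string ("sha256-<base64>")."""
    raw = bytes.fromhex(hex_digest)
    if len(raw) != hashlib.sha256().digest_size:
        raise ValueError("expected a 64-character SHA-256 hex digest")
    return "sha256-" + base64.b64encode(raw).decode("ascii")


def file_sri(path: str) -> str:
    """Compute the SRI string of a local file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return sha256_hex_to_sri(digest.hexdigest())
```

Either value can then be pasted into the matching attribute. Setting one of the two is enough.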

gcs_file

load("@rules_gcs//gcs:repo_rules.bzl", "gcs_file")

gcs_file(name, canonical_id, downloaded_file_path, executable, integrity, repo_mapping, sha256, url)

Downloads a file from a GCS bucket and makes it available to be used as a file group.

Examples: Suppose you need to have a large file that is read during a test and is stored in a private bucket. This file is available from gs://my_org_assets/testdata.bin. Then you can add to your MODULE.bazel file:

  gcs_file = use_repo_rule("@rules_gcs//gcs:repo_rules.bzl", "gcs_file")

  gcs_file(
      name = "my_testdata",
      url = "gs://my_org_assets/testdata.bin",
      sha256 = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
  )

Targets can then depend on this file by specifying @my_testdata//file as a dependency.

ATTRIBUTES

Name	Description	Type	Mandatory	Default
name	A unique name for this repository.	Name	required
canonical_id	A canonical ID of the file downloaded. If specified and non-empty, Bazel will not take the file from cache, unless it was added to the cache by a request with the same canonical ID. If unspecified or empty, Bazel by default uses the URLs of the file as the canonical ID. This helps catch the common mistake of updating the URLs without also updating the hash, resulting in builds that succeed locally but fail on machines without the file in the cache.	String	optional	`""`
downloaded_file_path	Optional output path for the downloaded file. The remote path from the URL is used as a fallback.	String	optional	`""`
executable	If the downloaded file should be made executable.	Boolean	optional	`False`
integrity	Expected checksum in Subresource Integrity format of the file downloaded. This must match the checksum of the file downloaded. It is a security risk to omit the checksum as remote files can change. At best omitting this field will make your build non-hermetic. It is optional to make development easier but either this attribute or `sha256` should be set before shipping.	String	optional	`""`
repo_mapping	In `WORKSPACE` context only: a dictionary from local repository name to global repository name. This allows control over workspace dependency resolution for dependencies of this repository. For example, an entry `"@foo": "@bar"` declares that, any time this repository depends on `@foo` (such as a dependency on `@foo//some:target`), it should actually resolve that dependency within globally-declared `@bar` (`@bar//some:target`). This attribute is not supported in `MODULE.bazel` context (when invoking a repository rule inside a module extension's implementation function).	Dictionary: String -> String	optional
sha256	The expected SHA-256 of the file downloaded. This must match the SHA-256 of the file downloaded. It is a security risk to omit the SHA-256 as remote files can change. At best omitting this field will make your build non-hermetic. It is optional to make development easier but should be set before shipping.	String	optional	`""`
url	A URL to a file that will be made available to Bazel. This must be a 'gs://' URL.	String	required

gcs_bucket

gcs_bucket = use_extension("@rules_gcs//gcs:extensions.bzl", "gcs_bucket")
gcs_bucket.from_file(name, bucket, dev_dependency, lockfile, method)

Downloads a collection of objects from a GCS bucket and makes them available under a single hub repository name.

Examples: Suppose your code depends on a collection of large assets that are used during code generation or testing. Those assets are stored in a private GCS bucket gs://my_org_assets.

In the local repository, the user creates a gcs_lock.json JSON lockfile describing the required objects, including their expected hashes:

    {
        "trainingdata/model/small.bin": {
            "sha256": "abd83816bd236b266c3643e6c852b446f068fe260f3296af1a25b550854ec7e5"
        },
        "trainingdata/model/medium.bin": {
            "sha256": "c6f9568f930b16101089f1036677bb15a3185e9ed9b8dbce2f518fb5a52b6787"
        },
        "trainingdata/model/large.bin": {
            "sha256": "b3ccb0ba6f7972074b0a1e13340307abfd5a5eef540c521a88b368891ec5cd6b"
        },
        "trainingdata/model/very_large.bin": {
            "remote_path": "weird/nested/path/extra/model/very_large.bin",
            "integrity": "sha256-Oibw8PV3cDY84HKv3sAWIEuk+R2s8Hwhvlg6qg4H7uY="
        }
    }

The exact format for the lockfile is a JSON object where each key is a path to a local file in the repository and the value is a JSON object with the following keys:

sha256: the expected sha256 hash of the file. Required unless integrity is used.
integrity: the expected SRI value of the file. Required unless sha256 is used.
remote_path: name of the object within the bucket. If not set, the local path is used.
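The rules above can be checked before committing a lockfile. The following Python sketch validates the structure and resolves each local path to the object name it would be fetched from. It is an illustration of the described format only. The function name resolve_lockfile is hypothetical, and the extension itself may apply further checks.

```python
import json


def resolve_lockfile(text: str) -> dict:
    """Map each local path to the bucket object it should be fetched from."""
    entries = json.loads(text)
    if not isinstance(entries, dict):
        raise ValueError("lockfile must be a JSON object")
    resolved = {}
    for local_path, spec in entries.items():
        if not isinstance(spec, dict):
            raise ValueError(f"{local_path}: value must be a JSON object")
        # At least one checksum is required for every object.
        if "sha256" not in spec and "integrity" not in spec:
            raise ValueError(f"{local_path}: needs sha256 or integrity")
        # remote_path falls back to the local path when it is not set.
        resolved[local_path] = spec.get("remote_path", local_path)
    return resolved
```

For the example lockfile above, the entry for trainingdata/model/very_large.bin would resolve to weird/nested/path/extra/model/very_large.bin, while the other three entries resolve to their own keys.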

Targets in the main repository can depend on these objects if the following lines are added to MODULE.bazel:
```
gcs_bucket = use_extension("@rules_gcs//gcs:extensions.bzl", "gcs_bucket")
gcs_bucket.from_file(
  name = "trainingdata",
  bucket = "my_org_assets",
  lockfile = "@//:gcs_lock.json",
)
# Make the hub repository visible to this module.
use_repo(gcs_bucket, "trainingdata")
```
Then targets would specify labels like @trainingdata//:trainingdata/model/very_large.bin as a dependency.

TAG CLASSES

from_file

Attributes

Name	Description	Type	Mandatory	Default
name	Name of the hub repository referencing all blobs	Name	required
bucket	Name of the GCS bucket	String	required
dev_dependency	If true, this dependency will be ignored if the current module is not the root module or `--ignore_dev_dependency` is enabled.	Boolean	optional	`False`
lockfile	JSON lockfile containing objects to load from the GCS bucket	Label	required
method	Method used for downloading: `symlink`: lazy fetching with symlinks, `alias`: lazy fetching with alias targets, `copy`: lazy fetching with full file copies, `eager`: all objects are fetched eagerly	String	optional	`"symlink"`

Troubleshooting

Credential helper not found

WARNING: Error retrieving auth headers, continuing without: Failed to get credentials for 'https://storage.googleapis.com/broad-public-datasets/intervals_hg38.list' from helper 'tools/credential-helper': Cannot run program "tools/credential-helper" (in directory "..."): error=2, No such file or directory

You need to download and vendor the credential helper script, or install it on your $PATH, as explained above.

Credential helper not working

WARNING: Error retrieving auth headers, continuing without: Failed to get credentials for 'https://storage.googleapis.com/...' from helper 'tools/credential-helper': process timed out

Run gcloud auth application-default print-access-token to see why it fails and ensure you are logged in and have application default credentials configured.

HTTP 401 or 403 error codes
```
ERROR: Target parsing failed due to unexpected exception: java.io.IOException: Error downloading [https://storage.googleapis.com/...] to ...: GET returned 403 Forbidden
```
Ensure the user you are logged in as has access to the bucket using gsutil ls gs://<BUCKET_NAME>/<OBJECT> and check if the credential helper is configured in .bazelrc like this: --credential_helper=storage.googleapis.com=%workspace%/tools/credential-helper.

Checksum mismatch (empty file downloaded)
```
Error in wait: com.google.devtools.build.lib.bazel.repository.downloader.UnrecoverableHttpException: Checksum was e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 but wanted <actual>
```
Check if you are using --experimental_remote_downloader. If you are, the remote cache may drop your auth header and silently return empty files instead. One workaround is setting --experimental_remote_downloader_local_fallback in .bazelrc.
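The checksum in the error message above is the SHA-256 digest of zero bytes, which is how this failure can be recognized. A short Python sketch of that check follows. The helper name is_empty_file_digest is hypothetical.

```python
import hashlib

# SHA-256 of empty input; Bazel reports this when the download had no content.
EMPTY_SHA256 = hashlib.sha256(b"").hexdigest()


def is_empty_file_digest(hex_digest: str) -> bool:
    """Return True if the reported checksum belongs to an empty file."""
    return hex_digest.strip().lower() == EMPTY_SHA256
```

If the "Checksum was ..." value in your own log passes this check, the download most likely succeeded without authentication and produced an empty body, rather than fetching a wrong or corrupted object.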

Acknowledgements

_rules_gcs was initially developed by IMAX and is maintained by Tweag._

[^1]: The gcloud CLI tool is still required to obtain authentication tokens in the credential helper.
