gruntwork-io / terragrunt

Terragrunt is a flexible orchestration tool that allows Infrastructure as Code written in OpenTofu/Terraform to scale.
https://terragrunt.gruntwork.io/
MIT License

Proposal: Create Content Addressable Store for Terragrunt #2923

Closed yhakbar closed 2 weeks ago

yhakbar commented 7 months ago

This is a proposal for addressing part of #2920. I've broken it out here into another issue to avoid bloating the main issue with a massive comment, especially given that this solution might not be desired, and the focus right now is on providing documentation.

One higher-investment feature that might help is a content-addressable store (similar to what pnpm built to address the same issue of bloated node_modules directories) in a central location on a user's machine, somewhere like ~/.terragrunt-store (or ~/.terragrunt-module-store or ~/.terragrunt-tf-module-store).

This would require:

  1. Performing a minimal clone of a module at a given ref into a temporary directory.

e.g.

$ cd $(mktemp -d)
$ git clone --bare --depth 1 --single-branch -b v0.54.22 https://github.com/gruntwork-io/terragrunt

  2. Inspecting git to find the checksums of all the files within git for a particular Terraform/Tofu module to update the store as necessary with relevant blobs, traversing the tree as necessary.

e.g.

$ cd terragrunt.git
$ git cat-file -p $(git cat-file -p $(git rev-parse HEAD) | head -n1 | awk '{print $2}') | head -n5
040000 tree 18a76c119c20004c7db6149e45863eaa09cc3cd8    .circleci
040000 tree 3042cd98206816a0642de81c7c66a3453f3c563f    .github
100644 blob 3c0f84693064d04e84e327d7eab0356608d65233    .gitignore
100644 blob ddd5e95e1bb2054a88203e7ff0f14ea45e481a03    .golangci.yml
100644 blob 553725461982e770a9d73b87068d7319e44e058f    .gon_amd64.hcl
$ blob='3c0f84693064d04e84e327d7eab0356608d65233' 
$ test -f ~/.terragrunt-store/$blob || git cat-file -p $blob > ~/.terragrunt-store/$blob
...
$ git cat-file -p 18a76c119c20004c7db6149e45863eaa09cc3cd8
100644 blob 9e03dbb3dc67fb476d4e4b76ee975431292b31e7    config.yml
$ blob='9e03dbb3dc67fb476d4e4b76ee975431292b31e7' 
$ test -f ~/.terragrunt-store/$blob || git cat-file -p $blob > ~/.terragrunt-store/$blob
...

  3. Hard/soft linking to the stored files when generating a .terragrunt-cache folder.

e.g.

$ mkdir -p .terragrunt-cache/...
$ ln ~/.terragrunt-store/3c0f84693064d04e84e327d7eab0356608d65233 .terragrunt-cache/.../.gitignore
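
Steps 2 and 3 can also be sketched without a git checkout at hand. The snippet below is a minimal illustration, not the proposed implementation: blob_hash and store_and_link are hypothetical helper names, and sha1sum (GNU coreutils) is assumed to be installed. It relies on the fact that git names a blob by the SHA-1 of the header "blob <size>\0" followed by the file contents.

```shell
# Sketch only: blob_hash and store_and_link are hypothetical helpers.
# Assumes sha1sum (GNU coreutils) is available.

# Print the hash git would assign to a file stored as a blob:
# SHA-1 over the header "blob <size>\0" followed by the file contents.
blob_hash() {
  size=$(wc -c < "$1")
  { printf 'blob %d\0' "$size"; cat "$1"; } | sha1sum | cut -d' ' -f1
}

# Copy a file into the store (only if its hash is not there yet),
# then hard link the stored copy to the destination path.
# usage: store_and_link <store dir> <source file> <destination path>
store_and_link() {
  hash=$(blob_hash "$2")
  mkdir -p "$1" "$(dirname "$3")"
  [ -f "$1/$hash" ] || cp "$2" "$1/$hash"
  ln -f "$1/$hash" "$3"
}
```

Calling store_and_link twice with identical content and two different destinations leaves a single file in the store, with both destinations pointing at the same inode.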

Advantages

  1. There is one copy of every unique file in a TF module stored in the filesystem at a given time, even across versions of a module and across different modules, reducing disk bloat.
  2. Re-installs of a module should be close to free, as all the filesystem updates will just be linking the stored files to the .terragrunt-cache, speeding up CI run times.
  3. Reduces the impact of "useless" TF module files (like documentation, CI, testing, etc.), as those files will only take up more disk space when they change.

Drawbacks

  1. Definitely more complicated than just cloning the repo.
  2. Different filesystems support varying levels of linking, which might complicate this further.
  3. Initial Terragrunt inits for individual modules that experience a complete cache miss will be slower, as it's probably faster to just clone the repo than to do all the persistence logic as well.

denis256 commented 7 months ago

Currently, I think most of the space is consumed by Terraform provider binaries, which are downloaded by Terraform. We could probably share a single copy of each provider, but that would require fetching the providers ourselves and linking them between modules.

yhakbar commented 7 months ago

Wouldn't setting the provider-plugin-cache handle this on the TF side? Is there a race condition or something that prevents the provider binaries from being shared when being called indirectly by TG?

denis256 commented 7 months ago

Concurrent initialization may be an issue, Terraform docs:

Note: The plugin cache directory is not guaranteed to be concurrency safe. The provider installer's behavior in environments with multiple terraform init calls is undefined.

yhakbar commented 7 months ago

Could this approach theoretically make it safe to do concurrent initialization by populating a custom provider cache manually before doing any tf inits?

e.g.

# Run the equivalent of this prior to any terraform init.
❯ terragrunt version
Terraform v1.5.7
on darwin_arm64
+ provider registry.terraform.io/hashicorp/aws v5.34.0
+ provider registry.terraform.io/hashicorp/helm v2.5.1
+ provider registry.terraform.io/hashicorp/tls v4.0.5

# Setup custom TF plugin cache for Terragrunt
❯ export TF_PLUGIN_CACHE_DIR="$HOME/.terragrunt-store/.terraform.d/plugin-cache"
❯ mkdir -p $TF_PLUGIN_CACHE_DIR 

# Download each of the TF plugins manually using the data from the `version` call above
❯ mkdir -p $TF_PLUGIN_CACHE_DIR/registry.terraform.io/hashicorp/aws/5.34.0/darwin_arm64
❯ cd $TF_PLUGIN_CACHE_DIR/registry.terraform.io/hashicorp/aws/5.34.0/darwin_arm64
❯ wget "https://releases.hashicorp.com/terraform-provider-aws/5.34.0/terraform-provider-aws_5.34.0_darwin_arm64.zip" -O terraform-provider-aws_5.34.0_darwin_arm64.zip
# Verify checksum
❯ curl -s https://releases.hashicorp.com/terraform-provider-aws/5.34.0/terraform-provider-aws_5.34.0_SHA256SUMS | grep -q "$(sha256sum terraform-provider-aws_5.34.0_darwin_arm64.zip)"
# Extract
❯ unzip -q terraform-provider-aws_5.34.0_darwin_arm64.zip
❯ rm -f terraform-provider-aws_5.34.0_darwin_arm64.zip

For each group we're running, we would be able to run tf version in each of the uninitialized TF modules to get their versions, dedup them, then concurrently setup their provider binaries before initializing TF.
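
The collect-and-dedup step described above could look roughly like the following. This is a sketch under assumptions: unique_providers and provider_cache_path are hypothetical helper names, the input is assumed to be the concatenated output of the version calls in the format shown earlier, and the cache layout simply mirrors the directory created in the example.

```shell
# Sketch only: unique_providers and provider_cache_path are hypothetical helpers.

# Read one or more `version` outputs on stdin and print each distinct
# "<source> <version>" pair once, with the leading "v" stripped.
unique_providers() {
  awk '$1 == "+" && $2 == "provider" { sub(/^v/, "", $4); print $3, $4 }' | sort -u
}

# Print the plugin cache directory for one provider, mirroring the
# <cache>/<source>/<version>/<platform> layout used in the example above.
# usage: provider_cache_path <cache dir> <source> <version> <platform>
provider_cache_path() {
  printf '%s/%s/%s/%s\n' "$1" "$2" "$3" "$4"
}
```

Each line printed by unique_providers can then be handed to a separate download job, so that no two jobs write to the same cache directory.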

denis256 commented 7 months ago

Yes, we need to think about how this can be handled.

yhakbar commented 6 months ago

I created a prototype of this functionality in Rust with this: https://github.com/yhakbar/cln

I included some mini-benchmarks and an explanation of how it works here, along with some advantages/disadvantages/known issues I noticed as a consequence of implementing it.

If we're fans of the functionality, I can help with the Golang implementation, or I can fork the repo into the gruntwork-io organization, and we can run a separate Rust binary as part of the cloning process.

I think if we wanted to leverage this implementation for Terragrunt, the source URL would have to have a special flag to opt-in for this behavior, as there are two major flaws with it:

  1. The module that's used has to be read-only, as the content in the CAS has to be immutable for reliable re-usage across multiple .terragrunt-cache directories.
  2. The ref used in the terragrunt.hcl source can't be a commit, as the implementation doesn't support that. It's possible that someone better than me with git can figure out how to do that efficiently, though.

brikis98 commented 6 months ago

Oh, very cool!

A few thoughts/questions:

  1. How do we integrate this with TG and TF? TG uses go-getter to download whatever is in the source URL... TF does too. The point is that in neither case do we call git clone directly. Would we have to somehow intercept the git clone calls while TG is running?
  2. When you run cln <URL> in folder foo, you store the code in the central content store, and you then create symlinks from foo to each file in the central content store? So then if we are using generate blocks in Terragrunt, or copying files from the working dir, all of that goes into foo and not the central content store, right?
  3. What are you actually storing in the central content store? What is the format of data storage? What happens when someone re-runs init on the same URL? Or a new one?

yhakbar commented 6 months ago

1

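I think we would want to either:

  • Adjust the configurations available in the terraform block in Terragrunt to include a cas field.
  • Or add a custom query parameter to the source field in the terraform block in Terragrunt to include a cas parameter.

Either way, the goal would be to utilize cln, or an internal Golang implementation of the same logic, which uses a CAS when fetching the source instead of using go-getter.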

It would be important for this to bypass go-getter entirely, as:

  1. The solution cln uses is not able to support some functionality that go-getter has like utilizing a specific commit.
  2. It's not helpful when the source isn't remote, like when using a local path.
  3. The module used does not work the same way as when using go-getter, as the files have to be read-only to ensure integrity of the CAS.

2

If you're in the folder foo, and you run cln <URL>, you'll have a folder <repo name> in foo with each file in the directory <repo name> being a hard link to the appropriate stored file in the CAS. If a generate block, etc ends up adding files to foo/<repo name>, they will either be net-new files that won't interact at all with the CAS, or they will overwrite the hard link with the generated real file if they have the same name as something in the CAS.
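
Whether a generated file really replaces the link, rather than writing through it, depends on how the file is written: a hard link shares its data with the CAS entry, so an in-place write would change the stored content, while writing a new file and renaming it over the link leaves the CAS untouched. A minimal sketch of the safe pattern, with replace_linked_file as a hypothetical helper name:

```shell
# Sketch only: replace_linked_file is a hypothetical helper.
# Replace a hard-linked file without touching the CAS entry it shares data with:
# write the new content to a temporary file in the same directory,
# then rename that file over the link.
# usage: replace_linked_file <path>   (new content is read from stdin)
replace_linked_file() {
  tmp_file=$(mktemp "$(dirname "$1")/.generated.XXXXXX")
  cat > "$tmp_file"
  mv -f "$tmp_file" "$1"
}
```

After the rename, the path in the module directory refers to a new inode, and the read-only file in the store still holds the original content.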

3

This is what the contents of the ~/.cln-store directory look like after running cln:

$ cln https://github.com/gruntwork-io/terragrunt.git
Cloning into bare repository '/var/folders/x3/j561187d7bn7j25xf6hs73wr0000gn/T/cln.AUT8aYH0lFuZ'...
remote: Enumerating objects: 1945, done.
remote: Counting objects: 100% (1945/1945), done.
remote: Compressing objects: 100% (1426/1426), done.
remote: Total 1945 (delta 176), reused 1581 (delta 141), pack-reused 0
Receiving objects: 100% (1945/1945), 4.06 MiB | 14.29 MiB/s, done.
Resolving deltas: 100% (176/176), done.
~/tmp took 2s 
$ ls ~/.cln-store | head -n5
0a55bbcb55453db6fa28e8a21b400a13988b637d
0a63fcb7fdb31927cb91548ec2ad8eba71f8e4fe
0a208eecc2ff325e4e1129b3c20f135063a33889
0a86533fa9d829c9a9cfcda7bf8357b996d3f9e5
0a415612a210e20dd504b0c04c6f6c73107e55e9

Each of those files corresponds to either a tree or blob object in git.

When cln is first run, it executes the following to determine what the tree ref is for the current remote HEAD (or branch or tag, if you pass in the -b flag):

$ git ls-remote https://github.com/gruntwork-io/terragrunt HEAD
43c95b202180d45c1cf8f790d09835f8afe2a53e    HEAD

That tells cln whether or not it needs to perform a fetch. If the file ~/.cln-store/43c95b202180d45c1cf8f790d09835f8afe2a53e exists, it can reconstruct the entire <repo name> directory from the CAS without doing any git operation.
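
That decision can be expressed as a small check. The sketch below is not taken from cln: fetch_plan is a hypothetical helper, and it assumes the store keeps one file per hash directly under the store directory, as in the listing above.

```shell
# Sketch only: fetch_plan is a hypothetical helper, and the flat store layout
# (one file per hash directly under the store directory) is an assumption.
# Reads `git ls-remote` output on stdin and prints "reuse <hash>" when the
# hash is already in the store, or "clone <hash>" when a fetch is needed.
# usage: git ls-remote <url> HEAD | fetch_plan <store dir>
fetch_plan() {
  # The first field of the first line is the hash; the rest is the ref name.
  read -r hash _ref || return 1
  if [ -f "$1/$hash" ]; then
    echo "reuse $hash"
  else
    echo "clone $hash"
  fi
}
```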

If it doesn't exist, it will perform a minimal clone to get the contents of the git repo (note that if the -b flag is passed in, the equivalent -b flag will be passed to the clone to have git clone the appropriate branch/tag):

$ git clone --bare --depth 1 --single-branch https://github.com/gruntwork-io/terragrunt
Cloning into bare repository 'terragrunt.git'...
remote: Enumerating objects: 1945, done.
remote: Counting objects: 100% (1945/1945), done.
remote: Compressing objects: 100% (1426/1426), done.
remote: Total 1945 (delta 176), reused 1581 (delta 141), pack-reused 0
Receiving objects: 100% (1945/1945), 4.06 MiB | 14.64 MiB/s, done.
Resolving deltas: 100% (176/176), done.

From this point on, cln starts storing content in the CAS using git commands.

It runs the following to get git to tell it what that initial tree ref corresponds to:

# In `terragrunt.git`
$ git ls-tree '43c95b202180d45c1cf8f790d09835f8afe2a53e' | head -n5
040000 tree f07d2b1062e60846533ae393ae29610ae835609f    .circleci
040000 tree 3042cd98206816a0642de81c7c66a3453f3c563f    .github
100644 blob 3c0f84693064d04e84e327d7eab0356608d65233    .gitignore
100644 blob ddd5e95e1bb2054a88203e7ff0f14ea45e481a03    .golangci.yml
100644 blob 553725461982e770a9d73b87068d7319e44e058f    .gon_amd64.hcl

It then stores that as a file in the CAS:

$ cat ~/.cln-store/43c95b202180d45c1cf8f790d09835f8afe2a53e | head -n5
040000 tree f07d2b1062e60846533ae393ae29610ae835609f    .circleci
040000 tree 3042cd98206816a0642de81c7c66a3453f3c563f    .github
100644 blob 3c0f84693064d04e84e327d7eab0356608d65233    .gitignore
100644 blob ddd5e95e1bb2054a88203e7ff0f14ea45e481a03    .golangci.yml
100644 blob 553725461982e770a9d73b87068d7319e44e058f    .gon_amd64.hcl

Then recursively does the same thing for all the nested tree objects and stores them in the CAS as well.
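
Once the listings are in the store, the recursive walk no longer needs git at all. A rough sketch, assuming each tree is stored in git ls-tree format ("<mode> <type> <hash><TAB><name>") in a file named after its hash; list_blobs is a hypothetical helper, not a cln command:

```shell
# Sketch only: list_blobs is a hypothetical helper. It assumes tree listings
# are stored in `git ls-tree` format: "<mode> <type> <hash><TAB><name>".
# Prints "<hash> <path>" for every blob reachable from a tree listing.
# usage: list_blobs <store dir> <tree hash> [path prefix]
list_blobs() (
  # The function body runs in a subshell, so variables stay local per call.
  tab=$(printf '\t')
  while IFS="$tab" read -r meta name; do
    type=$(echo "$meta" | cut -d' ' -f2)
    hash=$(echo "$meta" | cut -d' ' -f3)
    case "$type" in
      blob) echo "$hash ${3:-}$name" ;;
      tree) list_blobs "$1" "$hash" "${3:-}$name/" ;;
    esac
  done < "$1/$2"
)
```

The output is the full list of (hash, path) pairs needed to rebuild the directory from the store.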

All the blob objects are stored as files in the CAS, each named after its SHA-1 hash and containing the contents of the object:

# The `.gitignore` blob above
$ cat ~/.cln-store/3c0f84693064d04e84e327d7eab0356608d65233 | head -n5
.*.sw?
.idea
terragrunt.iml
vendor
.terraform

The <repo name> directory is then constructed by walking the tree and creating a hard link for each blob discovered, from the relevant filename to the CAS file named after its hash. This is the part that makes it a CAS: content is only ever stored and referenced by the hash of the content.

As such, each link in the directory has the same inode as the file in the CAS (meaning they are using the same physical space on disk).

# In terragrunt
$ ls -li .gitignore | awk '{print $1}'
24520012
$ ls -li ~/.cln-store/3c0f84693064d04e84e327d7eab0356608d65233 | awk '{print $1}'
24520012

So, with the exception of the .git directory being gone, and the whole directory being read-only, the <repo name> directory is a perfect replica of a directory that would be there if you had done a full clone.

On subsequent runs of cln for the same repo at the same state, cln will run the same git ls-remote command, and realize that it already has the tree ref in the CAS, and can perform the same algorithm to reconstruct the <repo name> directory without doing any other git operations, as it has all the tree and blob refs in the CAS.

For any other URL, or for the same URL at a different branch/tag, cln will do the same minimal clone, persist the contents in the CAS, and construct the relevant directory. Note that any object whose hash already exists in the CAS results in less work done when persisting (and less space consumed). If a new tag of a module only changes one file, for example, cln will only have to persist that one changed file, and the rest will be recycled from the existing entries in the CAS. This means that multiple references to the same module take up the same space on disk, and changes to modules only take up an incremental increase in space corresponding to the files that have changed.
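
The incremental cost of a new tag can be estimated by checking which hashes are not in the store yet. A minimal sketch, assuming the same flat store layout as above and input lines of the form "<hash> <path>"; missing_hashes is a hypothetical helper:

```shell
# Sketch only: missing_hashes is a hypothetical helper.
# Reads "<hash> <path>" pairs on stdin and prints, once each,
# the hashes that do not yet exist as files in the store.
# usage: missing_hashes <store dir>
missing_hashes() {
  cut -d' ' -f1 | sort -u | while read -r hash; do
    [ -f "$1/$hash" ] || echo "$hash"
  done
}
```

Only the hashes printed here need to be written; everything else is reused from existing entries.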

brikis98 commented 6 months ago

1

I think we would want to either:

  • Adjust the configurations available in the terraform block in Terragrunt to include a cas field.
  • Or add a custom query parameter to the source field in the terraform block in Terragrunt to include a cas parameter.

In that case, would this only work for the source URL in the root TG module, but not any of the modules that it references internally? I think, for us, that wouldn't be a great ratio, as each root-level module ("service") tends to include 3-6 sub-modules, so we'd be building a CAS that only gets used for a small percentage of all modules downloaded/used.

It seems like it would only be worth it if the CAS was used by both TG and TF. So I wonder if there is some Git config magic or similar we can do to inject this behavior for all Git URLs for anyone that opts into using the CAS?

2

If you're in the folder foo, and you run cln <URL>, you'll have a folder <repo name> in foo with each file in the directory <repo name> being a hard link to the appropriate stored file in the CAS. If a generate block, etc ends up adding files to foo/<repo name>, they will either be net-new files that won't interact at all with the CAS, or they will overwrite the hard link with the generated real file if they have the same name as something in the CAS.

Roger, that's great.

3

Do you do the git ls-remote if the source URL has a ref parameter that is either a Git tag or specific commit ID? The commit ID would already be a sha1 hash, and the tag is a pointer to one that we could potentially store. In other words, once we've fetched a commit/tag into the store, we wouldn't need to query the remote Git repo ever again for that same commit/tag.

Cool, thanks for the explanation!

So, with the exception of the .git directory being gone, and the whole directory being read-only, the <repo name> directory is a perfect replica of a directory that would be there if you had done a full clone.

Ah, the lack of .git may be an issue. I believe that in multiple places in TG we look for the "Git root" in a repo, which I believe is found using the .git folder. If it's not there, those may not work.

yhakbar commented 6 months ago

In that case, would this only work for the source URL in the root TG module, but not any of the modules that it references internally? I think, for us, that wouldn't be a great ratio, as each root-level module ("service") tends to include 3-6 sub-modules, so we'd be building a CAS that only gets used for a small percentage of all modules downloaded/used.

It seems like it would only be worth it if the CAS was used by both TG and TF. So I wonder if there is some Git config magic or similar we can do to inject this behavior for all Git URLs for anyone that opts into using the CAS?

Ya, it would only help prior to the tf init. It would help with TF modules that have submodules within the same repo, but once TF starts fetching, I think we should allow TF to handle how it manages optimizing that portion.

There's an issue created last month that tries to address this for OpenTofu: https://github.com/opentofu/opentofu/issues/1199

Do you do the git ls-remote if the source URL has a ref parameter that is either a Git tag or specific commit ID? The commit ID would already be a sha1 hash, and the tag is a pointer to one that we could potentially store. In other words, once we've fetched a commit/tag into the store, we wouldn't need to query the remote Git repo ever again for that same commit/tag.

This is true for the tag and branches, ya. cln effectively does the following when the ref is a tag or branch:

$ git ls-remote https://github.com/gruntwork-io/terragrunt master | rg refs/heads/master
1e3a5373c46d0e6c58b025c79899ff56dac75814    refs/heads/master
$ git ls-remote https://github.com/gruntwork-io/terragrunt v0.55.10 | rg refs/tags/v0.55.10
1e3a5373c46d0e6c58b025c79899ff56dac75814    refs/tags/v0.55.10

The reason commits are tricky (with my current git knowledge) is that:

  1. You can't do the same ls-remote for a commit to confirm that the commit actually exists in the remote repo:

$ git ls-remote https://github.com/gruntwork-io/terragrunt 1e3a5373c46d0e6c58b025c79899ff56dac75814
$ git ls-remote https://github.com/gruntwork-io/terragrunt a-fake-commit

  2. You can't do a shallow, single branch clone of a repo at a given commit:

# this works
$ git clone --bare --depth 1 --single-branch -b v0.55.10 https://github.com/gruntwork-io/terragrunt
# this does too
$ git clone --bare --depth 1 --single-branch -b master https://github.com/gruntwork-io/terragrunt
# this doesn't
$ git clone --bare --depth 1 --single-branch -b 1e3a5373c46d0e6c58b025c79899ff56dac75814 https://github.com/gruntwork-io/terragrunt

As a consequence, a full clone has to take place, then the appropriate commit can be checked out.

This is a special case that can be accounted for, but it just means that CAS creation will be slower for commits, as a full clone has to take place for the initial population. It also means that the logic for the short-circuit has to be complicated a bit, as we have to be able to identify the ref as a commit, rather than a tag or branch, then avoid doing the ls-remote and do the full clone + checkout instead.
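
Identifying the ref as a commit could start with a simple shape check. This is only a heuristic sketch: looks_like_commit is a hypothetical helper, it only recognizes full 40-character SHA-1 hashes, and since a branch or tag can legally be named with 40 hex characters, a real implementation would still need to rule those out first.

```shell
# Sketch only: looks_like_commit is a hypothetical helper and a heuristic.
# Succeeds when the argument is exactly 40 hexadecimal characters,
# which is the shape of a full SHA-1 commit hash.
# usage: looks_like_commit <ref>
looks_like_commit() {
  case "$1" in
    *[!0-9a-fA-F]*) return 1 ;;   # contains a non-hex character
  esac
  [ "${#1}" -eq 40 ]
}
```

Refs such as master or v0.55.10 fail the check and can take the ls-remote path; refs that pass it would take the full clone and checkout path.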

Ah, the lack of .git may be an issue. I believe that in multiple places in TG we look for the "Git root" in a repo, which I believe is found using the .git folder. If it's not there, those may not work.

Do we look for the Git root of the TF module or are you referring to things like this which look for the git root of the TG repo?

If it's absolutely necessary that the .git directory exist in the module, it's also feasible to add logic to persist that. We would just have to also persist the contents of the git clone --bare, then restore that as a .git directory.

It would be slower than not doing it, and would require manually calculating the checksums, as I don't think there's a good way for git to give us the checksums of the files that store git metadata.
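
Calculating those checksums manually might look like the following. This is a sketch under assumptions: persist_dir is a hypothetical helper, sha1sum (GNU coreutils) is assumed to be available, and the hashes are plain SHA-1 sums of the file contents rather than git blob hashes.

```shell
# Sketch only: persist_dir is a hypothetical helper; sha1sum is assumed.
# Copies every file under a directory into an existing store directory, named
# by the SHA-1 of its contents, and prints a "<hash> <relative path>" manifest
# that could later be used to restore the directory.
# usage: persist_dir <dir> <store dir>
persist_dir() (
  store=$(cd "$2" && pwd) || exit 1
  cd "$1" || exit 1
  find . -type f | sort | while read -r path; do
    hash=$(sha1sum "$path" | cut -d' ' -f1)
    [ -f "$store/$hash" ] || cp "$path" "$store/$hash"
    echo "$hash ${path#./}"
  done
)
```

The manifest itself would also have to be stored somewhere keyed by the ref, which is part of the extra cost mentioned above.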

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for raising this issue.