runatlantis / atlantis

Terraform Pull Request Automation
https://www.runatlantis.io

Create a new flag / environment variable for the shared provider plugin-cache directory #3238

Open snorlaX-sleeps opened 1 year ago

snorlaX-sleeps commented 1 year ago

Describe the user story

Using an EBS volume in EKS, Atlantis can be a very long-running service. Using version constraints like ~> 4 or ~> 3 for providers (specifically AWS) results in a large number of provider versions being downloaded over time. Eventually several GB of providers will be present on disk, even though they are no longer in use. With a small volume, e.g. 5-10 GB for small Terraform repos, this can fill the disk and eventually cause plans to fail, even though the Terraform repos themselves, with multiple workspaces and folders (10+ projects), only take < 100 MB of space.

This specifically concerns larger providers, in the 200 MB+ range.

Creating a cron job to clean this out would be easy with a separate EFS volume for providers, as that allows multi-access. With a single EBS volume, however, a sidecar container has to be created in the same pod, which requires building an image.

Describe the solution you'd like

Add a flag / environment variable that sets the location of the plugin cache independently of the main Atlantis data-dir. The location is currently the data directory plus a constant, so it should be possible to add an override, similar to how the data-dir can be anywhere Atlantis has access to. As this is actually a Terraform variable, rather than an Atlantis variable, it should not affect Atlantis' functionality.
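As a rough illustration of the requested behaviour, the override could resolve like the sketch below. The function name resolve_plugin_cache_dir and the second argument (standing in for the requested flag / environment variable) are hypothetical and not part of Atlantis; only the default of data directory plus "plugin-cache" is taken from the paths used in this issue.

```shell
#!/bin/sh
# Sketch only: resolve_plugin_cache_dir is a hypothetical helper, and the
# override argument stands in for the flag requested in this issue.
resolve_plugin_cache_dir() {
  data_dir="$1"       # the Atlantis data directory
  override="${2:-}"   # optional explicit plugin-cache location

  if [ -n "$override" ]; then
    # An explicit location wins, e.g. a separately mounted volume.
    printf '%s\n' "$override"
  else
    # Otherwise fall back to the data directory plus the constant.
    printf '%s\n' "$data_dir/plugin-cache"
  fi
}
```

The result is the kind of value that would then be handed to Terraform through its TF_PLUGIN_CACHE_DIR environment variable.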

It is a similar request to https://github.com/runatlantis/atlantis/issues/916, however we are more concerned about the plugin cache than the Terraform repos or other Atlantis data, as the provider-cache is the only thing that continuously grows over time.

Describe the drawbacks of your solution

Unsure how a provider cache having a different location would be an issue, as it only affects Terraform rather than Atlantis' functionality.

If using EFS you could effectively share a provider cache amongst several different Atlantis installations, but then you would be more likely to run into some theoretical issues if multiple installations start to plan at the same time, e.g. https://github.com/runatlantis/atlantis/issues/2242

Describe alternatives you've considered

Current issue: disk space cleanup using cron. As only the providers cause the issue, only the shared cache needs to be cleaned out once providers are no longer in use; repos are deleted when atlantis unlock is executed. Current workaround: build a sidecar image with cron installed (currently using Debian) and run it in the same pod as Atlantis. A separate pod or K8s cron job cannot be used, as it cannot access the EBS volume even when on the same node as Atlantis, due to the limitations of the EBS CSI driver.

EFS cannot be used as the main Atlantis data dir because it is much slower at writing small files, which is basically what a terraform init && terraform plan does. "If we just bumped our volume size higher" then EFS would become significantly faster, but it would cost more. Using EBS for a large number of Atlantis instances seems the most cost-effective way to do it, with good RW speeds, but then clearing / managing the storage becomes slightly more manual.

Current cron for context / others


Find files that have not been accessed in the last 2 weeks and remove them from the data directory

find \
  "$ATLANTIS_DATA_DIR/plugin-cache/registry.terraform.io" \
  -mindepth 1 \
  -type f \
  -not \
  -newerat '-2 weeks' \
  -delete

Find all empty directories and delete them from the data directory

find \
  "$ATLANTIS_DATA_DIR/plugin-cache/registry.terraform.io" \
  -mindepth 1 \
  -type d \
  -empty \
  -delete
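
The two commands above can be wrapped into one function with a dry-run switch, which makes it easier to check what would be removed before running it from cron. This is a minimal sketch, not an Atlantis feature: clean_plugin_cache and its arguments are hypothetical names, and it uses -atime instead of -newerat, so the age is counted in whole days.

```shell
#!/bin/sh
# Sketch only: clean_plugin_cache is a hypothetical helper name.
clean_plugin_cache() {
  cache_dir="$1"      # e.g. "$ATLANTIS_DATA_DIR/plugin-cache/registry.terraform.io"
  max_age_days="$2"   # files last accessed more than this many whole days ago are stale
  dry_run="${3:-0}"   # pass 1 to only print what would be removed

  # Refuse to run against a missing directory.
  [ -d "$cache_dir" ] || { echo "no such directory: $cache_dir" >&2; return 1; }

  if [ "$dry_run" = "1" ]; then
    find "$cache_dir" -mindepth 1 -type f -atime +"$max_age_days" -print
    return 0
  fi

  # Remove stale files first, then any directories left empty.
  find "$cache_dir" -mindepth 1 -type f -atime +"$max_age_days" -delete
  find "$cache_dir" -mindepth 1 -type d -empty -delete
}
```

For example, clean_plugin_cache "$ATLANTIS_DATA_DIR/plugin-cache/registry.terraform.io" 14 1 lists candidates without deleting anything. Both variants rely on access times, so on a volume mounted with noatime, providers that are still in use may look stale.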

nitrocode commented 10 months ago

Thanks for sharing these commands. I formatted them and added a description for each one. It might be an easy command to wedge into a custom workflow in a pre workflow hook.

https://www.runatlantis.io/docs/pre-workflow-hooks.html#atlantis-command-targetting

For the find command that is native, you can use this. I've been testing this out and it's worked well for me.

repos:
  - id: /.*/
    pre_workflow_hooks:
      - description: Clean up old files
        commands: plan
        run: |
          last_accessed_weeks="2"
          dir_to_clean="$ATLANTIS_DATA_DIR/plugin-cache/registry.terraform.io"
          echo "Clean up old files in $dir_to_clean not accessed in the last $last_accessed_weeks weeks"
          # clean up old files
          find \
            "$dir_to_clean" \
            -type f \
            -atime +$(($last_accessed_weeks*7)) \
            -delete \
            -print
          # clean up empty dirs
          find \
            "$dir_to_clean" \
            -mindepth 1 \
            -type d \
            -empty \
            -delete

For the find command mentioned by OP, you need to apk add findutils because the flag -newerat is not in the default busybox find. I'd recommend the above native solution instead.

          find \
            "$dir_to_clean" \
            -mindepth 1 \
            -type f \
            -not \
            -newerat "-$last_accessed_weeks weeks" \
            -delete
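
If the same hook has to run on images with and without findutils, one option is to probe for -newerat and fall back to -atime. The sketch below only lists stale files; find_supports_newerat and list_stale_files are hypothetical helper names, and the two branches are not exactly equivalent because -atime counts whole days.

```shell
#!/bin/sh
# Sketch only: both helper names are hypothetical.

# Succeeds when the installed find accepts the -newerat test.
find_supports_newerat() {
  find /dev/null -maxdepth 0 -newerat '1 day ago' >/dev/null 2>&1
}

# Print files under $1 that were not accessed within the last $2 weeks.
list_stale_files() {
  dir="$1"
  weeks="$2"
  if find_supports_newerat; then
    # findutils: compare access time against a relative date.
    find "$dir" -mindepth 1 -type f -not -newerat "-$weeks weeks" -print
  else
    # Fallback for a find without -newerat: age in days.
    find "$dir" -mindepth 1 -type f -atime +"$((weeks * 7))" -print
  fi
}
```

Swapping -print for -delete turns the listing into the cleanup itself once the output looks right.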