terraform-google-modules / terraform-example-foundation

Shows how the CFT modules can be composed to build a secure cloud foundation
https://cloud.google.com/architecture/security-foundations
Apache License 2.0

Question: is there a common way to speed up pipelines? #1258

Closed: Peedee2002 closed this 1 month ago

Peedee2002 commented 1 month ago

Hi, I work for a team that manages an organisation with about 10 business units, managed under the 4-projects step. Running the build step when 4-projects is modified has been getting significantly slower as we add more projects. That makes sense, but I want to improve the situation, as the pipeline has reached 25 minutes per run for the planning step alone. Our pipeline uses the provided tf-wrapper.sh script to run the sequence of init, plan, validate and apply. Adding parallelism to the script (running all inits concurrently, then all plans concurrently, then all validates concurrently) did bring this down to under 10 minutes, which is great. However, this required us to modify tf-wrapper.sh significantly and convert it to Python, and the parallel run now holds every state lock at once, so only one plan or apply can run at a time and any other run has to wait for the locks. Is there another approach that is commonly used for managing this?
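
Concretely, the phased pattern I mean looks roughly like this. It is only a sketch: run_init, run_plan, run_validate and env_dirs are hypothetical stand-ins for the real shell-outs and the list of environment directories, not the actual functions from our script.

from concurrent.futures import ProcessPoolExecutor

def run_phase(stage_fn, env_dirs):
    # Fan a single Terraform stage out across every environment directory
    # and block until all of them finish before the next stage starts.
    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(stage_fn, d) for d in env_dirs]
    # result() re-raises any exception that occurred in a worker.
    return [f.result() for f in futures]

for stage in (run_init, run_plan, run_validate):
    run_phase(stage, env_dirs)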

This repo is modified often, usually to grant access to Google APIs or to give specific permissions to primary accounts. Are there better ways of handling these tasks?

Peedee2002 commented 1 month ago

This is the script I use (it is now in Python):

from concurrent.futures import Future, ProcessPoolExecutor
import glob
import subprocess
import os
import sys
import re
import multiprocessing

if __name__ == "__main__":
    action = sys.argv[1] if len(sys.argv) > 1 else None
    branch = sys.argv[2] if len(sys.argv) > 2 else None
    policysource = sys.argv[3] if len(sys.argv) > 3 else None
    project_id = sys.argv[4] if len(sys.argv) > 4 else None
    policy_type = sys.argv[5] if len(sys.argv) > 5 else None
    # NOTE: this lock (and the globals below) reach the worker processes via
    # fork inheritance, so this relies on the fork start method (Linux default).
    file_lock = multiprocessing.Lock()

    base_dir = os.getcwd()
    environments_regex = r"^(development|non-production|production|shared)$"

def tmp_plan(base_dir):
    return f"{base_dir}/tmp_plan"

def tf_apply(path, tf_env, tf_component, base_dir):
    if not os.path.isdir(path):
        print(f"ERROR:  {path} does not exist")
        return -1

    os.chdir(path)
    p = subprocess.run(f"terraform apply -lock-timeout=10m -input=false -auto-approve {tmp_plan(base_dir)}/{tf_component}-{tf_env}.tfplan", capture_output=True, shell=True)
    print(f"*************** TERRAFORM APPLY *******************")
    print(f"      At environment: {tf_component}/{tf_env}    ")
    print(f"***************************************************")
    print(p.stdout.decode())
    print(p.stderr.decode())
    p.check_returncode()
    os.chdir(base_dir)
    return 0

def tf_init(path, tf_env, tf_component, base_dir):
    if not os.path.isdir(path):
        print(f"ERROR:  {path} does not exist")
        return -1

    os.chdir(path)
    p = subprocess.run("terraform init -reconfigure", capture_output=True, shell=True)
    print(f"*************** TERRAFORM INIT *******************")
    print(f"      At environment: {tf_component}/{tf_env}   ")
    print(f"**************************************************")
    print(p.stdout.decode())
    print(p.stderr.decode())
    p.check_returncode()
    os.chdir(base_dir)
    return 0

def tf_plan(path, tf_env, tf_component, base_dir):
    if not os.path.isdir(tmp_plan(base_dir)):
        os.mkdir(tmp_plan(base_dir))

    if not os.path.isdir(path):
        print(f"ERROR:  {path} does not exist")
        return -1

    os.chdir(path)
    p = subprocess.run(f"terraform plan -lock-timeout=10m -input=false -out {tmp_plan(base_dir)}/{tf_component}-{tf_env}.tfplan", capture_output=True, shell=True)
    print(f"*************** TERRAFORM PLAN *******************")
    print(f"      At environment: {tf_component}/{tf_env}   ")
    print(f"**************************************************")
    print(p.stdout.decode())
    print(p.stderr.decode())
    p.check_returncode()
    os.chdir(base_dir)

    return 0

def tf_show(path, tf_env, tf_component, base_dir):
    if not os.path.isdir(path):
        print(f"ERROR:  {path} does not exist")
        return -1
    os.chdir(path)
    p = subprocess.run(f"terraform show {tmp_plan(base_dir)}/{tf_component}-{tf_env}.tfplan", capture_output=True, shell=True)
    print(f"*************** TERRAFORM SHOW *******************")
    print(f"      At environment: {tf_component}/{tf_env}   ")
    print(f"**************************************************")
    print(p.stdout.decode())
    print(p.stderr.decode())
    p.check_returncode()
    os.chdir(base_dir)
    return 0

def tf_validate(path, tf_env, policy_file_path, tf_component, base_dir, policy_type, project_id):
    if not policy_file_path:
        print("no policy repo found! Check the argument provided for policysource to this script.")
        return 0
    if not os.path.isdir(path):
        print(f"ERROR:  {path} does not exist")
        return -1

    os.chdir(path)
    terraform_show = subprocess.run(f"terraform show -json {tmp_plan(base_dir)}/{tf_component}-{tf_env}.tfplan > {tf_env}.json", capture_output=True, shell=True)
    terraform_show.check_returncode()
    if policy_type == "CLOUDSOURCE":
        file_lock.acquire()
        if not os.path.isdir(policy_file_path) or not os.listdir(policy_file_path):
            subprocess.run(f"gcloud source repos clone gcp-policies {policy_file_path} --project={project_id}", capture_output=True, shell=True)
            curr_dir = os.getcwd()
            os.chdir(policy_file_path)
            # read the branch name from git's output, not the CompletedProcess object
            current_branch = subprocess.run("git symbolic-ref --short HEAD", capture_output=True, shell=True, text=True).stdout.strip()
            if current_branch != "main":
                subprocess.run("git checkout main", capture_output=True, shell=True)
            os.chdir(curr_dir)
        file_lock.release()
    validation = subprocess.run(f"gcloud beta terraform vet {tf_env}.json --policy-library={policy_file_path} --project={project_id}", capture_output=True, shell=True)
    print(f"*************** TERRAFORM VALIDATE ******************")
    print(f"      At environment: {tf_component}/{tf_env} ")
    print(f"      Using policy from: {policy_file_path} ")
    print(f"*****************************************************")
    print(validation.stdout.decode())
    print(validation.stderr.decode())
    validation.check_returncode()
    os.chdir(base_dir)
    return 0

def tf_loop_through_one(env_path, env, policysource, component, environments_regex, base_dir, policy_type, project_id, apply=False):
    if re.match(environments_regex, env):
        tf_init(env_path, env, component, base_dir)
        tf_plan(env_path, env, component, base_dir)
        tf_validate(env_path, env, policysource, component, base_dir, policy_type, project_id)
        if apply:
            tf_apply(env_path, env, component, base_dir)
    else:
        print(f"{component}/{env} doesn't match {environments_regex}; skipping")

def tf_loop_through_all(apply=False, environment=None):
    return_values: list[Future] = []
    processes_to_environment: dict[Future, str] = {}
    with ProcessPoolExecutor(max_workers=60) as pool:
        # walk every component directory and queue each of its environments as one task
        for component_path in glob.glob(os.path.join(base_dir, '*', '')):
            component = os.path.basename(component_path[:-1])
            if component in ["modules", "tmp_plan"]:
                continue
            for env_path in glob.glob(component_path + '/*/'):
                env = os.path.basename(env_path[:-1])
                if environment and env != environment:
                    continue
                print(f"queueing up {component}/{env}")
                ret = pool.submit(tf_loop_through_one, env_path, env, policysource, component, environments_regex, base_dir, policy_type, project_id, apply)
                processes_to_environment[ret] = f"{component}/{env}"
                return_values.append(ret)
        print('waiting for processes')
    print('done, checking for exceptions')
    for ret in return_values:
        if ret.exception() is not None:
            print(f"ERROR: exception occurred in {processes_to_environment[ret]}")
            print(ret.exception())
    return 1 if any(ret.exception() is not None for ret in return_values) else 0

def tf_single_action():
    for component in os.listdir(base_dir):
        component_path = os.path.join(base_dir, component)
        # skip plain files and the helper directories
        if not os.path.isdir(component_path) or component in ["modules", "tmp_plan"]:
            continue
        for env in os.listdir(component_path):
            env_path = os.path.join(component_path, env)
            if env == branch or (env == "shared" and branch == "production"):
                match action:
                    case "apply":
                        return tf_apply(env_path, env, component, base_dir)
                    case "init":
                        return tf_init(env_path, env, component, base_dir)
                    case "plan":
                        return tf_plan(env_path, env, component, base_dir)
                    case "show":
                        return tf_show(env_path, env, component, base_dir)
                    case "validate":
                        return tf_validate(env_path, env, policysource, component, base_dir, policy_type, project_id)
                    case _:
                        print(f"unknown option: {action}")
                        return 99
            else:
                print(f"{env} doesn't match {branch}; skipping")
    return 0

if __name__ == "__main__":
    match action:
        case "init" | "plan" | "apply" | "show" | "validate":
            sys.exit(tf_single_action())
        case "plan_validate_all":
            sys.exit(tf_loop_through_all())
        case "plan_validate_apply_env":
            sys.exit(tf_loop_through_all(apply=True, environment=branch))
        case "plan_validate_apply_all":
            sys.exit(tf_loop_through_all(apply=True))
        case _:
            print(f"unknown option: {action}")
            sys.exit(99)
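
For reference, it takes the same positional arguments as the original tf-wrapper.sh (action, branch, policy source path, project ID and policy type), so a run from the 4-projects root would look something like this (the script name, paths and project ID are placeholders):

python3 tf_wrapper.py plan_validate_all "" ../gcp-policies my-cloudbuild-project CLOUDSOURCE
python3 tf_wrapper.py plan_validate_apply_env production ../gcp-policies my-cloudbuild-project CLOUDSOURCE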

It is great for running things more quickly, but I am worried that it only "kicks the can down the road", so to speak.

eeaton commented 1 month ago

Hi @Peedee2002, here are a few options you might also consider, although the right choice for your circumstances will depend on your operational requirements, such as which capabilities your organization is willing to delegate to decentralized workload teams versus keeping within a centralized platform team.

Hope that helps. I'll close this issue for now but feel free to re-open if needed.

eeaton commented 1 month ago

One other tactic that a colleague suggested: