terraform-google-modules / terraform-example-foundation

Shows how the CFT modules can be composed to build a secure cloud foundation
https://cloud.google.com/architecture/security-foundations
Apache License 2.0

Question: is there a common way to speed up pipelines? #1258

Closed: Peedee2002 closed this 1 month ago

Peedee2002 commented 1 month ago

Hi, I work for a team that manages an organisation with about 10 business units, managed under the 4-projects step. Running the build step when 4-projects is modified has been getting significantly slower as we add more projects. That makes sense, but I want to improve the situation, as the pipeline has reached 25 minutes per run for the planning step alone. Our pipeline uses the provided tf-wrapper.sh script to run the sequence of init, plan, validate and apply. Adding parallelism to the script (running all inits concurrently, then all plans concurrently, then all validates concurrently) did bring this down to under 10 minutes, which is great. However, this required us to modify tf-wrapper.sh significantly and convert it to Python, and the parallel run now holds every state lock at once, so only one plan or apply can run at a time and any other run has to wait for the locks. Is there another approach that is commonly used for managing this?
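
Concretely, the phased pattern I mean looks roughly like this. It is only a sketch: run_init, run_plan, run_validate and env_dirs are hypothetical stand-ins for the real shell-outs and the list of environment directories, not the actual functions from our script.

from concurrent.futures import ProcessPoolExecutor

def run_phase(stage_fn, env_dirs):
    # Fan a single Terraform stage out across every environment directory
    # and block until all of them finish before the next stage starts.
    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(stage_fn, d) for d in env_dirs]
    # result() re-raises any exception that occurred in a worker.
    return [f.result() for f in futures]

for stage in (run_init, run_plan, run_validate):
    run_phase(stage, env_dirs)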

This repo is modified often, usually to grant access to Google APIs or to give specific permissions to primary accounts. Are there better ways of handling these tasks?

Peedee2002 commented 1 month ago

This is the script I use (it is now in Python):

from concurrent.futures import Future, ProcessPoolExecutor
import glob
import subprocess
import os
import sys
import re
import multiprocessing

if __name__ == "__main__":
    action = sys.argv[1] if len(sys.argv) > 1 else None
    branch = sys.argv[2] if len(sys.argv) > 2 else None
    policysource = sys.argv[3] if len(sys.argv) > 3 else None
    project_id = sys.argv[4] if len(sys.argv) > 4 else None
    policy_type = sys.argv[5] if len(sys.argv) > 5 else None
    # NOTE: this lock (and the globals below) reach the worker processes via
    # fork inheritance, so this relies on the fork start method (Linux default).
    file_lock = multiprocessing.Lock()

    base_dir = os.getcwd()
    environments_regex = r"^(development|non-production|production|shared)$"

def tmp_plan(base_dir):
    return f"{base_dir}/tmp_plan"

def tf_apply(path, tf_env, tf_component, base_dir):
    if not os.path.isdir(path):
        print(f"ERROR:  {path} does not exist")
        return -1

    os.chdir(path)
    p = subprocess.run(f"terraform apply -lock-timeout=10m -input=false -auto-approve {tmp_plan(base_dir)}/{tf_component}-{tf_env}.tfplan", capture_output=True, shell=True)
    print(f"*************** TERRAFORM APPLY *******************")
    print(f"      At environment: {tf_component}/{tf_env}    ")
    print(f"***************************************************")
    print(p.stdout.decode())
    print(p.stderr.decode())
    p.check_returncode()
    os.chdir(base_dir)
    return 0

def tf_init(path, tf_env, tf_component, base_dir):
    if not os.path.isdir(path):
        print(f"ERROR:  {path} does not exist")
        return -1

    os.chdir(path)
    p = subprocess.run("terraform init -reconfigure", capture_output=True, shell=True)
    print(f"*************** TERRAFORM INIT *******************")
    print(f"      At environment: {tf_component}/{tf_env}   ")
    print(f"**************************************************")
    print(p.stdout.decode())
    print(p.stderr.decode())
    p.check_returncode()
    os.chdir(base_dir)
    return 0

def tf_plan(path, tf_env, tf_component, base_dir):
    if not os.path.isdir(tmp_plan(base_dir)):
        os.mkdir(tmp_plan(base_dir))

    if not os.path.isdir(path):
        print(f"ERROR:  {path} does not exist")
        return -1

    os.chdir(path)
    p = subprocess.run(f"terraform plan -lock-timeout=10m -input=false -out {tmp_plan(base_dir)}/{tf_component}-{tf_env}.tfplan", capture_output=True, shell=True)
    print(f"*************** TERRAFORM PLAN *******************")
    print(f"      At environment: {tf_component}/{tf_env}   ")
    print(f"**************************************************")
    print(p.stdout.decode())
    print(p.stderr.decode())
    p.check_returncode()
    os.chdir(base_dir)

    return 0

def tf_show(path, tf_env, tf_component, base_dir):
    if not os.path.isdir(path):
        print(f"ERROR:  {path} does not exist")
        return -1
    os.chdir(path)
    p = subprocess.run(f"terraform show {tmp_plan(base_dir)}/{tf_component}-{tf_env}.tfplan", capture_output=True, shell=True)
    print(f"*************** TERRAFORM SHOW *******************")
    print(f"      At environment: {tf_component}/{tf_env}   ")
    print(f"**************************************************")
    print(p.stdout.decode())
    print(p.stderr.decode())
    p.check_returncode()
    os.chdir(base_dir)
    return 0

def tf_validate(path, tf_env, policy_file_path, tf_component, base_dir, policy_type, project_id):
    if not policy_file_path:
        print("no policy repo found! Check the argument provided for policysource to this script.")
        return 0
    if not os.path.isdir(path):
        print(f"ERROR:  {path} does not exist")
        return -1

    os.chdir(path)
    terraform_show = subprocess.run(f"terraform show -json {tmp_plan(base_dir)}/{tf_component}-{tf_env}.tfplan > {tf_env}.json", capture_output=True, shell=True)
    terraform_show.check_returncode()
    if policy_type == "CLOUDSOURCE":
        file_lock.acquire()
        if not os.path.isdir(policy_file_path) or not os.listdir(policy_file_path):
            subprocess.run(f"gcloud source repos clone gcp-policies {policy_file_path} --project={project_id}", capture_output=True, shell=True)
            curr_dir = os.getcwd()
            os.chdir(policy_file_path)
            # read the branch name from git's output, not the CompletedProcess object
            current_branch = subprocess.run("git symbolic-ref --short HEAD", capture_output=True, shell=True, text=True).stdout.strip()
            if current_branch != "main":
                subprocess.run("git checkout main", capture_output=True, shell=True)
            os.chdir(curr_dir)
        file_lock.release()
    validation = subprocess.run(f"gcloud beta terraform vet {tf_env}.json --policy-library={policy_file_path} --project={project_id}", capture_output=True, shell=True)
    print(f"*************** TERRAFORM VALIDATE ******************")
    print(f"      At environment: {tf_component}/{tf_env} ")
    print(f"      Using policy from: {policy_file_path} ")
    print(f"*****************************************************")
    print(validation.stdout.decode())
    print(validation.stderr.decode())
    validation.check_returncode()
    os.chdir(base_dir)
    return 0

def tf_loop_through_one(env_path, env, policysource, component, environments_regex, base_dir, policy_type, project_id, apply=False):
    if re.match(environments_regex, env):
        tf_init(env_path, env, component, base_dir)
        tf_plan(env_path, env, component, base_dir)
        tf_validate(env_path, env, policysource, component, base_dir, policy_type, project_id)
        if apply:
            tf_apply(env_path, env, component, base_dir)
    else:
        print(f"{component}/{env} doesn't match {environments_regex}; skipping")

def tf_loop_through_all(apply=False, environment=None):
    return_values: list[Future] = []
    processes_to_environment: dict[Future, str] = {}
    with ProcessPoolExecutor(max_workers=60) as pool:
        # walk every component directory and queue each of its environments as one task
        for component_path in glob.glob(os.path.join(base_dir, '*', '')):
            component = os.path.basename(component_path[:-1])
            if component in ["modules", "tmp_plan"]:
                continue
            for env_path in glob.glob(component_path + '/*/'):
                env = os.path.basename(env_path[:-1])
                if environment and env != environment:
                    continue
                print(f"queueing up {component}/{env}")
                ret = pool.submit(tf_loop_through_one, env_path, env, policysource, component, environments_regex, base_dir, policy_type, project_id, apply)
                processes_to_environment[ret] = f"{component}/{env}"
                return_values.append(ret)
        print('waiting for processes')
    print('done, checking for exceptions')
    for ret in return_values:
        if ret.exception() is not None:
            print(f"ERROR: exception occurred in {processes_to_environment[ret]}")
            print(ret.exception())
    return 1 if any(ret.exception() is not None for ret in return_values) else 0

def tf_single_action():
    for component in os.listdir(base_dir):
        component_path = os.path.join(base_dir, component)
        # skip plain files and the helper directories
        if not os.path.isdir(component_path) or component in ["modules", "tmp_plan"]:
            continue
        for env in os.listdir(component_path):
            env_path = os.path.join(component_path, env)
            if env == branch or (env == "shared" and branch == "production"):
                match action:
                    case "apply":
                        return tf_apply(env_path, env, component, base_dir)
                    case "init":
                        return tf_init(env_path, env, component, base_dir)
                    case "plan":
                        return tf_plan(env_path, env, component, base_dir)
                    case "show":
                        return tf_show(env_path, env, component, base_dir)
                    case "validate":
                        return tf_validate(env_path, env, policysource, component, base_dir, policy_type, project_id)
                    case _:
                        print(f"unknown option: {action}")
                        return 99
            else:
                print(f"{env} doesn't match {branch}; skipping")
    return 0

if __name__ == "__main__":
    match action:
        case "init" | "plan" | "apply" | "show" | "validate":
            sys.exit(tf_single_action())
        case "plan_validate_all":
            sys.exit(tf_loop_through_all())
        case "plan_validate_apply_env":
            sys.exit(tf_loop_through_all(apply=True, environment=branch))
        case "plan_validate_apply_all":
            sys.exit(tf_loop_through_all(apply=True))
        case _:
            print(f"unknown option: {action}")
            sys.exit(99)
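
For reference, it takes the same positional arguments as the original tf-wrapper.sh (action, branch, policy source path, project ID and policy type), so a run from the 4-projects root would look something like this (the script name, paths and project ID are placeholders):

python3 tf_wrapper.py plan_validate_all "" ../gcp-policies my-cloudbuild-project CLOUDSOURCE
python3 tf_wrapper.py plan_validate_apply_env production ../gcp-policies my-cloudbuild-project CLOUDSOURCE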

It is great for running things more quickly, but I am worried that it only "kicks the can down the road", so to speak.

eeaton commented 1 month ago

Hi @Peedee2002, here are a few options you might also consider, although the right choice for your circumstances will depend on your operational requirements, such as which capabilities your organization is willing to delegate to decentralized workload teams versus keeping within a centralized platform team.

Hope that helps. I'll close this issue for now but feel free to re-open if needed.

eeaton commented 1 month ago

One other tactic that a colleague suggested: