seqeralabs / cx-field-tools-installer

Unofficial Terraform solution to help clients install Seqera Platform
Apache License 2.0

Fortify pre-release testing regime #49

Open gwright99 opened 4 months ago

gwright99 commented 4 months ago

Background

Official Seqera deployment guidance for Tower implementations is that clients should use managed database & redis instances (i.e. RDS and ElastiCache) to ensure better data protection and application stability. This is a good idea.

The use of managed instances is problematic in testing scenarios, however, as it introduces significant delays into the testing cycle (~10+ minutes for stand-up, and often longer for teardown due to in-place modifications needed to remove teardown protections on the database). Furthermore, each testing cycle that uses managed instances incurs real-world AWS charges.

Given these realities, current internal testing tends to favour deployments which use the containerized db & redis. While this is often not a problem (e.g. the database implementation doesn't matter if you are testing conditional logic re: how the tower.yml is populated), it means the RDS/ElastiCache logic is exercised less on a daily basis, so bugs can be unintentionally introduced and not caught until later.

The simplest way to make the project more robust is to introduce a CICD solution which can run automated testing/checks against PRs and the master branch. This would remove the cognitive burden on project resources to manually try alternative configurations, and would allow us to publish test results (which would help us support clients installing in more regulated / qualified environments).

Unfortunately, the introduction of CICD doesn't solve the underlying problems of:

  1. Where are you going to test?
  2. How much is it going to cost?
  3. How long is it going to take?

Proposal

No matter what, we always need to run final sanity checks in a real AWS account to ensure things truly work as expected in a real cloud environment. But we don't need to do 100% of our testing in AWS.

We should consider making better use of a localized testing harness to minimize costs and speed up testing execution. In the case of AWS, the use of Localstack (Pro) can provide much of the functionality we need on a daily basis (Terraform support, AWS API compliance, spawning of containers emulating the docker-compose host and associated Tower credentials).

I think it would be worthwhile to get an end-to-end process fully working without requiring too much deviation from the official project (Localstack Pro still has some limitations which prevent 100% apples-to-apples parity).
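
As a rough illustration, a local smoke test could point an AWS SDK client at Localstack rather than real AWS. This is a minimal sketch only, assuming Localstack's default edge endpoint on localhost:4566 and the dummy credentials it accepts; the bucket name is purely illustrative.

```python
# Minimal sketch: point boto3 at a Localstack edge endpoint instead of real AWS.
# Assumes Localstack's default edge port (4566) and dummy credentials;
# the bucket name is purely illustrative.
import boto3


def localstack_client(service: str):
    """Return a boto3 client wired to a locally-running Localstack instance."""
    return boto3.client(
        service,
        endpoint_url="http://localhost:4566",
        region_name="us-east-1",
        aws_access_key_id="test",
        aws_secret_access_key="test",
    )


if __name__ == "__main__":
    s3 = localstack_client("s3")
    s3.create_bucket(Bucket="cx-installer-smoke-test")  # created in Localstack only
    print([b["Name"] for b in s3.list_buckets()["Buckets"]])
```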

Counter Arguments

A few arguments can be made against this effort. I have thoughts on each but won't comment yet as I'd like to see what others say:

  1. This is supposed to be a field tool, why are we adding extra layers of complexity into a "use at your own discretion" tool?

  2. Manual testing is fine - the solution is not big enough to warrant a more complex solution.

  3. (Assuming use of localstack) the localstack solution just isn't comparable enough to real AWS to be worth investing in.

Next Steps

  1. Discuss this topic with CX resources & Seqera leadership to determine our desired approach.

  2. Implement trial effort to assess feasibility.

gwright99 commented 4 months ago

The opening of this ticket was a bit of an after-the-fact thing, as I've already been working on getting a Localstack implementation running and almost have the whole thing working on a local VM (e2e deployment takes ~1 minute instead of the ~15 minutes it takes against real AWS).

There are still a few final problems to solve (e.g. Localstack uses docker-in-docker, so the "ec2" container - while able to spawn the various Tower containers - sees the Tower containers failing because they cannot find /tower.yml; and there were some problems with Ansible package installation which required baking packages into a custom image treated as the AMI).

Assuming these last few problems can be solved, we will have a working model that a simple CICD can be pointed against.

gwright99 commented 1 month ago

Update

I've given up on the Localstack solution (for now) and pivoted towards building a testing strategy that is (1) easy to implement; (2) portable across different cloud providers.

Outlining thoughts and ideas below so that @schaluva can be brought into go-forward initiatives.

(Perceived) Testing Needs

It strikes me that there are at least 4 different testing options that should be considered.

  1. Unit Testing:

    1. Test scripts used to generate content upon which other TF modules rely. Example: generate_db_connection_string.py
  2. Pre-Deployment:

    1. Verification of terraform plan (i.e. try to catch problems before time/money is spent on deployment):

      1. Verification by a 3rd-party testing framework (e.g. pytest). Requires more setup but seems more flexible for checking inter-connected resources (see the sketch after this list).

      2. Verification via terraform test. Native functionality, but IMO constrained by the limitations of HCL and resource scope.

    2. Verification of target environment.

      1. Confirming that values provided in SSM / terraform.tfvars actually make sense as a whole (e.g. subnet CIDRs fall within the CIDR range of an existing VPC).
  3. System Testing

    1. Verify that Tower instance can be reached.
    2. Verify that supporting Tower components work (groundswell, connect).
    3. Verify that compute jobs can reach Tower.
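
As a rough illustration of the pre-deployment checks above (pytest against the plan, plus inter-connected value checks like subnet-vs-VPC CIDRs), here is a minimal sketch that parses `terraform show -json` output. The resource addresses and attribute names are assumptions for illustration and do not reflect the project's actual module layout.

```python
# Sketch of a pre-deployment check run against `terraform plan` output, produced via:
#   terraform plan -out=plan.out && terraform show -json plan.out > plan.json
# Resource addresses ("aws_vpc.tower", "aws_subnet.*") are assumed for illustration.
import ipaddress
import json

import pytest


@pytest.fixture(scope="session")
def planned_resources():
    """Map resource address -> planned attribute values from the plan JSON."""
    with open("plan.json") as fh:
        plan = json.load(fh)
    return {
        rc["address"]: rc["change"]["after"]
        for rc in plan.get("resource_changes", [])
        if rc["change"].get("after") is not None
    }


def test_subnet_cidrs_inside_vpc(planned_resources):
    """Every planned subnet CIDR must sit inside the planned VPC CIDR."""
    vpc = ipaddress.ip_network(planned_resources["aws_vpc.tower"]["cidr_block"])
    subnet_cidrs = [
        values["cidr_block"]
        for address, values in planned_resources.items()
        if address.startswith("aws_subnet.")
    ]
    assert subnet_cidrs, "expected at least one planned subnet"
    for cidr in subnet_cidrs:
        assert ipaddress.ip_network(cidr).subnet_of(vpc)
```

The same fixture could back further assertions (tags, deletion protection flags, etc.) without re-running terraform plan.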

Testing Considerations

Given the technology stack in the project (Terraform, Python, Ansible), I'm thinking two pillars should underlie our testing initiative, since this allows us to reuse existing components while keeping any necessary new learning within already-known technologies:

  1. Python (pytest, home-rolled solution)
  2. Terraform

I'll open with a hot take: despite the need for some additional setup effort, I think using Python makes more sense. While terraform test is available OOTB, it feels very limited re: what it can test, how fast it can run, and how much we can do before having to perform an actual deployment (which costs money and takes time, at least until we can find a way to make Localstack work for AWS).

With that said:

  1. I found the initial setup of pytest took significant effort - dealing with basic Python import mechanisms, VSCode configuration, pytest fixtures, and pytest's syntax idiosyncrasies.

  2. Terraform test has its own idiosyncrasies, and it has the same problem that prompted me to start using Python external data scripts in the first place - HCL isn't a true programming language, which makes writing complex conditional / inter-connected resource logic difficult.

  3. There is an argument that terraform test is a bad idea in general since it tests the wrong thing. I am partial to this argument (but could be convinced by a strong argument otherwise).

  4. I like the idea of making Python the core of the testing solution, as it aligns well with my longer-term goal of "generate multiple terraform.tfvars files representing different scenarios so you can run multiple regression tests in parallel" (sketched below). I am very biased, and this strategy has not yet been proven successful in reality.
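
To make item 4 a bit more concrete, here is a hedged sketch of how scenario-specific terraform.tfvars files could be rendered from a shared base. The variable names (use_container_db, use_container_redis) are hypothetical placeholders, not the installer's real inputs.

```python
# Sketch: render several scenario-specific terraform.tfvars files from a base set of
# values so regression runs could be executed in parallel. Variable names are
# hypothetical placeholders for illustration only.
from pathlib import Path

BASE_VALUES = {
    "app_name": "tower-testing",
    "use_container_db": True,
    "use_container_redis": True,
}

SCENARIOS = {
    "all_containerized": {},
    "external_db": {"use_container_db": False},
    "external_db_and_redis": {"use_container_db": False, "use_container_redis": False},
}


def to_hcl(value):
    """Render a Python value as a terraform.tfvars literal."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return f'"{value}"'


def render_scenarios(out_dir: Path = Path("scenarios")) -> None:
    """Write one <scenario>.tfvars file per scenario, overriding the base values."""
    out_dir.mkdir(exist_ok=True)
    for name, overrides in SCENARIOS.items():
        values = {**BASE_VALUES, **overrides}
        body = "\n".join(f"{key} = {to_hcl(val)}" for key, val in values.items())
        (out_dir / f"{name}.tfvars").write_text(body + "\n")


if __name__ == "__main__":
    render_scenarios()
```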

Granted, other tools may make more sense to use (e.g. Terratest).

POC

POC branch here. Caveat: Components work but design is not clean or streamlined.

Implements 3 different testing flows:

  1. Unit testing of Python scripts (see the sketch after this list).
  2. Pre-deployment testing (via pytest) against terraform plan output.
  3. Pre-deployment testing (via terraform test).
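
For flow 1, a unit test might look like the sketch below. The script name comes from the project, but the imported function name, import path, signature, and expected outputs are assumptions made purely for illustration.

```python
# Sketch of flow 1 (unit testing the Python helper scripts). The script name
# generate_db_connection_string.py exists in the project, but the function name,
# import path, and expected outputs below are assumed for illustration.
import pytest

from scripts.generate_db_connection_string import build_connection_string  # assumed interface


@pytest.mark.parametrize(
    "host,port,expected",
    [
        ("db.internal", 3306, "db.internal:3306"),
        ("10.0.1.25", 3306, "10.0.1.25:3306"),
    ],
)
def test_build_connection_string(host, port, expected):
    # Illustrative assertion only; the real script's output format may differ.
    assert build_connection_string(host, port) == expected
```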

Quasi-Done:

  1. System testing already quasi-happens with the Terraform Installer via:
    1. SSH connection to EC2 (successful or not).
    2. Ansible execution (successful or not - success implies networking is functional).
    3. Seqerakit execution.

Missing

  1. Inter-resource verification.
  2. More in-depth system tests (a minimal reachability sketch follows below).
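
As a sketch of what a deeper system test could look like, the check below polls the deployed endpoint until it answers. The base URL and the endpoint path are assumptions for illustration, not the project's confirmed health-check route.

```python
# Sketch of a basic reachability check against a deployed instance.
# The base URL and the endpoint path are assumptions for illustration.
import time

import requests


def wait_for_tower(base_url: str, path: str = "/service-info", timeout_s: int = 300) -> bool:
    """Poll the endpoint until it returns HTTP 200 or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}{path}", timeout=10).status_code == 200:
                return True
        except requests.RequestException:
            pass  # instance not up yet; keep polling
        time.sleep(15)
    return False


if __name__ == "__main__":
    assert wait_for_tower("https://tower.example.com/api"), "Tower endpoint never became reachable"
```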

schaluva commented 1 month ago

Update

Some preliminary testing demonstrated that there is no easy solution for running tests with GHA that require the information in the tfvars file. For example:

  1. The tfvars file contains information about the AWS account & account objects which Terraform must retrieve as part of the terraform plan process. The non-templated tfvars file is only present locally, so repo-based executions must somehow source this information when running GHA tests.
  2. The SSM secrets used as part of the terraform plan process can leak into the publicly-visible GHA log.

Workaround options to make the tfvars available to GHA tests exist (e.g. save the file to cloud storage and pull it in as part of the GHA script) but - at first glance - they feel kludgy.
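
For reference, the cloud-storage workaround could look roughly like the sketch below. Bucket and key names are placeholders; credentials would be expected to come from the runner's environment (e.g. repository secrets), not from the script itself.

```python
# Sketch of the "store the tfvars in cloud storage" workaround: the GHA job would
# download the non-templated terraform.tfvars before running terraform plan.
# Bucket and key names are placeholders; AWS credentials come from the environment.
import boto3


def fetch_tfvars(bucket: str, key: str, dest: str = "terraform.tfvars") -> None:
    """Download the private tfvars file onto the runner before terraform runs."""
    boto3.client("s3").download_file(bucket, key, dest)


if __name__ == "__main__":
    fetch_tfvars("my-private-test-config-bucket", "installer/terraform.tfvars")
```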

To minimize unnecessary overhead and proceed faster to building a test suite, the development focus will shift from a remote test suite to a local one. We will consider how to develop tests locally with an eye towards supporting the same tests on GHA in the future, once a solution for handling sensitive values can be identified.

The downside of testing locally is that there is no public-facing audit of tests before changes are merged. As an intermediate step, we may consider generating and storing log files from local test runs.

gwright99 commented 6 hours ago

Scenarios to implement (organic list):