Open eze1981 opened 2 years ago
This is excellent work. I do not have the underlying fluency in the way AWS routing works on non-default VPCs - I just happen to have never touched that before - so thank you!
Some high level questions and comments for you:
Credentials and Code Stuff:
The deployment_helpers/configuration_utils.py file exists to validate credentials and provide useful feedback to the user, so you will need to update that.
This is because we have to support non-sysadmins - usually data scientists, who can Python and SSH and report problems pretty well, but may not know AWS jargon - so any additions will need some frankly-pedantic-and-annoying-to-write credential-validation code, sorry.
Code quibble:
You've removed a try-except block that handled an InvalidSecurityGroupNameException - this style is used intentionally to make the code explicit, and to make it explode with a reportable stack trace and stop if some different error occurs. I do see other places where you are testing against None - I haven't gone through in detail, but I would like you to convert to the exception-raising style wherever possible.
High level questions:
I'm using a non-default VPC simply because it is what I have available in my account; the VPC is owned by and shared from another account, and it is the one I want to use. But the routing in default and non-default VPCs is the same.
It is a best practice to keep in your public subnets (or DMZs) only the resources that need to be reached directly from the internet - for example, webservers. Databases and any other kind of background-processing server can then be buried in private subnets. That prevents exposing your database by mistake if, for example, the wrong security group is associated with it.
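A related defensive measure, sketched below under assumptions (the function name and port are illustrative, not from this repo): build the database's ingress rule so it references the webserver's security group rather than a CIDR block, which keeps the database unreachable from the internet even if it ends up in the wrong subnet.

```python
# Hypothetical helper: build the IpPermissions entry for
# ec2.authorize_security_group_ingress so the DB security group only
# accepts traffic from the webserver's security group, never 0.0.0.0/0.
def db_ingress_rule(webserver_sg_id, db_port=5432):
    return {
        "IpProtocol": "tcp",
        "FromPort": db_port,
        "ToPort": db_port,
        # Referencing a security group instead of a CIDR means only
        # instances attached to that group can reach the database.
        "UserIdGroupPairs": [{"GroupId": webserver_sg_id}],
    }

# Usage (requires AWS credentials; not run here):
# import boto3
# ec2 = boto3.client("ec2")
# ec2.authorize_security_group_ingress(
#     GroupId=db_sg_id, IpPermissions=[db_ingress_rule(webserver_sg_id)]
# )
```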
For the dashboard implementation, integration between all the system components is managed today through the database. If you want to keep it simple, what about sending updates from the data processing server to the database and logging the server's activity in the DB? Then you can easily query the database and identify whether the server was unable to make progress on some particular task in the last X minutes. It would also be convenient to improve the error handling on the data processing server to catch any exceptions and report them.
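The heartbeat idea above can be sketched like this. This is a minimal illustration using SQLite and invented table/column names, not the project's actual schema: the processing server upserts a timestamp per task, and the dashboard queries for tasks with no progress in the last X minutes.

```python
import sqlite3
import time

def init_db(conn):
    # One heartbeat row per task; illustrative schema.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS task_heartbeat ("
        " task_name TEXT PRIMARY KEY, last_seen REAL, status TEXT)"
    )

def record_heartbeat(conn, task_name, status="ok", now=None):
    # The processing server calls this whenever a task makes progress.
    now = time.time() if now is None else now
    conn.execute(
        "INSERT INTO task_heartbeat (task_name, last_seen, status)"
        " VALUES (?, ?, ?)"
        " ON CONFLICT(task_name) DO UPDATE SET"
        " last_seen=excluded.last_seen, status=excluded.status",
        (task_name, now, status),
    )

def stalled_tasks(conn, max_age_minutes, now=None):
    # The dashboard calls this to find tasks with no recent progress.
    now = time.time() if now is None else now
    cutoff = now - max_age_minutes * 60
    rows = conn.execute(
        "SELECT task_name FROM task_heartbeat WHERE last_seen < ?",
        (cutoff,),
    )
    return [r[0] for r in rows]
```

The same pattern maps onto the project's existing database with an ORM model instead of raw SQL; the point is only that staleness detection becomes a single query.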
I noticed there was a lot of effort in the project to create the AWS infrastructure directly through the boto3 API. Did you consider switching to an IaC (Infrastructure as Code) approach? Tools like AWS CDK or Terraform can be very handy for that and simplify that piece a lot. Also, if you implement that part in Terraform, it will be a bit easier to add support for GCP or Azure in the future. Are you interested in that?
Credentials and Code Stuff:
Code quibble:
That exception bit me a couple of times. I changed the way the code in cluster_management/deployment_helpers/aws/security_groups.py queries for a security group, because some parameters of the ec2.describe_security_groups API are not compatible with non-default VPCs. The current code should work in both scenarios, default and non-default VPC.
You are right; I removed the try/except in cluster_management/deployment_helpers/aws/elastic_compute_cloud.py to avoid an anti-pattern: the try/except block was being used for control flow. Now that you mention it, I realized this: https://stackoverflow.com/questions/16138232/is-it-a-good-practice-to-use-try-except-else-in-python. LOL
The way get_security_group_by_name works now, after the change mentioned above, it returns None if no security group is found. I can make it throw an InvalidSecurityGroupNameException instead, and add the try/except back in elastic_compute_cloud.py to control the program flow. I'm not a Python dev. =)
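The exception-raising variant being proposed could look roughly like this. This is a sketch, not the repo's actual implementation; the `groups` parameter stands in for the list returned by describe_security_groups.

```python
class InvalidSecurityGroupNameException(Exception):
    """Raised when no security group matches the requested name."""

def get_security_group_by_name(name, groups):
    # `groups` stands in for the SecurityGroups list from the
    # ec2.describe_security_groups response.
    for group in groups:
        if group.get("GroupName") == name:
            return group
    # Raise instead of returning None, so an unexpected "not found"
    # explodes with a reportable stack trace unless the caller
    # deliberately handles it.
    raise InvalidSecurityGroupNameException(name)

# Callers where "not found" is an expected branch handle it explicitly:
# try:
#     sg = get_security_group_by_name("my-sg", groups)
# except InvalidSecurityGroupNameException:
#     sg = create_security_group("my-sg")  # hypothetical helper
```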
(Hey, thanks for this, I've skimmed it; I'm gonna need to find some time to review. In the meantime, can you make sure you are rebased off of main? There shouldn't be conflicts.)
The current launch_script.py assumes that a default VPC is available in the AWS account. That is not always true, and it also does not allow control over the subnets where different resources will be deployed. For example, it would be useful to deploy the ElasticBeanstalk webserver in a public subnet and keep all other resources (the RDS database, manager, and worker nodes) in private subnets.
I have a version of the repo that supports deployments in a non-default VPC but it breaks compatibility with default VPC installations. =)
This is the commit on my fork that adds this capability: https://github.com/ORC-RIS/beiwe-backend/commit/9483dc41ab9838c1fcb6b814e5a996b42626e9e0
This solution depends on four new configuration settings:
I included those settings in _globalconfiguration.json file. Is this the correct place for this kind of configuration?
Also, we need to identify a mechanism to determine when a deployment should happen in a default VPC with no specification of subnets, and when it should happen in a non-default VPC with the subnets for each type of resource specified explicitly.
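One possible mechanism, sketched under assumptions: treat the presence of the subnet settings as the switch. The setting names below are placeholders for illustration, not the actual four settings from the linked commit.

```python
def deployment_mode(config):
    """Decide between default-VPC and explicit-subnet deployment.

    Placeholder setting names; a real implementation would use the
    keys actually defined in the global configuration file.
    """
    subnet_keys = ("PUBLIC_SUBNET_ID", "PRIVATE_SUBNET_IDS")
    present = [k for k in subnet_keys if config.get(k)]
    if len(present) == len(subnet_keys):
        # Non-default VPC: every resource gets an explicit subnet.
        return "explicit-subnets"
    if not present:
        # Legacy behavior: rely on the account's default VPC.
        return "default-vpc"
    # Half-configured subnets are almost certainly a mistake, so fail
    # loudly rather than guess.
    raise ValueError(f"Partial subnet configuration: only {present} set")
```

Deciding from the configuration file keeps both code paths in one script while preserving backward compatibility for default-VPC installations.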