Open eze1981 opened 2 years ago
This is excellent work. I do not have the underlying fluency in the way AWS routing works on non-default VPCs - I just happen to have never touched that before - so thank you!
Some high level questions and comments for you:
Credentials and Code Stuff:
The deployment_helpers/configuration_utils.py file exists to validate credentials and provide useful feedback to the user, so you will need to update that.
This is because we have to support non-sysadmins - usually data scientists, who can Python and SSH and report problems pretty well, but may not know AWS jargon - so any additions will need some frankly-pedantic-and-annoying-to-write credential-validation code, sorry.
Code quibble:
You've removed a try-except block that handled an InvalidSecurityGroupNameException - this style is used intentionally to make the code explicit, and to make it explode with a reportable stack trace and stop if some different error occurs. I do see other places where you are testing against None - I haven't gone through in detail, but I would like you to convert to the exception-raising style wherever possible.
High level questions:
I'm using a non-default VPC simply because it is what I have available in my account; the VPC is owned by and shared from another account, and it is the one I want to use. But the routing in default and non-default VPCs is the same.
It is a best practice to keep in your public subnets (or DMZs) only the resources that need to be reached directly from the internet - for example, webservers. Databases and any other kind of background-processing server can then be buried in private subnets. That prevents exposing your database by mistake if, for example, the wrong security group is associated with it.
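A related defensive measure, sketched below under assumptions (the function name and port are illustrative, not from this repo): build the database's ingress rule so it references the webserver's security group rather than a CIDR block, which keeps the database unreachable from the internet even if it ends up in the wrong subnet.

```python
# Hypothetical helper: build the IpPermissions entry for
# ec2.authorize_security_group_ingress so the DB security group only
# accepts traffic from the webserver's security group, never 0.0.0.0/0.
def db_ingress_rule(webserver_sg_id, db_port=5432):
    return {
        "IpProtocol": "tcp",
        "FromPort": db_port,
        "ToPort": db_port,
        # Referencing a security group instead of a CIDR means only
        # instances attached to that group can reach the database.
        "UserIdGroupPairs": [{"GroupId": webserver_sg_id}],
    }

# Usage (requires AWS credentials; not run here):
# import boto3
# ec2 = boto3.client("ec2")
# ec2.authorize_security_group_ingress(
#     GroupId=db_sg_id, IpPermissions=[db_ingress_rule(webserver_sg_id)]
# )
```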
For the dashboard implementation, integration between all the system components is managed today through the database. If you want to keep it simple, what about sending updates from the data processing server to the database and logging the server's activity in the DB? Then you can easily query the database and identify whether the server was unable to make progress on some particular task in the last X minutes. It would also be convenient to improve the error handling on the data processing server to catch any exceptions and report them.
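The heartbeat idea above can be sketched like this. This is a minimal illustration using SQLite and invented table/column names, not the project's actual schema: the processing server upserts a timestamp per task, and the dashboard queries for tasks with no progress in the last X minutes.

```python
import sqlite3
import time

def init_db(conn):
    # One heartbeat row per task; illustrative schema.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS task_heartbeat ("
        " task_name TEXT PRIMARY KEY, last_seen REAL, status TEXT)"
    )

def record_heartbeat(conn, task_name, status="ok", now=None):
    # The processing server calls this whenever a task makes progress.
    now = time.time() if now is None else now
    conn.execute(
        "INSERT INTO task_heartbeat (task_name, last_seen, status)"
        " VALUES (?, ?, ?)"
        " ON CONFLICT(task_name) DO UPDATE SET"
        " last_seen=excluded.last_seen, status=excluded.status",
        (task_name, now, status),
    )

def stalled_tasks(conn, max_age_minutes, now=None):
    # The dashboard calls this to find tasks with no recent progress.
    now = time.time() if now is None else now
    cutoff = now - max_age_minutes * 60
    rows = conn.execute(
        "SELECT task_name FROM task_heartbeat WHERE last_seen < ?",
        (cutoff,),
    )
    return [r[0] for r in rows]
```

The same pattern maps onto the project's existing database with an ORM model instead of raw SQL; the point is only that staleness detection becomes a single query.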
I noticed there was a lot of effort in the project to create the AWS infrastructure directly through the boto3 API. Did you consider switching to an IaC (Infrastructure as Code) approach? Tools like AWS CDK or Terraform can be very handy for that and simplify that piece a lot. Also, if you implement that part in Terraform, it will be a bit easier to add support for GCP or Azure in the future. Are you interested in that?
Credentials and Code Stuff:
Code quibble:
That exception bit me a couple of times. I changed the way the code in cluster_management/deployment_helpers/aws/security_groups.py queries for a security group, because some parameters of the ec2.describe_security_groups API are not compatible with non-default VPCs. The current code should work in both scenarios, default and non-default VPC.
You are right; I removed the try/except in cluster_management/deployment_helpers/aws/elastic_compute_cloud.py to avoid an anti-pattern: the try/except block was being used for control flow. Now that you mention it, I realized this: https://stackoverflow.com/questions/16138232/is-it-a-good-practice-to-use-try-except-else-in-python. LOL
The way get_security_group_by_name works now, after the change mentioned above, it returns None if no security group is found. I can make it throw an InvalidSecurityGroupNameException instead, and add the try/except back in elastic_compute_cloud.py to control the program flow. I'm not a Python dev. =)
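The exception-raising variant being proposed could look roughly like this. This is a sketch, not the repo's actual implementation; the `groups` parameter stands in for the list returned by describe_security_groups.

```python
class InvalidSecurityGroupNameException(Exception):
    """Raised when no security group matches the requested name."""

def get_security_group_by_name(name, groups):
    # `groups` stands in for the SecurityGroups list from the
    # ec2.describe_security_groups response.
    for group in groups:
        if group.get("GroupName") == name:
            return group
    # Raise instead of returning None, so an unexpected "not found"
    # explodes with a reportable stack trace unless the caller
    # deliberately handles it.
    raise InvalidSecurityGroupNameException(name)

# Callers where "not found" is an expected branch handle it explicitly:
# try:
#     sg = get_security_group_by_name("my-sg", groups)
# except InvalidSecurityGroupNameException:
#     sg = create_security_group("my-sg")  # hypothetical helper
```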
(Hey, thanks for this, I've skimmed it; I'm gonna need to find some time to review. In the meantime, can you make sure you are rebased off of main? There shouldn't be conflicts.)
The current launch_script.py assumes that a default VPC is available in the AWS account. That is not always true, and it also does not allow control over the subnets where different resources will be deployed. For example, it would be useful to deploy the ElasticBeanstalk webserver in a public subnet and keep all other resources (the RDS database, manager, and worker nodes) in private subnets.
I have a version of the repo that supports deployments in a non-default VPC but it breaks compatibility with default VPC installations. =)
This is the commit on my fork that adds this capability: https://github.com/ORC-RIS/beiwe-backend/commit/9483dc41ab9838c1fcb6b814e5a996b42626e9e0
This solution depends on four new configuration settings:
I included those settings in _globalconfiguration.json file. Is this the correct place for this kind of configuration?
Also, we need to identify a mechanism to determine when a deployment should happen in a default VPC with no specification of subnets, and when it should happen in a non-default VPC with the subnets for each type of resource specified explicitly.
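One possible mechanism, sketched under assumptions: treat the presence of the subnet settings as the switch. The setting names below are placeholders for illustration, not the actual four settings from the linked commit.

```python
def deployment_mode(config):
    """Decide between default-VPC and explicit-subnet deployment.

    Placeholder setting names; a real implementation would use the
    keys actually defined in the global configuration file.
    """
    subnet_keys = ("PUBLIC_SUBNET_ID", "PRIVATE_SUBNET_IDS")
    present = [k for k in subnet_keys if config.get(k)]
    if len(present) == len(subnet_keys):
        # Non-default VPC: every resource gets an explicit subnet.
        return "explicit-subnets"
    if not present:
        # Legacy behavior: rely on the account's default VPC.
        return "default-vpc"
    # Half-configured subnets are almost certainly a mistake, so fail
    # loudly rather than guess.
    raise ValueError(f"Partial subnet configuration: only {present} set")
```

Deciding from the configuration file keeps both code paths in one script while preserving backward compatibility for default-VPC installations.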