trendscenter / coinstac

Collaborative Informatics and Neuroimaging Suite Toolkit for Anonymous Computation
MIT License

Mega-Task: DevOps Future State Architecture #1329

Open rssk opened 2 years ago

rssk commented 2 years ago

Task Description

High-Level Technical Goals

Business Outcomes To Support With These Designs

Example Use Cases

Design and Architecture Sub-tasks

Requirements for Running the System on a Single EC2 Instance

The easiest way for us to get a foot in the door with CI/CD and container management, given COINSTAC's unique requirements, is to have an EC2 instance that manages itself via Docker Compose.
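A minimal sketch of what that single-instance Compose setup could look like, assuming the backend services discussed later in this thread (service names, images, ports, and build contexts here are illustrative, not the repo's actual compose file):

```yaml
# Illustrative docker-compose.yml for a single 'hosting' EC2 instance
version: "3.8"
services:
  mongo:
    image: mongo:5
    volumes:
      - mongo-data:/data/db
  mqtt:
    image: eclipse-mosquitto:2
    ports:
      - "1883:1883"
  api:
    build: ./packages/coinstac-api-server   # hypothetical build context
    depends_on: [mongo]
    ports:
      - "3100:3100"                          # port number is an assumption
  pipeline:
    build: ./packages/coinstac-server        # hypothetical build context
    depends_on: [mqtt, api]
    volumes:
      # one possible approach: let the pipeline service drive the host Docker daemon
      - /var/run/docker.sock:/var/run/docker.sock
volumes:
  mongo-data:
```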

praeducer commented 2 years ago

Hey @rssk! I'm starting to position myself to work on this task. Any updates I should be aware of?

I'm thinking of exercising a mini best-practice software development lifecycle as I work through this task. It's just complex enough that I see it as worthy of an architecture diagram. Besides informing and quality-checking my plans, it will also help teach others how our system works.

praeducer commented 2 years ago

As I architect, I'll make some of the necessary technical decisions, like any subtleties around how the containers will be managed and the hardware requirements above. I can also include a total cost analysis comparing alternative architectures, like deciding where to host the services.

rssk commented 2 years ago

Yeah, definitely. My first stab at it was to have COINSTAC manage the containers as it does now, via the Docker daemon on a 'hosting' EC2 instance, which is basic but has the advantage of requiring no code changes, or at most very minimal ones. However, I am open to other management solutions. The issue is the connectivity, or really the data sharing/volume mounting needed between the pipeline and the container it interacts with; this could be reasonably complicated.


praeducer commented 2 years ago

Pain points I'd like to address as I explore a few architectural options here:

praeducer commented 2 years ago

Related tasks from our Product Roadmap:

Note to self: Need to organize the Product Roadmap better in the various repos and project boards. Too much in too many places.

praeducer commented 2 years ago

Core requirement: need to handle the application services and containerized computations.

praeducer commented 2 years ago

The CI server would be a good place to find inspiration. We could even clone it.

praeducer commented 2 years ago

Don't over-engineer, but also keep in mind long-term thinking regarding our technical debt and the ease of developing on the platform with the team composition we have.

rssk commented 2 years ago

Mission bullet points -

praeducer commented 2 years ago

OK, I think I have enough to do a first draft of some architectures. I'll start with reference designs of existing systems.

praeducer commented 2 years ago

@rssk If I find that decoupling storage and/or making it elastic is affordable and straightforward to implement, do you see benefits to that as a feature? Like with developer time, troubleshooting, extensibility, and scalability.

praeducer commented 2 years ago

This video describes an architecture that has several patterns/components relevant here: https://youtu.be/nhqcecpi47s

praeducer commented 2 years ago

I'm thinking I'll come up with a phased architecture approach. For example, we could start out with computations running in the same server as the backend services. Then we can plan to decouple them on separate servers as long as we can solve for networking communications between the containers in that state.

This will allow us to address the unique needs of computations versus our middleware. For example, the computations may benefit from elastic storage, while that may not be necessary for our backend services. Similarly, computations may need GPU support that our backend services do not.

Overall, decoupling patterns like this will make the system more reliable (i.e., resilient to failures) and take less of IT's time to update and maintain. We can address microservice issues in a more targeted manner, simplifying troubleshooting and reducing downtime.

praeducer commented 2 years ago

What microservices do we have in total running on the server in the current architecture?

  1. MongoDB
  2. MQTT
  3. nginx
  4. Node.js
  5. NPM
  6. Docker
  7. GraphQL

Custom Node.js packages:

  1. coinstac-api-server: GraphQL and MongoDB server. This is the backend API for the front-end clients.
  2. coinstac-server: The "pipeline" server. Runs remote pipelines and sends results to the api-server. This is the middleware between the backend API and the pipelines that run computations; it manages backend data processing.
  3. coinstac-ui: A webpack server used in our build step to bundle web assets.


praeducer commented 2 years ago

Requirements Analysis

  1. What services does @rssk want running on our container server? The entire COINSTAC backend running on one server or several. If several, we just need to keep in mind complexity and how we are iterating. It's the services in the current compose file minus 'ui'.
  2. From the set of services we want running on the container server, which ones are already containerized? All backend services, but keep an eye out for containerizing them better.
  3. From the set of services we want running on the container server, which ones would work better for us if they were containerized? The beta container architecture is more or less laid out. We need to change some things for production, though: production would have different build scripts. They are not all containerized in production, just on this testing/CI server.
  4. What unique requirements do each of these containers have? Focus on storage, compute, memory, security, and scalability/elasticity. Most are really simple, with pretty low requirements (like 8 GB of RAM, 3 or 4 CPUs, and 20 GB of disk; we could look up benchmarks here since it is mostly regular Node web app stuff). The pipeline service is the most important to address: if two computations run at the same time, we'd need a massive amount of RAM (like VBM failing with less than 20 GB). We haven't really run into limitations here yet (adoption and stability are more pressing). Decoupling the pipeline manager could be the next iteration of this system.
  5. What service manages the containerized computations? What unique privileges does it have? The coinstac-server package consumes coinstac-pipeline to manage the computations. 'server' manages both the 'pipeline' package and the '(container)-manager' package. 'pipeline' just deals with the pipeline implementation but doesn't know anything about the containers themselves. 'manager' actually manages the containers: starting, stopping, and communications are examples. They pass around async functions.
  6. What services running on this server talk to other local services on this server? How do they interface exactly (like what messages will they pass around and why)? All computation containers need access to the file system.
  7. What services living outside of this server need to communicate to the services on this server? How do they interface exactly (like what messages will they pass around and why)? This will help design our security and networking infrastructure.

Need to talk to outside world:

COINSTAC server is the big dog running things. Study this to learn more: https://github.com/trendscenter/coinstac/tree/master/packages/coinstac-ui/config

One way to summarize this is that we are currently deciding exactly what we are decoupling first.

Docker Compose with CircleCI is what we use for testing right now. It is not configured for production, but we could start configuring it for that. There would probably be some changes leveraging the Node package manager to get it production-ready, but we are not sure.

ci_network in the Docker Compose YAML file is used for the containers to talk to each other. There are some constraints on how we run Docker containers that we had to work around. I can research named networks in Docker and how to attach containers to them. Typically 'ui' is not on the same network as it is in the CI solution we currently have; it is like a bridge network. This context will change depending on the networking/server architecture we decide on for this task.
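For reference, a named bridge network in Compose looks roughly like this (service and image names are illustrative; the real ci_network definition in our compose file is the source of truth):

```yaml
# Sketch of a named bridge network, similar in spirit to ci_network
networks:
  ci_network:
    driver: bridge

services:
  api:
    image: coinstac-api-server   # illustrative image name
    networks: [ci_network]
  pipeline:
    image: coinstac-server       # illustrative image name
    networks: [ci_network]       # services on the same named network resolve each other by service name
```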

CircleCI: Read this: https://github.com/trendscenter/coinstac/blob/master/.circleci/config.yml. This is a hosted platform / SaaS product. No current owner. It may have SSO with GitHub. Hit up Ross if you need to be added. See the CI badge in our readme.

Ross worked on CI stuff, while Javier wrote many tests. Eduardo too.

We are currently looking for a more repeatable testing server. This could be a replacement for the dev server. Long term, we can take these learnings and apply them to production in whatever way makes sense.

IaC with Terraform is a good fit for how to deliver this solution.

praeducer commented 2 years ago

Next Steps: I need to spend more time gathering requirements to make sure I design and build the best solution. I'm going to continue requirements analysis while also exploring potential solutions. As I look at reference architectures, I typically have more questions.

praeducer commented 2 years ago

@Nonzzo Do you know of any good architectures we could reference here?

praeducer commented 1 year ago

Requirements Analysis with @dylanmartin (context from the front-end UI container implementation):

praeducer commented 1 year ago

My overall goals are becoming clearer.

A good main goal is to decouple the back-end packages from the front-end packages and then decouple the pipeline manager from the rest of the backend packages. Following that, I want to write infrastructure as code to deploy the system to multiple servers (like with the pipeline manager service running on one server and the other backend services running on another). Finally, this same code can be used to deploy the environment to dev, QAT, and prod or whatever context. We can then have multiple environments spun up that we can easily tear down and rebuild.

This will position us to separate our concerns so we can build out the most effective and efficient infrastructure, because it will be more custom-built for the workload (i.e., the services running on the machine). For example, with the pipeline manager and computations, we can leverage more elastic cloud services while placing the other backend services on a reserved instance or on-prem. We can also give the computations GPU capabilities that the other services don't need. Another related side effect is easier cost management: we pay only for what we use and have more options for where we deploy things.
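A rough Terraform sketch of that decoupled layout, assuming AWS EC2 (AMIs, instance types, and names are placeholders, not decisions):

```hcl
variable "backend_ami"  { type = string }
variable "pipeline_ami" { type = string }

provider "aws" {
  region = "us-east-1" # assumption
}

# Backend services (api-server, MongoDB, MQTT broker) on a steady, reserved-friendly instance
resource "aws_instance" "backend" {
  ami           = var.backend_ami
  instance_type = "t3.large" # placeholder sizing
  tags          = { Name = "coinstac-backend" }
}

# Pipeline manager / computations on a larger, GPU-capable instance
resource "aws_instance" "pipeline" {
  ami           = var.pipeline_ami
  instance_type = "g4dn.xlarge" # placeholder GPU-capable type
  tags          = { Name = "coinstac-pipeline" }
}
```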

praeducer commented 1 year ago

If there were a web server (maybe living in a container), the UI could interact with the pipeline manager. It seems deploying to client environments is quite different from, yet largely the same as, the server-side environments (mostly the same code and installation pattern, but not quite)... we need more similar patterns across different deployment contexts. More to learn there! It's not just about the back-end environment with dev, QAT, and prod plus some separate front-end; there is also cloud versus on-prem, and server-side versus client-side.

praeducer commented 1 year ago

These are two types of elastic storage I’m considering for the pipeline manager server:

Here is the actual feature for dynamically increasing volume size: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/requesting-ebs-volume-modifications.html

Events to hook up to: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEBS.html#volume-modification-events
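For the EBS option, growing a volume amounts to bumping the size on the volume resource (Terraform then issues the ModifyVolume call); names, sizes, and the availability zone below are illustrative:

```hcl
resource "aws_ebs_volume" "pipeline_scratch" {
  availability_zone = "us-east-1a" # assumption
  size              = 100          # GiB; increasing this triggers an EBS volume modification
  type              = "gp3"
  tags              = { Name = "coinstac-pipeline-scratch" }
}

resource "aws_volume_attachment" "pipeline_scratch" {
  device_name = "/dev/xvdf"
  volume_id   = aws_ebs_volume.pipeline_scratch.id
  instance_id = aws_instance.pipeline.id # the pipeline instance sketched above
}
```

Note that after the modification completes (which is what those events signal), the filesystem on the instance still has to be grown to use the new space.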

praeducer commented 1 year ago

Requirements analysis with @pixelsaurus:

praeducer commented 1 year ago

It would be good to diagram all of the different ways we configure this system (like installing on clients versus cloud versus on-prem versus our desktops, CPU versus GPU, etc.) and describe how the distribution of the services changes (like what services are running and where).

praeducer commented 1 year ago

Note to self: leverage on-prem when we can: https://trendscenter.github.io/wiki/

praeducer commented 1 year ago

CircleCI's heavy Terraform integration is promising:

CircleCI is a continuous integration and delivery (CI/CD) platform for automating software builds, tests, and deployments. The CI/CD paradigm establishes version control repositories as the source of truth for your deployments. It also helps teams quickly ship new features and fixes by defining pipelines that help ensure the stability and resilience of your services through testing and automation. You can build deployment pipelines of varying complexity to satisfy your organization’s requirements for production deployments.

Using Terraform to manage your infrastructure as code enables the benefits of the CI/CD workflow for infrastructure deployments. Since your infrastructure is codified, your team can collaborate and review it and deploy it using automated pipelines instead of manual orchestration. To automate Terraform operations in a remote environment, you need to configure remote state storage so Terraform can access and manage your project's state across runs.

https://learn.hashicorp.com/tutorials/terraform/circle-ci
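A minimal sketch of what a Terraform job in our .circleci/config.yml could look like (job name, image tag, and the infrastructure directory are assumptions; remote state would be configured in the Terraform backend block, not here):

```yaml
version: 2.1
jobs:
  terraform-plan:
    docker:
      - image: hashicorp/terraform:1.3 # tag is an assumption
    steps:
      - checkout
      - run:
          name: Terraform init and plan
          command: |
            cd infrastructure # hypothetical IaC directory
            terraform init -input=false
            terraform plan -input=false
workflows:
  plan:
    jobs:
      - terraform-plan
```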

praeducer commented 1 year ago

Similarly, with GitHub:

GitHub Actions add continuous integration to GitHub repositories to automate your software builds, tests, and deployments. Automating Terraform with CI/CD enforces configuration best practices, promotes collaboration and automates the Terraform workflow.

HashiCorp's "Setup Terraform" GitHub Action sets up and configures the Terraform CLI in your Github Actions workflow. This allows most Terraform commands to work exactly like they do on your local command line.

https://learn.hashicorp.com/tutorials/terraform/github-actions
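And roughly the GitHub Actions equivalent, using the setup-terraform action from that tutorial (workflow trigger and directory are assumptions):

```yaml
# .github/workflows/terraform.yml (sketch)
name: Terraform
on:
  pull_request:
    paths: ["infrastructure/**"] # hypothetical IaC directory
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
      - run: terraform init -input=false
        working-directory: infrastructure
      - run: terraform plan -input=false
        working-directory: infrastructure
```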

praeducer commented 1 year ago

General use case we'll leverage Terraform for: https://www.terraform.io/use-cases/integrate-with-existing-workflows

Nonzzo commented 1 year ago

Is this a direction we should be looking at for creating the EC2 instance to manage the CI/CD pipeline? https://www.youtube.com/watch?v=qhKbgvDNodI

Nonzzo commented 1 year ago

We could add security scanning of the Docker image to mitigate vulnerabilities and also do continuous monitoring.

praeducer commented 1 year ago

Packer sounds like a great tool to leverage: https://learn.hashicorp.com/tutorials/terraform/packer
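A hedged sketch of how Packer could bake a base AMI with Docker preinstalled, so Terraform only has to launch instances from it (region, source AMI filter, and provisioning commands are placeholders):

```hcl
packer {
  required_plugins {
    amazon = {
      source  = "github.com/hashicorp/amazon"
      version = ">= 1.0"
    }
  }
}

locals {
  timestamp = regex_replace(timestamp(), "[- TZ:]", "")
}

source "amazon-ebs" "coinstac_base" {
  region        = "us-east-1"  # assumption
  instance_type = "t3.medium"  # placeholder
  ssh_username  = "ubuntu"
  ami_name      = "coinstac-base-${local.timestamp}"
  source_ami_filter {
    filters = {
      name                = "ubuntu/images/*ubuntu-jammy-22.04-amd64-server-*"
      virtualization-type = "hvm"
    }
    owners      = ["099720109477"] # Canonical
    most_recent = true
  }
}

build {
  sources = ["source.amazon-ebs.coinstac_base"]

  provisioner "shell" {
    inline = [
      "sudo apt-get update",
      "sudo apt-get install -y docker.io", # package choice is a placeholder
    ]
  }
}
```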

praeducer commented 1 year ago

That continuous monitoring capability is pretty attractive @Nonzzo! Thanks for sharing.

praeducer commented 1 year ago

I think the main use cases for CircleCI lie in "Creation" with heavy GitHub integration and then "Orchestration" integrating Terraform and Packer. All of these are glued together with the underlying AWS infrastructure that we can also leverage.

praeducer commented 1 year ago

Tutorial with a reference architecture that covers some of our use cases and preferred tooling: https://annalach.gitbook.io/aws-terraform-workshops/

praeducer commented 1 year ago

One goal I have is to find an existing solution, like in AWS Marketplace, that is what we need and is already set up with one-click deployment. The problem with AWS Marketplace is that it is all CloudFormation. Terraform has modules we could leverage (the search "terraform circleci AWS ec2 docker module" turned up good results).
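For example, an off-the-shelf module from the public registry instead of hand-rolling the instance resource (the module exists on the Terraform Registry; the inputs and version pin shown are guesses at what we'd set):

```hcl
variable "base_ami" { type = string } # e.g. a Packer-built AMI

module "coinstac_ci_host" {
  source  = "terraform-aws-modules/ec2-instance/aws"
  version = "~> 4.0"          # version constraint is an assumption

  name          = "coinstac-ci"
  ami           = var.base_ami
  instance_type = "t3.large"  # placeholder sizing
}
```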

praeducer commented 1 year ago

Started a prioritization discussion in Slack:

[Dylan Martin] The current pain point I'm dealing with is the coupling of services. I don't know if we're encountering any roadblocks that would be solved with automated deployment at the moment.

[Ross] I think these [high-level goals] are actually more or less in order for me. Besides the IaC stuff being coupled with the deployments, I think that, while the decoupling of packages might feel more immediate to the devs working on them, the prod/etc. setups are a house of cards and need to be changed ASAP.

praeducer commented 1 year ago

There are two web socket servers listening inside each computation container. This means the Python process never exits until the computation is done, which matters if we want to run something more complex inside of it.

praeducer commented 1 year ago

An advantage of IaC in the deployment context is that it can codify all of the kinds of deployments we have. We could have a repo of deployment packages we can run for different use cases. Note: @rssk mentioned in today's meeting that he likes this idea.

praeducer commented 1 year ago

I'm currently migrating the plan laid out in this task to our wiki: https://github.com/trendscenter/coinstac/wiki/DevOps-Milestone-Planning.

Following that, I will break out the sub-tasks into Issues as necessary and tag them in the DevOps Milestone.

praeducer commented 1 year ago

@spanta28 This is what I suggest we focus on after documentation is in a really good spot.

praeducer commented 1 year ago

Need to simplify what to do and get it done. Focus on things like Dockerization and scripting. Deployment, reliability, and ease of maintenance are much more important than scaling and modernization.

Try to put more of what we have into production, like the compose setup we have for CI. A good principle could be to get more hands-off on the prod environment and make it more automated.

Managed MongoDB would be an easier win. AWS-ify compose.

praeducer commented 1 year ago

KISS this stuff and keep it within the real constraints of our system and team.

praeducer commented 1 year ago

We need to source-control things like networking and how all of our services communicate: infrastructure as code. https://www.hashicorp.com/products/terraform.
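As a concrete example of what that source-controlled networking could look like, here is a hedged Terraform sketch of an ingress rule set for the backend (ports and CIDR ranges are assumptions, not our actual configuration):

```hcl
resource "aws_security_group" "coinstac_backend" {
  name        = "coinstac-backend"
  description = "Ingress for COINSTAC backend services (sketch)"

  ingress {
    description = "MQTT from clients"
    from_port   = 1883
    to_port     = 1883
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"] # placeholder; tighten in a real config
  }

  ingress {
    description = "HTTPS API traffic"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"] # placeholder
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```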