trendscenter / coinstac

Collaborative Informatics and Neuroimaging Suite Toolkit for Anonymous Computation
MIT License

Mega-Task: DevOps Future State Architecture #1329

Open rssk opened 2 years ago

rssk commented 2 years ago

Task Description

High-Level Technical Goals

Business Outcomes To Support With These Designs

Example Use Cases

Design and Architecture Sub-tasks

Requirements for Running the System on a Single EC2 Instance

The easiest way for us to get a foot in the door with CI/CD and container management, given COINSTAC's unique requirements, is to have an EC2 instance that manages itself via Docker Compose.
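A minimal sketch of what that single-instance Compose setup could look like, assuming the backend services discussed later in this thread (service names, images, ports, and build contexts here are illustrative, not the repo's actual compose file):

```yaml
# Illustrative docker-compose.yml for a single 'hosting' EC2 instance
version: "3.8"
services:
  mongo:
    image: mongo:5
    volumes:
      - mongo-data:/data/db
  mqtt:
    image: eclipse-mosquitto:2
    ports:
      - "1883:1883"
  api:
    build: ./packages/coinstac-api-server   # hypothetical build context
    depends_on: [mongo]
    ports:
      - "3100:3100"                          # port number is an assumption
  pipeline:
    build: ./packages/coinstac-server        # hypothetical build context
    depends_on: [mqtt, api]
    volumes:
      # one possible approach: let the pipeline service drive the host Docker daemon
      - /var/run/docker.sock:/var/run/docker.sock
volumes:
  mongo-data:
```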

praeducer commented 2 years ago

Hey @rssk! I'm starting to position myself to work on this task. Any updates I should be aware of?

I'm thinking of exercising a mini best-practice software development lifecycle as I work through this task. It's just complex enough that I see it as worthy of an architecture diagram. Besides informing and quality-checking my plans, it will also help teach others how our system works.

praeducer commented 2 years ago

As I architect, I'll make some of the necessary technical decisions, like any subtleties around how the containers will be managed and the hardware requirements above. I can also include a total cost analysis comparing alternative architectures, like deciding where to host the services.

rssk commented 2 years ago

Yeah, definitely. My first stab at it was to have COINSTAC manage the containers as it does now, via the Docker daemon on a 'hosting' EC2 instance, which is basic but has the advantage of requiring no code changes, or at most very minimal ones. However, I am open to other management solutions. The issue is the connectivity, or really the data sharing/volume mounting needed between the pipeline and the container it interacts with; this could be reasonably complicated.


praeducer commented 2 years ago

Pain points I'd like to address as I explore a few architectural options here:

praeducer commented 2 years ago

Related tasks from our Product Roadmap:

Note to self: Need to organize the Product Roadmap better in the various repos and project boards. Too much in too many places.

praeducer commented 2 years ago

Core requirement: need to handle the application services and containerized computations.

praeducer commented 2 years ago

The CI server would be a good place to find inspiration. We could even clone it.

praeducer commented 2 years ago

Don't over-engineer, but also keep in mind long-term thinking regarding our technical debt and the ease of developing on the platform with the team composition we have.

rssk commented 2 years ago

Mission bullet points -

praeducer commented 2 years ago

OK, I think I have enough to do a first draft of some architectures. I'll start with reference designs of existing systems.

praeducer commented 2 years ago

@rssk If I find that decoupling storage and/or making it elastic is affordable and straightforward to implement, do you see benefits to that as a feature? Like with developer time, troubleshooting, extensibility, and scalability.

praeducer commented 2 years ago

This video describes an architecture that has several patterns/components relevant here: https://youtu.be/nhqcecpi47s

praeducer commented 2 years ago

I'm thinking I'll come up with a phased architecture approach. For example, we could start out with computations running in the same server as the backend services. Then we can plan to decouple them on separate servers as long as we can solve for networking communications between the containers in that state.

This will allow us to address the unique needs of computations versus our middleware. For example, the computations may benefit from elastic storage, while that may not be necessary for our backend services. Similarly, computations may need GPU support that our backend services do not.

Overall, decoupling patterns like this will make the system more reliable (i.e., resilient to failures) and take less of IT's time to update and maintain. We can address microservice issues in a more targeted manner, simplifying troubleshooting and reducing downtime.

praeducer commented 2 years ago

What microservices do we have in total running on the server in the current architecture?

  1. MongoDB
  2. MQTT
  3. nginx
  4. Node.js
  5. NPM
  6. Docker
  7. GraphQL

Custom Node.js packages:

  1. coinstac-api-server: GraphQL and MongoDB server. This is the backend API for the front-end clients.
  2. coinstac-server: The "pipeline" server. Runs remote pipelines and sends results to the api-server. This is the middleware between the backend API and the pipelines that run computations; it manages backend data processing.
  3. coinstac-ui: A webpack server used in our build step to bundle web assets.


praeducer commented 2 years ago

Requirements Analysis

  1. What services does @rssk want running on our container server? The entire COINSTAC backend running on one server or several. If several, we just need to keep in mind complexity and how we are iterating. It's the services in the current compose file minus 'ui'.
  2. From the set of services we want running on the container server, which ones are already containerized? All backend services, but keep an eye out for containerizing them better.
  3. From the set of services we want running on the container server, which ones would work better for us if they were containerized? The beta container architecture is more or less laid out. We need to change some things for production, though: production would have different build scripts. They are not all containerized in production, just on this testing/CI server.
  4. What unique requirements do each of these containers have? Focus on storage, compute, memory, security, and scalability/elasticity. Most are really simple, with pretty low requirements (like 8 GB of RAM, 3 or 4 CPUs, and 20 GB of disk; we could look up benchmarks here since it is mostly regular Node web app stuff). The pipeline service is the most important to address: if two computations run at the same time, we'd need a massive amount of RAM (like VBM failing with less than 20 GB). We haven't really run into limitations here yet (adoption and stability are more pressing). Decoupling the pipeline manager could be the next iteration of this system.
  5. What service manages the containerized computations? What unique privileges does it have? The coinstac-server package consumes coinstac-pipeline to manage the computations. 'server' manages both the 'pipeline' package and the '(container)-manager' package. 'pipeline' just deals with the pipeline implementation but doesn't know anything about the containers themselves. 'manager' actually manages the containers: starting, stopping, and communications are examples. They pass around async functions.
  6. What services running on this server talk to other local services on this server? How do they interface exactly (like what messages will they pass around and why)? All computation containers need access to the file system.
  7. What services living outside of this server need to communicate to the services on this server? How do they interface exactly (like what messages will they pass around and why)? This will help design our security and networking infrastructure.

Need to talk to outside world:

COINSTAC server is the big dog running things. Study this to learn more: https://github.com/trendscenter/coinstac/tree/master/packages/coinstac-ui/config

One way to summarize this is that we are currently deciding exactly what we are decoupling first.

Docker Compose with CircleCI is what we use for testing right now. It is not configured for production, but we could start configuring it for that. There would probably be some changes leveraging the Node package manager to get it production-ready, but we are not sure.

ci_network in the Docker Compose YAML file is used for the containers to talk to each other. There are some constraints on how we run Docker containers that we had to work around. I can research named networks in Docker and how to attach containers to them. Typically 'ui' is not on the same network as it is in the CI solution we currently have; it is like a bridge network. This context will change depending on the networking/server architecture we decide on for this task.
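For reference, a named bridge network in Compose looks roughly like this (service and image names are illustrative; the real ci_network definition in our compose file is the source of truth):

```yaml
# Sketch of a named bridge network, similar in spirit to ci_network
networks:
  ci_network:
    driver: bridge

services:
  api:
    image: coinstac-api-server   # illustrative image name
    networks: [ci_network]
  pipeline:
    image: coinstac-server       # illustrative image name
    networks: [ci_network]       # services on the same named network resolve each other by service name
```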

CircleCI: Read this: https://github.com/trendscenter/coinstac/blob/master/.circleci/config.yml. This is a hosted platform / SaaS product. No current owner. It may have SSO with GitHub. Hit up Ross if you need to be added. See the CI badge in our readme.

Ross worked on CI stuff, while Javier wrote many tests. Eduardo too.

We are currently looking for a more repeatable testing server. This could be a replacement for the dev server. Long term, we can take these learnings and apply them to production in whatever way makes sense.

IaC with Terraform is a good fit for how to deliver this solution.

praeducer commented 2 years ago

Next Steps: I need to spend more time gathering requirements to make sure I design and build the best solution. I'm going to continue requirements analysis while also exploring potential solutions. As I look at reference architectures, I typically have more questions.

praeducer commented 2 years ago

@Nonzzo Do you know of any good architectures we could reference here?

praeducer commented 1 year ago

Requirements Analysis with @dylanmartin (context from the front-end UI container implementation):

praeducer commented 1 year ago

My overall goals are becoming clearer.

A good main goal is to decouple the back-end packages from the front-end packages and then decouple the pipeline manager from the rest of the backend packages. Following that, I want to write infrastructure as code to deploy the system to multiple servers (like with the pipeline manager service running on one server and the other backend services running on another). Finally, this same code can be used to deploy the environment to dev, QAT, and prod or whatever context. We can then have multiple environments spun up that we can easily tear down and rebuild.

This will position us to separate our concerns so we can build out the most effective and efficient infrastructure, because it will be more custom-built for the workload (i.e., the services running on the machine). For example, with the pipeline manager and computations, we can leverage more elastic cloud services while placing the other backend services on a reserved instance or on-prem. We can also give the computations GPU capabilities that the other services don't need. Another related side effect is easier cost management: we pay only for what we use and have more options for where we deploy things.
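A rough Terraform sketch of that decoupled layout, assuming AWS EC2 (AMIs, instance types, and names are placeholders, not decisions):

```hcl
variable "backend_ami"  { type = string }
variable "pipeline_ami" { type = string }

provider "aws" {
  region = "us-east-1" # assumption
}

# Backend services (api-server, MongoDB, MQTT broker) on a steady, reserved-friendly instance
resource "aws_instance" "backend" {
  ami           = var.backend_ami
  instance_type = "t3.large" # placeholder sizing
  tags          = { Name = "coinstac-backend" }
}

# Pipeline manager / computations on a larger, GPU-capable instance
resource "aws_instance" "pipeline" {
  ami           = var.pipeline_ami
  instance_type = "g4dn.xlarge" # placeholder GPU-capable type
  tags          = { Name = "coinstac-pipeline" }
}
```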

praeducer commented 1 year ago

If there were a web server (maybe living in a container), the UI could interact with the pipeline manager. It seems deploying to client environments is quite different from, yet largely the same as, the server-side environments (mostly the same code and installation pattern, but not quite)... we need more similar patterns across different deployment contexts. More to learn there! It's not just about the back-end environment with dev, QAT, and prod plus some separate front-end; there is also cloud versus on-prem, and server-side versus client-side.

praeducer commented 1 year ago

These are two types of elastic storage I’m considering for the pipeline manager server:

Here is the actual feature for dynamically increasing volume size: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/requesting-ebs-volume-modifications.html

Events to hook up to: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEBS.html#volume-modification-events
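For the EBS option, growing a volume amounts to bumping the size on the volume resource (Terraform then issues the ModifyVolume call); names, sizes, and the availability zone below are illustrative:

```hcl
resource "aws_ebs_volume" "pipeline_scratch" {
  availability_zone = "us-east-1a" # assumption
  size              = 100          # GiB; increasing this triggers an EBS volume modification
  type              = "gp3"
  tags              = { Name = "coinstac-pipeline-scratch" }
}

resource "aws_volume_attachment" "pipeline_scratch" {
  device_name = "/dev/xvdf"
  volume_id   = aws_ebs_volume.pipeline_scratch.id
  instance_id = aws_instance.pipeline.id # the pipeline instance sketched above
}
```

Note that after the modification completes (which is what those events signal), the filesystem on the instance still has to be grown to use the new space.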

praeducer commented 1 year ago

Requirements analysis with @pixelsaurus:

praeducer commented 1 year ago

It would be good to diagram all of the different ways we configure this system (like installing on clients versus cloud versus on-prem versus our desktops, CPU versus GPU, etc.) and describe how the distribution of the services changes (like what services are running and where).

praeducer commented 1 year ago

Note to self: leverage on-prem when we can: https://trendscenter.github.io/wiki/

praeducer commented 1 year ago

CircleCI's heavy Terraform integration is promising:

CircleCI is a continuous integration and delivery (CI/CD) platform for automating software builds, tests, and deployments. The CI/CD paradigm establishes version control repositories as the source of truth for your deployments. It also helps teams quickly ship new features and fixes by defining pipelines that help ensure the stability and resilience of your services through testing and automation. You can build deployment pipelines of varying complexity to satisfy your organization’s requirements for production deployments.

Using Terraform to manage your infrastructure as code enables the benefits of the CI/CD workflow for infrastructure deployments. Since your infrastructure is codified, your team can collaborate and review it and deploy it using automated pipelines instead of manual orchestration. To automate Terraform operations in a remote environment, you need to configure remote state storage so Terraform can access and manage your project's state across runs.

https://learn.hashicorp.com/tutorials/terraform/circle-ci
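A minimal sketch of what a Terraform job in our .circleci/config.yml could look like (job name, image tag, and the infrastructure directory are assumptions; remote state would be configured in the Terraform backend block, not here):

```yaml
version: 2.1
jobs:
  terraform-plan:
    docker:
      - image: hashicorp/terraform:1.3 # tag is an assumption
    steps:
      - checkout
      - run:
          name: Terraform init and plan
          command: |
            cd infrastructure # hypothetical IaC directory
            terraform init -input=false
            terraform plan -input=false
workflows:
  plan:
    jobs:
      - terraform-plan
```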

praeducer commented 1 year ago

Similarly, with GitHub:

GitHub Actions add continuous integration to GitHub repositories to automate your software builds, tests, and deployments. Automating Terraform with CI/CD enforces configuration best practices, promotes collaboration and automates the Terraform workflow.

HashiCorp's "Setup Terraform" GitHub Action sets up and configures the Terraform CLI in your Github Actions workflow. This allows most Terraform commands to work exactly like they do on your local command line.

https://learn.hashicorp.com/tutorials/terraform/github-actions
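And roughly the GitHub Actions equivalent, using the setup-terraform action from that tutorial (workflow trigger and directory are assumptions):

```yaml
# .github/workflows/terraform.yml (sketch)
name: Terraform
on:
  pull_request:
    paths: ["infrastructure/**"] # hypothetical IaC directory
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
      - run: terraform init -input=false
        working-directory: infrastructure
      - run: terraform plan -input=false
        working-directory: infrastructure
```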

praeducer commented 1 year ago

General use case we'll leverage Terraform for: https://www.terraform.io/use-cases/integrate-with-existing-workflows

Nonzzo commented 1 year ago

Is this a direction we should be looking at for creating the EC2 instance to manage the CI/CD pipeline? https://www.youtube.com/watch?v=qhKbgvDNodI

Nonzzo commented 1 year ago

We could add security scanning of the Docker image to mitigate vulnerabilities and also do continuous monitoring.

praeducer commented 1 year ago

Packer sounds like a great tool to leverage: https://learn.hashicorp.com/tutorials/terraform/packer
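A hedged sketch of how Packer could bake a base AMI with Docker preinstalled, so Terraform only has to launch instances from it (region, source AMI filter, and provisioning commands are placeholders):

```hcl
packer {
  required_plugins {
    amazon = {
      source  = "github.com/hashicorp/amazon"
      version = ">= 1.0"
    }
  }
}

locals {
  timestamp = regex_replace(timestamp(), "[- TZ:]", "")
}

source "amazon-ebs" "coinstac_base" {
  region        = "us-east-1"  # assumption
  instance_type = "t3.medium"  # placeholder
  ssh_username  = "ubuntu"
  ami_name      = "coinstac-base-${local.timestamp}"
  source_ami_filter {
    filters = {
      name                = "ubuntu/images/*ubuntu-jammy-22.04-amd64-server-*"
      virtualization-type = "hvm"
    }
    owners      = ["099720109477"] # Canonical
    most_recent = true
  }
}

build {
  sources = ["source.amazon-ebs.coinstac_base"]

  provisioner "shell" {
    inline = [
      "sudo apt-get update",
      "sudo apt-get install -y docker.io", # package choice is a placeholder
    ]
  }
}
```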

praeducer commented 1 year ago

That continuous monitoring capability is pretty attractive @Nonzzo! Thanks for sharing.

praeducer commented 1 year ago

I think the main use cases for CircleCI lie in "Creation" with heavy GitHub integration and then "Orchestration" integrating Terraform and Packer. All of these are glued together with the underlying AWS infrastructure that we can also leverage.

praeducer commented 1 year ago

Tutorial with a reference architecture that covers some of our use cases and preferred tooling: https://annalach.gitbook.io/aws-terraform-workshops/

praeducer commented 1 year ago

One goal I have is to find an existing solution, like in AWS Marketplace, that is what we need and is already set up with one-click deployment. The problem with AWS Marketplace is that it is all CloudFormation. Terraform has modules we could leverage (the search "terraform circleci AWS ec2 docker module" turned up good results).
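For example, an off-the-shelf module from the public registry instead of hand-rolling the instance resource (the module exists on the Terraform Registry; the inputs and version pin shown are guesses at what we'd set):

```hcl
variable "base_ami" { type = string } # e.g. a Packer-built AMI

module "coinstac_ci_host" {
  source  = "terraform-aws-modules/ec2-instance/aws"
  version = "~> 4.0"          # version constraint is an assumption

  name          = "coinstac-ci"
  ami           = var.base_ami
  instance_type = "t3.large"  # placeholder sizing
}
```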

praeducer commented 1 year ago

Started a prioritization discussion in Slack:

[Dylan Martin] The current pain point I'm dealing with is the coupling of services. I don't know if we're encountering any roadblocks that would be solved with automated deployment at the moment.

[Ross] I think these [high-level goals] are actually more or less in order for me. Besides the IaC stuff being coupled with the deployments, I think that, while the decoupling of packages might feel more immediate to the devs working on them, the prod/etc. setups are a house of cards and need to be changed ASAP.

praeducer commented 1 year ago

There are two web socket servers listening inside each computation container. This means the Python process never exits until the computation is done, which matters if we want to run something more complex inside of it.

praeducer commented 1 year ago

An advantage of IaC in the deployment context is that it can codify all of the kinds of deployments we have. We could have a repo of deployment packages we can run for different use cases. Note: @rssk mentioned in today's meeting that he likes this idea.

praeducer commented 1 year ago

I'm currently migrating the plan laid out in this task to our wiki: https://github.com/trendscenter/coinstac/wiki/DevOps-Milestone-Planning.

Following that, I will break out the sub-tasks into Issues as necessary and tag them in the DevOps Milestone.

praeducer commented 1 year ago

@spanta28 This is what I suggest we focus on after documentation is in a really good spot.

praeducer commented 1 year ago

Need to simplify what to do and get it done. Focus on things like Dockerization and scripting. Deployment, reliability, and ease of maintenance are much more important than scaling and modernization.

Try to put more of what we have into production, like the compose setup we have for CI. A good principle could be to get more hands-off on the prod environment and make it more automated.

Managed MongoDB would be an easier win. AWS-ify compose.

praeducer commented 1 year ago

KISS this stuff and keep it within the real constraints of our system and team.

praeducer commented 1 year ago

We need to source-control things like networking and how all of our services communicate: infrastructure as code. https://www.hashicorp.com/products/terraform.
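As a concrete example of what that source-controlled networking could look like, here is a hedged Terraform sketch of an ingress rule set for the backend (ports and CIDR ranges are assumptions, not our actual configuration):

```hcl
resource "aws_security_group" "coinstac_backend" {
  name        = "coinstac-backend"
  description = "Ingress for COINSTAC backend services (sketch)"

  ingress {
    description = "MQTT from clients"
    from_port   = 1883
    to_port     = 1883
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"] # placeholder; tighten in a real config
  }

  ingress {
    description = "HTTPS API traffic"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"] # placeholder
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```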