rssk opened this issue 2 years ago
Hey @rssk! I'm starting to position myself to work on this task. Any updates I should be aware of?
I'm thinking of exercising a mini best practice software development lifecycle as I work through this task. It's just complex enough that I see it worthy of an architecture diagram. Besides informing and quality checking my plans, it will also help teach others how our system works.
As I architect, I'll make some of the necessary technical decisions like any subtleties around how the containers will be managed and the hardware requirements above. I can also include some total cost analysis between alternative architectures like deciding where to host the services.
Yeah definitely. My first stab at it was to have COINSTAC manage the containers like it does now, via the Docker daemon on a 'hosting' EC2 instance, which is basic but has the advantage of requiring no (or very minimal) code changes. However, I am open to other management solutions. The issue is the connectivity, or really the data sharing/volume mounting, needed between the pipeline and the container it interacts with; this could be reasonably complicated.
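A minimal sketch of that basic single-host setup, as I understand it. Service names, image names, and paths below are illustrative placeholders, not the real COINSTAC images:

```yaml
# docker-compose.yml sketch for a single 'hosting' EC2 instance.
# All names/paths here are hypothetical placeholders.
version: "3.8"
services:
  pipeline:
    image: coinstac/pipeline-manager:latest        # placeholder image
    volumes:
      - run-data:/coinstac/input                   # data shared with computations
      - /var/run/docker.sock:/var/run/docker.sock  # lets the pipeline drive the host Docker daemon
  computation:
    image: coinstac/computation-example:latest     # placeholder image
    volumes:
      - run-data:/input                            # same volume, mounted read/write
volumes:
  run-data:
```

The docker.sock mount is what "coinstac manages the containers via the Docker daemon" implies in practice; the shared named volume is the complicated part the comment above calls out.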
Pain points I'd like to address as I explore a few architectural options here:
Related tasks from our Product Roadmap:
Note to self: Need to organize the Product Roadmap better in the various repos and project boards. Too much in too many places.
Core requirement: need to handle the application services and containerized computations.
CI server would be a good place to find inspiration. Could even clone it.
Don't over-engineer, but also keep in mind long-term thinking regarding our technical debt and the ease of developing on the platform with the team composition we have.
Mission bullet points -
- coinstac-server needs to be able (via the Docker daemon) to pull, launch, inspect, and stop the computation containers it launches, share data with them (currently over a volume), and communicate with them (over WS).

Ok, I think I have enough to do a first draft of some architectures. I'll start with reference designs of existing systems first.
@rssk If I find that decoupling storage and/or making it elastic is affordable and straightforward to implement, do you see benefits to that as a feature? Like with developer time, troubleshooting, extensibility, and scalability.
This video describes an architecture that has several patterns/components relevant here: https://youtu.be/nhqcecpi47s
I'm thinking I'll come up with a phased architecture approach. For example, we could start out with computations running in the same server as the backend services. Then we can plan to decouple them on separate servers as long as we can solve for networking communications between the containers in that state.
This will allow us to address the unique needs of computations versus our middleware. For example, the computations may benefit from elastic storage while our backend services may not need it. Similarly, computations may need GPU support that our backend services don't.
Overall, decoupling patterns like this will make the system more reliable (i.e., resilient to failures) and take less of IT's time to update and maintain. We can address microservice issues in a more targeted manner, simplifying troubleshooting and reducing downtime.
What microservices do we have in total running on the server in the current architecture?
Custom Node.js packages:
Requirements Analysis
Need to talk to outside world:
COINSTAC server is the big dog running things. Study this to learn more: https://github.com/trendscenter/coinstac/tree/master/packages/coinstac-ui/config
One way to summarize this is that we are currently deciding exactly what we are decoupling first.
Docker Compose with CircleCI is what we use for testing right now. It is not configured for production, but we could start configuring it for that. There would probably be some changes leveraging the node package manager to get it production-ready, but I'm not sure.
ci_network in the Docker Compose yml file is used for the containers to talk to each other. There are some constraints to how we run Docker containers that we had to work around. I can research 'named networks in Docker' and how to attach containers to them. Typically 'ui' is not on the same network as it is in the CI solution we currently have. It is like a 'bridge network'. This context will change depending on the networking/server architecture we decide on for this task.
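For reference, attaching containers to a named bridge network in Compose looks roughly like this (service and image names are placeholders; only the ci_network name comes from our CI config):

```yaml
# Containers on the same user-defined bridge network can reach each
# other by service name; 'ui' could be left off the network, as in CI.
version: "3.8"
services:
  server:
    image: coinstac/server:latest    # placeholder image
    networks: [ci_network]
  pipeline:
    image: coinstac/pipeline:latest  # placeholder image
    networks: [ci_network]
networks:
  ci_network:
    driver: bridge
```

This is the standard Compose mechanism; whether we keep the CI topology for production depends on the architecture decision above.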
CircleCI: Read this: https://github.com/trendscenter/coinstac/blob/master/.circleci/config.yml. This is a hosted platform / SaaS product. No current owner. May have SSO with GitHub. Hit up Ross if you need to be added. See the CI badge in our readme.
Ross worked on CI stuff, while Javier wrote many tests. Eduardo too.
For the moment, we are looking for a more repeatable testing server. This could be a replacement for the dev server. Longer term, we can take these learnings and apply them to production in whatever way makes sense.
IaC with Terraform would be a good way to deliver this solution.
Next Steps: I need to spend more time gathering requirements to make sure I design and build the best solution. I'm going to continue requirements analysis while also exploring potential solutions. As I look at reference architectures, I typically have more questions.
@Nonzzo Do you know of any good architectures we could reference here?
Requirements Analysis with @dylanmartin (context from the front-end UI container implementation):
My overall goals are becoming more clear.
A good main goal is to decouple the back-end packages from the front-end packages and then decouple the pipeline manager from the rest of the backend packages. Following that, I want to write infrastructure as code to deploy the system to multiple servers (e.g., with the pipeline manager service running on one server and the other backend services running on another). Finally, this same code can be used to deploy the environment to dev, QAT, prod, or whatever context. We can then have multiple environments spun up that we can easily tear down and rebuild.
This will position us to separate our concerns so we can build the most effective and efficient infrastructure, because it will be more custom-built for the workload (i.e., the services running on the machine). For example, with the pipeline manager and computations, we can leverage more elastic cloud services while placing the other backend services on a reserved instance or on-prem. We can also give the computations GPU capabilities that the other services don't need. A related side effect is easier cost management: we pay only for what we use and have more options on where we deploy things.
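A rough Terraform sketch of that two-server split. Instance types, the AMI variable, and resource names are assumptions for illustration, not decisions:

```hcl
# Illustrative only: one GPU-capable instance for the pipeline
# manager/computations, one smaller instance for the other backend services.
variable "ami_id" { type = string }   # hypothetical shared base AMI

resource "aws_instance" "pipeline_manager" {
  ami           = var.ami_id
  instance_type = "g4dn.xlarge"       # GPU-capable; placeholder choice
  tags          = { Name = "coinstac-pipeline-manager" }
}

resource "aws_instance" "backend_services" {
  ami           = var.ami_id
  instance_type = "t3.large"          # placeholder choice
  tags          = { Name = "coinstac-backend" }
}
```

The point is that once the services are decoupled, sizing (GPU, elastic storage, reserved vs. on-demand) becomes a per-resource decision in code rather than one shared server spec.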
If there were a web server (maybe living in a container), the UI could interact with the pipeline manager. Deploying to client environments seems quite different from, but also similar to, the server-side environments (mostly the same code and installation pattern, but not quite)... we need more shared patterns across different deployment contexts, or something like that. More to learn there! It's not just the back-end env with dev, QAT, and prod plus some separate front-end; there is also cloud versus on-prem, and server-side versus client-side.
These are two types of elastic storage I’m considering for the pipeline manager server:
Here is the actual feature for dynamically increasing volume size: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/requesting-ebs-volume-modifications.html
Events to hook up to: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEBS.html#volume-modification-events
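From the linked docs, the relevant CLI surface is roughly the following. The volume ID and target size are placeholders:

```
# Request a size increase on an attached volume (placeholder ID/size):
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --size 200

# Poll modification progress (optimizing -> completed):
aws ec2 describe-volumes-modifications --volume-ids vol-0123456789abcdef0
```

Note the file system still has to be grown inside the instance after the volume modification completes; the events linked above are what we could hook automation onto.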
Requirements analysis with @pixelsaurus:
It would be good to diagram all of the different ways we configure this system (like installing on clients versus cloud versus on-prem versus our desktops versus CPU versus GPU, etc.) and describe how the distribution of the services changes (like what services are running and where).
Note to self: leverage on-prem when we can: https://trendscenter.github.io/wiki/
CircleCI's heavy Terraform integration is promising:
CircleCI is a continuous integration and delivery (CI/CD) platform for automating software builds, tests, and deployments. The CI/CD paradigm establishes version control repositories as the source of truth for your deployments. It also helps teams quickly ship new features and fixes by defining pipelines that help ensure the stability and resilience of your services through testing and automation. You can build deployment pipelines of varying complexity to satisfy your organization’s requirements for production deployments.
Using Terraform to manage your infrastructure as code enables the benefits of the CI/CD workflow for infrastructure deployments. Since your infrastructure is codified, your team can collaborate and review it and deploy it using automated pipelines instead of manual orchestration. To automate Terraform operations in a remote environment, you need to configure remote state storage so Terraform can access and manage your project's state across runs.
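A common way to configure that remote state is an S3 backend; the bucket, key, and table names below are placeholders:

```hcl
terraform {
  backend "s3" {
    bucket         = "coinstac-terraform-state"   # placeholder bucket name
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"                  # placeholder region
    dynamodb_table = "coinstac-terraform-locks"   # optional: state locking
    encrypt        = true
  }
}
```

With state in S3 (and locking in DynamoDB), CircleCI or any other runner can safely execute plan/apply across runs.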
Similarly, with GitHub:
GitHub Actions add continuous integration to GitHub repositories to automate your software builds, tests, and deployments. Automating Terraform with CI/CD enforces configuration best practices, promotes collaboration and automates the Terraform workflow.
HashiCorp's "Setup Terraform" GitHub Action sets up and configures the Terraform CLI in your Github Actions workflow. This allows most Terraform commands to work exactly like they do on your local command line.
https://learn.hashicorp.com/tutorials/terraform/github-actions
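Per that tutorial, a minimal workflow using the action might look like this; the workflow name and trigger are assumptions:

```yaml
# .github/workflows/terraform.yml (sketch)
name: terraform-plan
on: [pull_request]
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2   # installs the Terraform CLI
      - run: terraform init                  # needs remote state configured
      - run: terraform plan
```

This gives us a plan on every PR; a real pipeline would gate `terraform apply` behind merge to the main branch.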
General use case we'll leverage Terraform for: https://www.terraform.io/use-cases/integrate-with-existing-workflows
Is this a direction we should be looking at for creating the EC2 instance to manage the CI/CD pipeline? https://www.youtube.com/watch?v=qhKbgvDNodI
We could add security scanning of the Docker image to mitigate vulnerabilities, and also do continuous monitoring.
Packer sounds like a great tool to leverage: https://learn.hashicorp.com/tutorials/terraform/packer
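The basic Packer pattern from that tutorial: bake an AMI that Terraform then launches. The base AMI, region, and provisioning steps below are placeholders:

```hcl
# Packer HCL2 template sketch: build an AMI with Docker preinstalled.
source "amazon-ebs" "coinstac" {
  ami_name      = "coinstac-host-{{timestamp}}"
  instance_type = "t3.micro"
  region        = "us-east-1"                # placeholder region
  source_ami    = "ami-0123456789abcdef0"    # placeholder base AMI
  ssh_username  = "ubuntu"
}

build {
  sources = ["source.amazon-ebs.coinstac"]
  provisioner "shell" {
    inline = ["sudo apt-get update", "sudo apt-get install -y docker.io"]
  }
}
```

Terraform would then reference the resulting AMI ID, so new hosts come up with Docker already in place instead of installing it at boot.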
That continuous monitoring capability is pretty attractive @Nonzzo! Thanks for sharing.
I think the main use cases for CircleCI lie in "Creation" with heavy GitHub integration and then "Orchestration" integrating Terraform and Packer. All of these are glued together with the underlying AWS infrastructure that we can also leverage.
Tutorial with a reference architecture that covers some of our use cases and preferred tooling: https://annalach.gitbook.io/aws-terraform-workshops/
One goal I have is to find an existing solution, like in AWS Marketplace, that is what we need and is already set up with one-click deployment. The problem with AWS Marketplace is that it is all CloudFormation. Terraform has modules we could leverage (this came up with good search results for "terraform circleci AWS ec2 docker module").
Started a prioritization discussion in Slack:

[Dylan Martin] The current pain point I'm dealing with is the coupling of services. I don't know if we're encountering any roadblocks that would be solved with automated deployment at the moment.

[Ross] I think these [high-level goals] are actually more or less in order for me. Besides the IaC stuff being coupled with the deployments, I think that, while the decoupling of packages might feel more immediate to the devs working on them, the prod/etc. setups are a house of cards and need to be changed ASAP.
There are two WebSocket servers listening inside each computation container, which means the Python process never exits until the computation is done. That is something to keep in mind if we want to run something more complex inside that container.
An advantage of IaC in the deployment context is it can codify all of the kinds of deployments we have. We could have a repo of deployment packages we can run for different use cases. Note: @rssk mentioned in today's meeting he likes this idea.
I'm currently migrating the plan laid out in this task to our wiki: https://github.com/trendscenter/coinstac/wiki/DevOps-Milestone-Planning.
Following that, I will break out the sub-tasks into Issues as necessary and tag them in the DevOps Milestone.
@spanta28 This is what I suggest we focus on after documentation is in a really good spot.
Need to simplify what to do and get it done. Focus on things like Dockerization and scripting. Deployment and reliability and ease of maintenance are much more important than scaling and modernization.
Try to put more of what we have into production, like the Compose setup we have for CI. A good principle could be to get more hands-off with the prod env and make it more automated.
Managed MongoDB would be an easier win. AWS-ify compose.
KISS this stuff and keep it within the real constraints of our system and team.
Need to source control things like networking and how all of our services communicate. Infrastructure as code. https://www.hashicorp.com/products/terraform.
Task Description
High-Level Technical Goals
Business Outcomes To Support With These Designs
Example Use Cases
Design and Architecture Sub-tasks
Requirements for Running the System on a Single EC2 Instance
The easiest way for us to get a foot in the door with CI/CD and container management, given COINSTAC's unique requirements, is to have an EC2 instance that manages the system itself via Compose.
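A sketch of what that could look like in Terraform, bootstrapping Compose on the instance via user data. The AMI, instance type, and the assumption that a production compose file lives in the repo root are all placeholders:

```hcl
resource "aws_instance" "coinstac_host" {
  ami           = "ami-0123456789abcdef0"   # placeholder AMI
  instance_type = "t3.xlarge"               # placeholder size

  user_data = <<-EOF
    #!/bin/bash
    # Install Docker + Compose and bring the stack up on first boot.
    apt-get update && apt-get install -y docker.io docker-compose
    git clone https://github.com/trendscenter/coinstac.git /opt/coinstac
    cd /opt/coinstac && docker-compose up -d   # assumes a prod compose file exists
  EOF

  tags = { Name = "coinstac-single-host" }
}
```

This is the single-instance phase; the later phases above (separate pipeline-manager server, elastic storage, GPU) would grow out of this same codebase.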