tieubao / til

Today I Learned. These are the things I've learned every day, organized. #til.

Gitlab Runbooks #452

Open tieubao opened 4 years ago

tieubao commented 4 years ago

https://gitlab.com/gitlab-com/runbooks

tieubao commented 4 years ago

http://opsreportcard.com/section/11

  1. Does each service have an OpsDoc?

Your DNS server dies. You rebuild it because you know how. Cool, right? You need to compile the newest version of BIND and install it, and you do it because you know how, right? When the monitoring system reports that error that happens now and then, you know how to fix it, right? You know how to do all this. Why write anything down?

Here's why:

Will you remember how to do these things 6 months from now? I find myself having to re-invent a process from scratch if I haven't done it in a few months (or sometimes just a few days!). Not only do I re-invent the process, I repeat all my old mistakes and learn from them again. What a waste of time.

Will you remember how to do these things when the pressure is on? My memory works worse during an emergency.

What about when you aren't around? How can you take a relaxing vacation if you feel burdened? You can't complain about being overworked and unable to share your work with others if you haven't created a way to share the workload.

What about new people on the team? Should they learn how to do these things by watching you do it, or can they learn on their own? If they can learn on their own and only bother you when they get stuck, it saves you time and makes you look less like the information-hoarding curmudgeon that you don't want to be. In fact, it makes people feel welcome and included if their new team has these kinds of tasks documented.

How can your manager promote you or put you on a new and more interesting project if you are the only person with certain knowledge?

Each service should have certain things documented. If each service documents them the same way, people get used to it and can find what they need more easily. I make a sub-wiki (or a mini-web site, or a Google Sites "Site") for each service:

Each of these has the same 7 tabs (some may be blank):

  • Overview: Overview of the service: what it is, why we have it, who the primary contacts are, how to report bugs, links to design docs and other relevant information.
  • Build: How to build the software that makes the service. Where to download it from, where the source code repository is, steps for building and making a package or other distribution mechanism. If it is software that you modify in any way (an open source project you contribute to or a local project), include instructions for how a new developer gets started. Ideally the end result is a package that can be copied to other machines for installation.
  • Deploy: How to deploy the software. How to build a server from scratch: RAM/disk requirements, OS version and configuration, what packages to install, and so on. If this is automated with a configuration management tool like cfengine/puppet/chef (and it should be), then say so.
  • Common Tasks: Step-by-step instructions for common things like provisioning (add/change/delete), common problems and their solutions, and so on.
  • Pager Playbook: A list of every alert your monitoring system may generate for this service and a step-by-step "what to do when..." for each of them.
  • DR: Disaster Recovery plans and procedures. If a service machine died, how would you fail over to the hot/cold spare?
  • SLA: Service Level Agreement. The (social or real) contract you make with your customers. Typically things like Uptime Goal (how many 9s), RPO (Recovery Point Objective) and RTO (Recovery Time Objective).

If this is something being developed in-house, the 8th tab would be information for the team: how to set up a development environment, how to do integration testing, how to do release engineering, and other tips that developers will need. For example, one project I'm on has a page that describes the exact steps for adding a new RPC to the system.
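To make the SLA tab's "how many 9s" concrete, here is a minimal sketch that converts an uptime goal into a yearly downtime budget. The function names are hypothetical, and it assumes a 365-day year and that "N nines" means an availability of 1 - 10^-N (so 3 nines is 99.9%).

```python
# Hypothetical helpers: turn an uptime goal in "nines" into a downtime budget.
# Assumption: N nines means availability = 1 - 10**-N (3 nines -> 99.9%).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def availability_from_nines(nines: int) -> float:
    """Return availability as a fraction, e.g. 3 nines -> 0.999."""
    return 1 - 10 ** (-nines)


def downtime_budget_minutes(nines: int, period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Minutes of downtime allowed over the period at the given number of nines."""
    return period_minutes * 10 ** (-nines)


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        budget = downtime_budget_minutes(n)
        print(f"{n} nines ({availability_from_nines(n):.5%}): {budget:.1f} min/year")
```

Under these assumptions, three nines leaves roughly 525.6 minutes (under nine hours) of downtime per year, which is a useful number to write next to the RPO and RTO.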

Be a hero and create the template for the rest of your team to use. Document a basic service like DNS to get started. Then do this for a bigger service. Create the skeleton so others can use it as a template and just fill in the missing pieces. Get in the habit of starting a new opsdoc any time you begin a new project.
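One way to create that skeleton is a small script that writes one placeholder page per tab. This is only a sketch: the directory layout, the Markdown file names, and the `create_opsdoc_skeleton` function are assumptions for illustration, not something the article prescribes (it suggests a sub-wiki or mini-web site per service).

```python
from pathlib import Path

# The seven tabs named in the article. One Markdown file per tab is an assumption.
TABS = ["Overview", "Build", "Deploy", "Common Tasks", "Pager Playbook", "DR", "SLA"]


def slug(tab: str) -> str:
    """Turn a tab title into a file-name-friendly slug, e.g. 'Pager Playbook' -> 'pager-playbook'."""
    return tab.lower().replace(" ", "-")


def create_opsdoc_skeleton(root: Path, service: str) -> list[Path]:
    """Create one placeholder page per tab under root/service.

    Existing pages are left untouched, so the script is safe to re-run.
    Returns the list of pages that were newly created.
    """
    service_dir = root / service
    service_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for tab in TABS:
        page = service_dir / f"{slug(tab)}.md"
        if not page.exists():
            page.write_text(f"# {service}: {tab}\n\nTODO: fill in.\n", encoding="utf-8")
            created.append(page)
    return created
```

For example, `create_opsdoc_skeleton(Path("opsdocs"), "dns")` would create `opsdocs/dns/overview.md` through `opsdocs/dns/sla.md`. Blank tabs stay as placeholders, which matches the "some may be blank" rule.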

tieubao commented 4 years ago

https://gist.github.com/voxxit/47e54a877bb56a8c8e3fd3492740aad2

Run Book / Operations Manual

  1. Table of Contents
  2. System Overview
    • Service Overview
    • Contributing Applications, Daemons, and Windows Services
    • Hours of Operation
    • Execution Design
    • Infrastructure and Network Design
    • Resilience, Fault Tolerance and High-Availability
    • Throttling and Partial Shutdown
    • Required Resources
    • Expected Traffic and Load
      • Hot or Peak Periods
      • Warm Periods
      • Cool or Quiet Periods
    • Environmental Differences
    • Tools
  3. Security and Access Control
  4. System Configuration
  5. Configuration Management
    • System Backup and Restore
      • Backup Requirements
        • Special Files
      • Backup Procedures
      • Restore Procedures
  6. Monitoring and Alerting
    • Error Messages
    • Events
    • Health Checks
    • Other Messages
  7. Operational Tasks
    • Deployment
    • Batch Processing
    • Power Procedures
    • Routine Checks
      • System Rebuilds
    • Troubleshooting
  8. Maintenance Tasks
    • Maintenance Procedures
      • Patching
        • Normal Cycle
        • Zero-Day Vulnerabilities
      • GMT/BST time changes
      • Cleardown Activities
        • Log Rotation
    • Testing
      • Technical Testing
      • Post-Deployment
  9. Failure and Recovery Procedures
    • Failover
    • Recovery
    • Troubleshooting Failover and Recovery
  10. Contact Details
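
An outline like this is easy to let drift, so a quick check that a draft run book still contains every top-level section can help. The sketch below is a hypothetical helper, not part of the gist: it assumes the run book is plain text or Markdown in which each section title appears on its own line, optionally prefixed by a number, a bullet, or `#` markers.

```python
import re

# Top-level sections taken from the outline above.
REQUIRED_SECTIONS = [
    "Table of Contents",
    "System Overview",
    "Security and Access Control",
    "System Configuration",
    "Configuration Management",
    "Monitoring and Alerting",
    "Operational Tasks",
    "Maintenance Tasks",
    "Failure and Recovery Procedures",
    "Contact Details",
]


def missing_sections(runbook_text: str) -> list[str]:
    """Return the required sections that never appear as a line of their own."""
    headings = set()
    for line in runbook_text.splitlines():
        # Strip leading list numbering ("3."), bullets, or '#' heading markers.
        cleaned = re.sub(r"^\s*(?:\d+\.|[#•*-]+)\s*", "", line).strip().lower()
        if cleaned:
            headings.add(cleaned)
    return [s for s in REQUIRED_SECTIONS if s.lower() not in headings]


if __name__ == "__main__":
    draft = "1. Table of Contents\n2. System Overview\n## Monitoring and Alerting\n"
    for section in missing_sections(draft):
        print("missing:", section)
```

Matching is case-insensitive but otherwise exact, so a renamed section (say, "Contacts") is reported as missing. That is a deliberate choice in this sketch to keep run books consistent across services.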