I've reached page 23.
Assume every point of integration will fail, e.g. no response, slow response, invalid response.
Use a REST client that allows fine-tuned control of timeouts (socket, read, etc.). Treat a REST response as data until it's confirmed to meet expectations, rather than using a client that maps responses directly onto domain objects.
To avoid cascading failures: Circuit Breakers and Timeouts for all integration points.
And always limit the time a thread can wait in resource pools.
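A quick sketch of the REST-client part of this in Python with the `requests` library (the endpoint, field names, and timeout values are made up for illustration):

```python
import requests

ORDER_SERVICE = "https://orders.example.com/api/orders/42"  # hypothetical endpoint

def fetch_order():
    try:
        # Separate connect and read timeouts; never wait forever on a socket.
        resp = requests.get(ORDER_SERVICE, timeout=(3.05, 10))
    except requests.Timeout:
        raise RuntimeError("order service timed out")  # candidate for tripping a breaker

    # Treat the response as data until it's proven to match expectations,
    # instead of mapping it straight onto domain objects.
    if resp.status_code != 200:
        raise RuntimeError(f"unexpected status {resp.status_code}")
    body = resp.json()
    if "orderId" not in body or "total" not in body:
        raise RuntimeError("response missing required fields")
    return body
```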
I've reached page 51.
TODO: Review/organize ALL the anti-stability and stability summaries at the end of each topic. The book conveniently has all the patterns and short explanations after each detailed account.
Page 92: Timeouts provide fault isolation, i.e. they prevent a fault from cascading through the system. If you don't include a timeout on every integration point, another system's problems become your problems.
Page 93: Consider creating a primitive for reusable timeouts, e.g. for querying a database, so that you only need to get the timeout logic right once.
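One way such a primitive could look (a hedged sketch; the helper name and worker-pool approach are mine, not from the book):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

_executor = ThreadPoolExecutor(max_workers=10)

def call_with_timeout(fn, seconds, *args, **kwargs):
    """Run fn in a worker thread and give up after `seconds`.

    Getting the timeout logic right once, here, means every caller
    (database queries, HTTP calls, ...) inherits the same behavior.
    """
    future = _executor.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=seconds)
    except TimeoutError:
        future.cancel()  # best effort; the worker may still be running
        raise

# Usage: rows = call_with_timeout(run_query, 2.0, "SELECT * FROM orders")
```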
Page 94:
Page 106: Fail fast - Check the availability of required resources and the state of the circuit breaker BEFORE computing anything else, and fail if any resource is not available. (I was in the wrong here on a work microservice: I wanted to replace an upfront health check with a circuit breaker, but having both is ideal. Don't even trigger the integration point if it's known upfront that the system is unavailable.)
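A sketch of that ordering (the breaker and the resource check are stand-ins I'm assuming, not code from the book):

```python
class ServiceUnavailable(Exception):
    """Raised to reject work before any expensive computation happens."""

class Breaker:
    """Tiny stand-in for a real circuit breaker."""
    def __init__(self):
        self.open = False
    def is_open(self):
        return self.open

def handle_request(request, breaker, free_db_connections, do_work):
    # Fail fast: check the breaker and required resources BEFORE computing anything.
    if breaker.is_open():
        raise ServiceUnavailable("circuit breaker open; not attempting the call")
    if free_db_connections == 0:
        raise ServiceUnavailable("no database connections available")
    return do_work(request)  # only now do the expensive part
```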
Page 113: "[full integration testing] constrains the entire company to testing only one new piece of software at a time"
Page 114: Preserve or enhance system isolation and build a test harness that substitutes for the remote end of web service calls. (E.g. socket connection: refused, listen queue, ACK then never send data, send only RESET packets, many more...)
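For example, a tiny harness for one of those misbehaviors (accept the TCP connection so the handshake completes, then never send a byte) might look like this; the port is arbitrary:

```python
import socket
import time

def never_responds(port=9999):
    """Accept connections, then sit silently so the client's read blocks.

    Point the service's integration-point URL at this harness to verify
    that read timeouts actually fire.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("127.0.0.1", port))
    server.listen(1)
    while True:
        conn, addr = server.accept()   # handshake completes (ACK)...
        print(f"accepted {addr}, now going silent")
        time.sleep(3600)               # ...but no data is ever sent
        conn.close()

if __name__ == "__main__":
    never_responds()
```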
Page 120: Health check status codes (e.g. 503) can also tell the load balancer to back off for a while because the service is under too much load.
I've reached page 129.
Page 131: Check vitals via the GUI: latency, free heap memory, active request-handling threads, active sessions. [I ~~want~~ need this monitoring]
Page 133: Team knew normal rhythm by watching GUI stats, surprising how easy it is to smell a problem.
Page 141: Survived a Black Friday disaster because of high visibility into the running system. New logging wasn't necessary, and there wasn't time to add more anyway. The solution required exercising control over the running system; they wouldn't have recovered if they'd had to reboot after every config change. [I must reboot for each config change. What are the pros/cons of this?]
Page 147: Bonnie++ (Bonnie64) to measure storage throughput.
Page 162: Build transparency into our systems. Debugging a transparent system is much easier, and it will mature faster than an opaque system.
Page 163: "Adding transparency late in the development lifecycle is about as effective as adding quality"
Page 169: Each instance should periodically emit metrics about itself via its log.
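A minimal sketch of that periodic self-reporting (the metric names and interval are mine, not from the book; only the thread count is real here):

```python
import logging
import threading

log = logging.getLogger("vitals")

def emit_vitals(interval_seconds=60):
    """Periodically log a snapshot of instance vitals.

    The thread count is real; the commented metrics would come from the
    service's own counters (requests in flight, pool usage, and so on).
    """
    log.info("vitals thread_count=%d", threading.active_count())
    # log.info("vitals active_requests=%d pool_in_use=%d", ...)
    threading.Timer(interval_seconds, emit_vitals, [interval_seconds]).start()

# Call emit_vitals() once at startup; it reschedules itself.
```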
Page 170: Health checks should be more than just "it's running".
I've reached page 181.
184: Load shedding. The ideal place is the load balancers, with a good health check on the first tier of services; return 503 when they fail health checks. Services can measure their own response time and check their own operational state to see whether requests will be answered in a timely manner, and can monitor the degree of contention for the connection pool to estimate wait times. A service can also check the response times of its dependencies; if those are too slow, its health check should report unavailable (this provides back pressure). Services should have short listen queues.
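A sketch of that self-check inside a health-check handler (the thresholds and inputs are assumptions, not from the book):

```python
# Hypothetical numbers the service tracks about itself.
MAX_POOL_WAITERS = 5        # threads allowed to queue for a connection
MAX_DEPENDENCY_P99_MS = 500

def health_status(pool_waiters, dependency_p99_ms):
    """Return (http_status, body) for the load balancer's health probe.

    Returning 503 tells the first-tier load balancer to stop sending work
    here, which provides back pressure when we or our dependencies are slow.
    """
    if pool_waiters > MAX_POOL_WAITERS:
        return 503, "connection pool contention too high"
    if dependency_p99_ms > MAX_DEPENDENCY_P99_MS:
        return 503, "downstream dependency is too slow"
    return 200, "ok"
```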
189: Service Discovery.
196: Have postmortems on successful changes. See what the "near misses" were. E.g. did someone type an incorrect command, but catch it before executing? Find out how they caught it. And what safety net could have helped stop/catch it.
205: Metrics Michael has continually found useful:
Traffic Indicators (page reqs, page reqs total, transaction counts, concurrent sessions)
Business transaction, for each type (number processed, number aborted, dollar value, transaction aging, conversion rate, completion rate)
Users (demographics, technographics, % registered, # users, usage patterns, errors encountered, successful logins, unsuccessful logins)
Resource pool health (enabled state, total resources, resources checked out, threads blocked, etc...)
Database connection health (# SQLExceptions, # queries, average response time)
Data consumption (# entities/rows present, memory/disk footprint)
Integration point health (circuit breaker state, # timeouts, # requests, avg response time, # good responses, # network errors, # protocol errors, # application errors, # concurrent requests, # concurrent request high water mark)
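For the last category in that list (integration point health), a rough sketch of what one snapshot could look like as a structure; the field names mirror the list above:

```python
from dataclasses import dataclass

@dataclass
class IntegrationPointHealth:
    """Snapshot of one integration point's health, emitted with the vitals."""
    circuit_breaker_state: str      # "closed", "open", "half-open"
    timeouts: int
    requests: int
    avg_response_ms: float
    good_responses: int
    network_errors: int
    protocol_errors: int
    application_errors: int
    concurrent_requests: int
    concurrent_high_water_mark: int
```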
Chapter 12 first page:
It isn't enough to write the code. Nothing is done until it runs in production.
239:
[In reference to long, manual deployments] Using people as if they were bots. Disrupting lives, families, sleep patterns... it was all such a waste.
246:
Between the time a developer commits code to the repository and the time it runs in production, code is a pure liability. Undeployed code is unfinished inventory. It has unknown bugs. It may break scaling or cause production downtime. It might be a great implementation of a feature nobody wants. Until you push it to production, you can't be sure. The idea of continuous deployment is to reduce that delay as much as possible to minimize the liability of undeployed code.
A vicious cycle is at play between deployment size and risk, too. [...] As the time from check-in to production increases, more changes accumulate in the deployment. A bigger deployment is definitely riskier. When those risks materialize, the most natural reaction is to add review steps as a way to mitigate future risks. But that will lengthen the commit-to-production delay, which increases risk even further!
There's only one way to break out of this cycle: internalize the motto: "If it hurts, do it more often."
254:
The third major approach is the one I like best. I call it "trickle, then batch." In this strategy, we don't apply one massive migration to all documents. Rather, we add some conditional code in the new version that migrates documents as they are touched [...]. This adds a bit of latency to each request, so it basically amortizes the batched migration time across many requests.
What about the documents that don't get touched for a long time? That's where the batch part comes in. After this has run in production for a while, you'll find that the most active documents have been updated. Now you can run a batch migration on the remainder. It's safe to run concurrently with production, because no old instances are around. (After all, the deployment finished days or weeks ago.) Once the batch migration is done, you can even push a new deployment that removes the conditional check for the old version.
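A sketch of the trickle part, assuming a document store with a per-document schema_version field (the store API and the example transformation are mine, not the book's):

```python
CURRENT_VERSION = 2

def migrate_if_needed(doc):
    """Upgrade a document lazily, the moment it is read or written."""
    if doc.get("schema_version", 1) < CURRENT_VERSION:
        # Example transformation: split a combined name field.
        first, _, last = doc.pop("full_name", "").partition(" ")
        doc["first_name"], doc["last_name"] = first, last
        doc["schema_version"] = CURRENT_VERSION
    return doc

def load_document(store, doc_id):
    doc = migrate_if_needed(store.get(doc_id))
    store.put(doc_id, doc)   # write back so the trickle makes progress
    return doc

# Later, the "batch" part sweeps whatever the trickle never touched:
# for doc_id in store.ids_with_version_below(CURRENT_VERSION):
#     load_document(store, doc_id)
```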
258:
Every application and service should include an end-to-end "health check" route. The load balancer can check that route to see if the instance is accepting work. It's also a useful thing for monitoring and debugging. A good health check page reports the application version, the runtime's version, the host's IP address, and the status of connection pools, caches, and circuit breakers.
With this kind of health check, a simple status change in the application can inform the load balancer not to send any new work to the machine. Existing requests will be allowed to complete. We can use the same flag when starting the service after pushing the code. Often considerable time elapses between when the service starts listening on a socket and when it's really ready to do work. The service should start with the "available" flag set to false so the load balancer doesn't send requests prematurely.
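A sketch of such a route using Flask (the framework choice, field values, and checks are mine; any web framework works the same way):

```python
from flask import Flask, jsonify
import platform
import socket

app = Flask(__name__)
app.config["AVAILABLE"] = False   # start "unavailable" until warm-up finishes

@app.route("/health")
def health():
    status = {
        "application_version": "1.4.2",            # stand-in; read from build info
        "runtime_version": platform.python_version(),
        "host_ip": socket.gethostbyname(socket.gethostname()),
        "connection_pools": {"db": "ok"},           # stand-in checks
        "circuit_breakers": {"payments": "closed"},
        "accepting_work": app.config["AVAILABLE"],
    }
    return jsonify(status), (200 if app.config["AVAILABLE"] else 503)

def mark_ready():
    """Call after warm-up so the load balancer starts sending traffic."""
    app.config["AVAILABLE"] = True
```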
Chapter 14 first page:
It won't come as a surprise to learn that different consumers of your service have different goals and needs. Each consuming application has its own development team that operates on its own schedule. If you want others to respect your autonomy, then you must respect theirs. That means you can't force consumers to match your release schedule. They shouldn't have to make a new release at the same time as yours just so you can change your API. That is trivially true if you provide SaaS services across the Internet, but it also holds within a single organization or across a partner channel. Trying to coordinate consumer and provider deployments doesn't scale. Follow the ripple effect from your deployment and you might find that the whole company has to upgrade at once. That means most new versions of a service should be compatible.
266:
But wait a minute! The documentation said to pass in a URL. Anything else is bad input and the behavior is undefined. It could do absolutely anything. The classic definition of "undefined behavior" for a function means it may decide to format your hard drive. It doesn't matter. As soon as the service went live, its implementation became the de facto specification.
267:
One project of mine had a shared data format used by two geographically separated teams. We discussed, negotiated, and documented a specification that we could all support. But we went a step further. As the consuming group, my team wrote FIT tests that illustrated every case in the specification. We thought of these as contract tests. That suite ran against the staging system from the other team. Just the act of writing the tests uncovered a huge number of edge cases we hadn't thought about. When almost 100 percent of the tests failed on their first run, that's when we really got specific in the spec. Once the tests all passed, we had a lot of confidence in the integration. In fact, our production deployment went very smoothly and we had no operational failures in that integration over the first year. I don't think it would have worked nearly as well if we'd had the implementing team write the tests.
279:
Early in my time on the project, I realized that the development teams were building everything to pass testing, not to run in production. Across the fifteen applications and more than five hundred integration points, every single configuration file was written for the integration-testing environment.
288:
All these rapid response actions share some common themes. First, nothing is as permanent as a temporary fix. Most of these remained in place for multiple years.
292:
Thrashing happens when your organization changes direction without taking the time to receive, process, and incorporate feedback. You may recognize it as constantly shifting development priorities or an unending series of crises.
To avoid thrashing, try to create a steady cadence of delivery and feedback. If one runs faster than the other, you could slow it down, but I wouldn't recommend it! Instead, use the extra time to find ways to speed up the other process. For example, if development moves faster than feedback, don't use the spare cycles to build dev tools that speed up deployment. Instead, build an experimentation platform to help speed up observation and decisions.
300:
A more enlightened view of efficiency looks at the process from the point of view of the work instead of the workers. An efficient value stream has a short cycle time and high throughput. This kind of efficiency is better for the bottom line than high utilization.
301:
Adaptability doesn't happen by accident. If there's a natural order to software, it's the Big Ball of Mud.* Without close attention, dependencies proliferate and coupling draws disparate systems into one brittle whole.
304:
Don't pursue microservices just because the Silicon Valley unicorns are doing it. Make sure they address a real problem you're likely to suffer. Otherwise, the operational overhead and debugging difficulty of microservices will outweigh your benefits.
306:
Suppose instead the initial fragment of JSON looked like this: {"ItemID": "https://example.com/policies/029292934"} This URL still works if we just want to use it as an opaque token to pass forward. From one perspective, it's still just a Unicode string. This URL also still works if we need to resolve it to get more information. But now our service doesn't have to bake in knowledge of the solitary authority. We can support more than one of them. By the way, using a full URL also makes integration testing easier. We no longer need "test" versions of the other services. We can supply our own test harnesses and use URLs to those instead of the production authorities.
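A small illustration of the two ways a service can treat that value (the URL and field name are from the quote; the resolving call is my sketch):

```python
import requests

item_ref = {"ItemID": "https://example.com/policies/029292934"}

# 1. Opaque token: just store it or pass it forward unchanged.
audit_record = {"policy_ref": item_ref["ItemID"]}

# 2. Dereference it when more detail is needed; no baked-in knowledge of which
#    authority issued it, and tests can point these URLs at a local harness.
def resolve(ref_url):
    resp = requests.get(ref_url, timeout=(3.05, 10))
    resp.raise_for_status()
    return resp.json()
```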
317:
As shown in the figure on page 318, a "polly proxy" can map from a client ID (whether that client is internal or external makes no difference) to a catalog ID. This way, questions of ownership and access can be factored out of the catalog service itself into a more centrally controlled location.
321:
Nouns break down. Being a "customer" isn't the defining trait of a person or company. Nobody wakes up in the morning and says, "I'm happy to be a General Mills customer!" "Customer" describes one facet of that entity. It's about how your organization relates to that entity. To your sales team, a customer is someone who might someday sign another contract. To your support organization, a customer is someone who is allowed to raise a ticket. To your accounting group, a customer is defined by a commercial relationship. Each of those groups is interested in different attributes of the customer. Each applies a different life cycle to the idea of what a customer is. Your support team doesn't want its "search by name" results cluttered up with every prospect your sales team ever pursued. Even the question, "Who is allowed to create a customer instance?" will vary.
323:
Use and abuse of identifiers causes lots of unnecessary coupling between systems. We can invert the relationship by making our service issue identifiers rather than receiving an "owner ID." And we can take advantage of the dual nature of URLs to act both as an opaque token and as an address we can dereference to get an entity.
330:
Also, as Charity Majors, CEO of Honeycomb.io, says, "If you have a wall full of green dashboards, that means your monitoring tools aren't good enough." There's always something weird going on.
Done.
You completed this book in 1 year, 8 months, 2 weeks, 22 hours, 12 minutes, 20 seconds, great job!
Congrats on starting Release It! by Michael T. Nygard, I hope you enjoy it! It has an average of 5/5 stars and 2 ratings on Google Books.
Book details (JSON)
```json { "title": "Release It!", "authors": [ "Michael T. Nygard" ], "publisher": "Pragmatic Bookshelf", "publishedDate": "2018-01-08", "description": "A single dramatic software failure can cost a company millions of dollars - but can be avoided with simple changes to design and architecture. This new edition of the best-selling industry standard shows you how to create systems that run longer, with fewer failures, and recover better when bad things happen. New coverage includes DevOps, microservices, and cloud-native architecture. Stability antipatterns have grown to include systemic problems in large-scale systems. This is a must-have pragmatic guide to engineering for production systems. If you're a software developer, and you don't want to get alerts every night for the rest of your life, help is here. With a combination of case studies about huge losses - lost revenue, lost reputation, lost time, lost opportunity - and practical, down-to-earth advice that was all gained through painful experience, this book helps you avoid the pitfalls that cost companies millions of dollars in downtime and reputation. Eighty percent of project life-cycle cost is in production, yet few books address this topic. This updated edition deals with the production of today's systems - larger, more complex, and heavily virtualized - and includes information on chaos engineering, the discipline of applying randomness and deliberate stress to reveal systematic problems. Build systems that survive the real world, avoid downtime, implement zero-downtime upgrades and continuous delivery, and make cloud-native applications resilient. Examine ways to architect, design, and build software - particularly distributed systems - that stands up to the typhoon winds of a flash mob, a Slashdotting, or a link on Reddit. Take a hard look at software that failed the test and find ways to make sure your software survives. To skip the pain and get the experience...get this book.", "image": "http://books.google.com/books/content?id=Ug9QDwAAQBAJ&printsec=frontcover&img=1&zoom=1&edge=curl&source=gbs_api", "language": "en", "averageRating": 5, "ratingsCount": 2, "categories": [ "Computers" ], "pageCount": 378, "isbn10": "1680504525", "isbn13": "9781680504521", "googleBooks": { "id": "Ug9QDwAAQBAJ", "preview": "http://books.google.com/books?id=Ug9QDwAAQBAJ&printsec=frontcover&dq=intitle:Release+It!&hl=&cd=1&source=gbs_api", "info": "https://play.google.com/store/books/details?id=Ug9QDwAAQBAJ&source=gbs_api", "canonical": "https://play.google.com/store/books/details?id=Ug9QDwAAQBAJ" } } ```