bvvkrishna opened this issue 3 years ago
Hello @bvvkrishna, thank you for sharing your inputs. Here is my take on your suggestions.
Even in the case of an ML or offline job, metrics are collected in real time while the job runs to monitor its execution. It is just that such jobs are scheduled to run at different times; ideally, these jobs too should emit metrics during their run time.
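To make that concrete, here is a minimal sketch of a batch job emitting metrics at run time, assuming the Python prometheus_client library and a Pushgateway listening on localhost:9091; the job and metric names are made up for illustration.

```python
# Minimal sketch: a batch/offline job pushing metrics when it runs.
# Assumes prometheus_client and a Pushgateway at localhost:9091;
# the job and metric names below are hypothetical.
import time
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def run_training_job():
    time.sleep(1)  # stand-in for the actual ML/offline work

registry = CollectorRegistry()
duration = Gauge('batch_job_duration_seconds',
                 'Wall-clock run time of the batch job', registry=registry)
last_success = Gauge('batch_job_last_success_unixtime',
                     'Unix time of the last successful run', registry=registry)

start = time.time()
run_training_job()
duration.set(time.time() - start)
last_success.set_to_current_time()

# One push per run; the monitoring system then sees this job
# just like any continuously running service.
push_to_gateway('localhost:9091', job='nightly_model_training', registry=registry)
```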
Adding more resources based on projections is not a problem but an efficient way of utilizing resources. If I measure something regularly, I can identify anomalies, and when such anomalies are detected I take action on them. Whether I am fixing the problem permanently or just mitigating it temporarily depends entirely on the actions I perform.
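As a rough illustration of "measure regularly, identify anomalies", here is a minimal sketch that flags values far from a rolling baseline; the window size and threshold are arbitrary choices for the example, not recommendations.

```python
# Minimal sketch: flag samples that deviate sharply from a rolling baseline.
from collections import deque
from statistics import mean, stdev

def detect_anomalies(samples, window=30, threshold=3.0):
    history = deque(maxlen=window)
    for value in samples:
        if len(history) == window:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield value  # anomaly: decide whether to fix or mitigate
        history.append(value)
```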
Yes, Availability is something that can be added to the content. Let me discuss this with the team.
We initially had the same thought but realized it is more impactful to add details on why the mean/median may not be a good measure and then explain percentiles, rather than adding a one-line description. We found a good blog post explaining this with examples, so we linked to it where percentiles are referenced in the content.
The last two suggestions are more about incident response than about monitoring, so I believe it is better to keep them under a separate course.
Also, we welcome any contribution from the community.
First of all, I would like to say a huge thanks to you guys for sharing SRE knowledge with the community. It is really useful and brings visibility to how important SREs are for a company and what the expectations of this role are.
I have looked at the Metrics and Monitoring section and have some suggestions. Please check.
The statement "Monitoring is a process of collecting real-time performance metrics from a system" might not be correct for all use cases. There are certain ML or offline jobs which are measured once in a day or hour so we cannot say real-time performance metrics.
The statement "What gets measured, gets fixed" might not be true. For instance, lets say if an ecommerce system is experiencing huge traffic because of lot of requests from a single IP(DDOS attack) they will throttle the requests after a certain threshold or block but it is not fixing the problem rather i would say mitigating it. Similarly if an ecommerce systems is expecting to receive high traffic during sale event they might add hosts prior to the event(based on projection) to accomodate the traffic but does not mean we are fixing the problem rather finding a way to handle it.
In the four golden signals of monitoring, I think we should also have Availability as a key metric, which would help us understand what percentage of time the service is available.
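For reference, the availability percentage I have in mind is the usual uptime ratio, sketched below; the downtime figure is just an example.

```python
# Minimal sketch: availability as the percentage of time the service is up.
def availability_percent(uptime_s: float, downtime_s: float) -> float:
    return 100.0 * uptime_s / (uptime_s + downtime_s)

# ~43 minutes of downtime in a 30-day month is roughly 99.9% ("three nines").
month_s = 30 * 24 * 3600
print(availability_percent(month_s - 43 * 60, 43 * 60))
```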
In the basic terminologies of monitoring, we should also add what a percentile is, because percentiles are among the most frequently used measurements in monitoring and engineers often get confused by them.
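To show what I mean, here is a small sketch of why the mean can mislead and what p50/p99 capture instead; the latency numbers are made up.

```python
# Minimal sketch: mean vs. percentiles on a made-up latency sample.
import math

latencies_ms = [11, 11, 12, 12, 12, 12, 13, 13, 14, 950]  # one slow outlier

def percentile(data, p):
    """Nearest-rank percentile: smallest value with at least p% of samples <= it."""
    ranked = sorted(data)
    rank = math.ceil(p / 100 * len(ranked))
    return ranked[rank - 1]

print(sum(latencies_ms) / len(latencies_ms))  # mean: 106 ms, which no request actually saw
print(percentile(latencies_ms, 50))           # p50: 12 ms, the typical request
print(percentile(latencies_ms, 99))           # p99: 950 ms, the tail users do feel
```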
In Command line tools, we should also add the du command to get the disk usage of directories, since df shows free space at the file system level. We should also add the ping, telnet, vmstat, and lsof commands, as I commonly see these used in the operations world.
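For context, typical invocations of those tools might look like the following; the hostnames and port numbers are placeholders.

```console
$ du -sh /var/log         # total disk usage of a directory tree
$ df -h                   # free space per mounted file system
$ ping -c 4 example.com   # reachability and round-trip time
$ telnet db-host 5432     # does this TCP port accept connections?
$ vmstat 1 5              # CPU/memory/IO snapshots: 5 samples, 1s apart
$ lsof -i :8080           # which process holds a given port
```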
In Best Practices for Monitoring, we should call out that when a production problem happens, we should first try to bring the system back to a stable state rather than trying to fix the problem immediately, because getting the service under control is more important than fixing the problem itself.
In Best Practices for Monitoring, we should also add "Never hesitate to escalate to the right team if needed." Since every issue mitigation has its own SLA, we should escalate to the right owner when needed rather than deep-diving ourselves and breaching the SLA, which could impact customers.