Open lzh2nix opened 3 years ago
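In the opening, the author mentions the traditional way of learning: studying man pages. That approach really is inefficient, and I know it from experience. I used to run man to look up how a single command works, but it was hard to string several commands together; only after going through the <<linux 性能分析与优化>> column did they connect. This is the "know-how" mentioned here: experience passed on by others that tells you how to locate a problem when one shows up in production, and which tools and methods are available. Some key metrics:
Performance optimization is not necessarily about cost; sometimes it also brings a big improvement in customer experience (latency optimization). A $300M cable was built between New York and London just to shave 6 ms off the latency, which shows how critical performance is in some systems.
Two causes of performance problems:
When a problem occurs, the first thing we think of is that system load has gone up, and we assume the cause is external. Software and hardware configuration is easy to overlook during analysis.
Locating a performance problem still has to start from the whole picture: CPU, network, disk, memory, and the application's process model.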
Known-Unknowns
Known-knowns: These are things you know. You know you should be checking a performance metric, and you know its current value. For example, you know you should be checking CPU utilization, and you also know that the value is 10% on average.
Known-unknowns: These are things you know that you do not know. You know you can check a metric or the existence of a subsystem, but you haven’t yet observed it. For example, you know you could use profiling to check what is making the CPUs busy, but have yet to do so.
Unknown-unknowns: These are things you do not know that you do not know. For example, you may not know that device interrupts can become heavy CPU consumers, so you are not checking them.
Performance is a field where “the more you know, the more you don’t know.” The more you learn about systems, the more unknown-unknowns you become aware of, which are then known-unknowns that you can check on.
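Two approaches to performance analysis:
Resource based analysis (system admin approach): analysis at the resource level, focusing mainly on resource utilization. It usually relies on IOPS, throughput, utilization, and saturation; the main tools are vmstat, iostat, and mpstat.
Workload based analysis (developer approach): looks at system performance from the outside, usually through requests (request count, QPS), latency, and completion (error rate). These metrics can all be observed with tools such as Prometheus, and the analysis of slow requests is especially useful.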
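As a rough illustration of the workload view, the sketch below summarizes a window of completed requests into request rate, latency percentiles, error rate, and a slow-request count. The `Request` record, the `summarize` helper, and the 500 ms slow threshold are assumptions made for this example; they do not come from the book or from Prometheus.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One completed request (hypothetical record, for illustration only)."""
    latency_ms: float
    ok: bool


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list, with p in (0, 100]."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(requests, window_seconds, slow_ms=500.0):
    """Summarize requests, latency, and completion for one time window."""
    if not requests:
        return {"qps": 0.0, "p50_ms": None, "p99_ms": None,
                "error_rate": 0.0, "slow": 0}
    latencies = sorted(r.latency_ms for r in requests)
    errors = sum(1 for r in requests if not r.ok)
    return {
        "qps": len(requests) / window_seconds,           # requests
        "p50_ms": percentile(latencies, 50),             # latency
        "p99_ms": percentile(latencies, 99),
        "error_rate": errors / len(requests),            # completion
        "slow": sum(1 for v in latencies if v >= slow_ms),
    }
```

The p99 and the slow count are there because an average latency tends to hide exactly the slow requests this approach is meant to find.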
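N ways to locate a problem:
Streetlight Anti-Method: habitually running top without knowing whether top will help or which tools could take the analysis further. The example given here is fun: "a drunk looks for his keys under a streetlight, not because he lost them there but because that is where the light is best."
Random Change Anti-Method: analysis by guessing. Change one parameter --> verify --> change another parameter --> verify ..., a typical brute-force search.
Blame-Someone-Else Anti-Method: passing the buck.
Ad Hoc Checklist Method: draw up a checklist for locating production problems, and spell out in the checklist what each metric means.
Problem Statement: investigate with a clear question in mind.
USE Method: for every resource, check utilization, saturation, and errors. This is also the method the book uses.
Some USE-based checks: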
Resource | Type | Metric |
---|---|---|
CPU | Utilization | CPU utilization (either per CPU or a system-wide average) |
CPU | Saturation | Run queue length, scheduler latency, CPU pressure (Linux PSI) |
Memory | Utilization | Available free memory (system-wide) |
Memory | Saturation | Swapping (anonymous paging), page scanning, out-of-memory events, memory pressure (Linux PSI) |
Network interface | Utilization | Receive throughput/max bandwidth, transmit throughput/max bandwidth |
Storage device I/O | Utilization | Device busy percent |
Storage device I/O | Saturation | Wait queue length, I/O pressure (Linux PSI) |
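
To make two rows of the table concrete, here is a minimal sketch that derives CPU utilization from two samples of the aggregate `cpu` line in `/proc/stat`, and reads the `avg10` value from PSI text such as the contents of `/proc/pressure/cpu`. The function names are made up for this example, and counting iowait as idle time is an assumption of this sketch.

```python
def cpu_times(stat_line):
    """Parse the aggregate 'cpu' line of /proc/stat into (busy, total) ticks."""
    fields = stat_line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # user nice system idle iowait irq softirq steal
    # (guest and guest_nice are skipped: they are already part of user and nice)
    values = [int(v) for v in fields[1:9]]
    idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
    total = sum(values)
    return total - idle, total


def cpu_utilization(before, after):
    """Fraction of CPU time spent busy between two /proc/stat samples."""
    busy0, total0 = cpu_times(before)
    busy1, total1 = cpu_times(after)
    elapsed = total1 - total0
    return (busy1 - busy0) / elapsed if elapsed > 0 else 0.0


def psi_avg10(pressure_text, kind="some"):
    """Return the avg10 percentage from /proc/pressure/<resource> contents."""
    for line in pressure_text.splitlines():
        parts = line.split()
        if parts and parts[0] == kind:
            pairs = dict(p.split("=", 1) for p in parts[1:])
            return float(pairs["avg10"])
    raise ValueError(f"no '{kind}' line in pressure data")
```

On a Linux host the inputs could come from `open("/proc/stat").readline()` taken a second or so apart and from `open("/proc/pressure/cpu").read()`; the PSI files exist only on kernels built with PSI support.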
Besides the CPU/Network/Memory/Disk resources above, the following resources can be analyzed at the application level:
Mutex locks: Utilization may be defined as the time the lock was held, saturation by those threads queued waiting on the lock.
Thread pools: Utilization may be defined as the time threads were busy processing work, saturation by the number of requests waiting to be serviced by the thread pool.
Process/thread capacity: The system may have a limited number of processes or threads, whose current usage may be defined as utilization; waiting on allocation may be saturation; and errors are when the allocation failed (e.g., “cannot fork”).
File descriptor capacity: Similar to process/thread capacity, but for file descriptors.
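The mutex definition above can be sketched as a small wrapper around `threading.Lock` that records how long the lock was held (utilization) and how many threads are currently waiting for it (saturation). `InstrumentedLock` is a hypothetical helper written for these notes, not a standard-library class, and its bookkeeping adds overhead of its own.

```python
import threading
import time


class InstrumentedLock:
    """threading.Lock wrapper that records hold time and waiting threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._meta = threading.Lock()   # protects the counters below
        self._created = time.monotonic()
        self._acquired_at = None
        self.held_seconds = 0.0         # total time the lock has been held
        self.waiters = 0                # threads that asked but do not hold it yet

    def __enter__(self):
        with self._meta:
            self.waiters += 1
        self._lock.acquire()
        with self._meta:
            self.waiters -= 1
            self._acquired_at = time.monotonic()
        return self

    def __exit__(self, exc_type, exc, tb):
        with self._meta:
            self.held_seconds += time.monotonic() - self._acquired_at
            self._acquired_at = None
        self._lock.release()
        return False

    def utilization(self):
        """Fraction of the wrapper's lifetime during which the lock was held."""
        elapsed = time.monotonic() - self._created
        with self._meta:
            return self.held_seconds / elapsed if elapsed > 0 else 0.0
```

Used as `with lock: ...`, a utilization close to 1.0 together with a `waiters` count that stays above zero would point to the lock as a bottleneck.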
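RED Method: Request rate, Errors, Duration
Workload Characterization: workload-based analysis (the three Ws and one H: who, why, what, how)
Capacity planning: plan capacity from metrics, for example scale out once CPU utilization reaches 60%, e.g. with an AWS ASG. Capacity planning is even easier in a k8s environment: watch the relevant metrics and scale in or scale out directly.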
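A threshold rule like the 60% example can be sketched as a proportional calculation: the replica count grows or shrinks by the ratio of observed to target utilization. This follows the formula documented for the Kubernetes Horizontal Pod Autoscaler, but leaves out its tolerance and stabilization behavior; `desired_replicas` and its default limits are assumptions made for this example.

```python
def desired_replicas(current, util_pct, target_pct=60,
                     min_replicas=1, max_replicas=20):
    """Replica count that would bring average utilization back to target_pct.

    current:    replicas running now
    util_pct:   observed average utilization, as an integer percentage
    target_pct: utilization to aim for (60 mirrors the example in these notes)
    """
    if current < 1 or target_pct <= 0:
        raise ValueError("current must be >= 1 and target_pct must be > 0")
    # ceil(current * util_pct / target_pct) using integer arithmetic only
    wanted = -((-current * util_pct) // target_pct)
    return max(min_replicas, min(max_replicas, wanted))
```

For example, 4 replicas at 90% utilization with a 60% target yield 6 replicas, and the same 4 replicas at 30% yield 2.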
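Exposing parts of the kernel's functionality to user space looks set to be a hot area: DPDK, eBPF, user-mode syscall, memory mapping, kernel bypass.
Some common interrupts:
At the system level, the key is to master a few stacks (network stack, fs stack, memory stack).
Go-to tools for viewing the load on each system module:
Go-to tools for static analysis:
Minimum usable observability tools
sar guide:
This chapter mentions too many tools; just come back to it when needed.
I skimmed the first edition while working through the Geek Time column <<linux 性能分析与优化>>. The second edition introduces a lot of new material, so I am reading it a second time and organizing the key points along the way.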
Contents