Rereading "Systems Performance" (性能之巅) - GitHub issues

lzh2nix commented 3 years ago

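I skimmed the first edition once while working through the Geek Time (极客时间) course "Linux Performance Analysis and Optimization". The second edition adds a lot of new material, so I am reading it a second time and organizing the key points as I go.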

CH2(2021.05.09)

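At the start, the author mentions the traditional way of learning: studying man pages. That approach really is inefficient, and I know it from experience. I used to run man to check how a single command works, but it was hard to string multiple commands together; only after reading the "Linux Performance Analysis and Optimization" column did they connect. This is the "know-how" mentioned here: learning from others' experience how to locate a production problem and which tools and methods are available.

Some key metrics: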

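IOPS: number of I/O operations per unit time (ops)
Throughput: the rate of work performed, in bytes per second (Bps) or bits per second (bps)
Response Time: the time for an operation to complete
Latency: delay (consider the system's latency end to end: DNS latency, TCP connection latency, application response latency, etc.)
Utilization: how busy a resource is (time-based (U=B/T) or capacity-based)
Saturation: how much work arrives versus how much can be processed. It usually appears when system load rises and climbs sharply once the system hits a bottleneck; an easily overlooked metric.
Bottleneck: the resource that limits the system (the "shortest stave of the barrel" effect; it can also trigger chain reactions)
Workload: the load applied to the system
Cache: the cache hit ratio (another easily overlooked metric; performance is nonlinear in the hit ratio, so going from 98% to 99% improves performance far more than going from 10% to 11%).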
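
To make the cache hit ratio point concrete, here is a minimal sketch that models the average cost of an access as a weighted mix of hits and misses. The cost values (a hit costs 1 unit, a miss costs 100) and the function names are assumptions chosen for illustration, not numbers from the book.

```python
def average_access_time(hit_ratio, hit_cost=1.0, miss_cost=100.0):
    """Expected cost per access for a cache with the given hit ratio.

    hit_cost and miss_cost are hypothetical units picked for illustration.
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * hit_cost + (1.0 - hit_ratio) * miss_cost


def speedup(old_ratio, new_ratio, **costs):
    """How many times cheaper the average access becomes."""
    old = average_access_time(old_ratio, **costs)
    new = average_access_time(new_ratio, **costs)
    return old / new


if __name__ == "__main__":
    # Same one-point gain in hit ratio, very different payoff.
    for old, new in [(0.10, 0.11), (0.98, 0.99)]:
        print(f"{old:.0%} -> {new:.0%}: {speedup(old, new):.3f}x")
```

Under these assumed costs, moving from 10% to 11% makes the average access about 1% cheaper, while moving from 98% to 99% makes it roughly 1.5 times cheaper, because misses dominate the cost at high hit ratios.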

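Performance optimization is not only about cost; sometimes it greatly improves the customer experience (latency optimization). A $300M cable was laid between New York and London just to cut latency by 6 ms, which shows how critical performance is for some systems.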

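Two causes of performance problems:

Increased system load
Misconfigured software or hardware

When a problem occurs, increased load is the first thing that comes to mind, and we assume the cause is external. Software and hardware configuration is easy to overlook during analysis.

Locating a performance problem still has to start from the whole picture: CPU, network, disk, memory, and the application's process model.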

Known-Unknowns

Known-knowns: These are things you know. You know you should be checking a performance metric, and you know its current value. For example, you know you should be checking CPU utilization, and you also know that the value is 10% on average.
Known-unknowns: These are things you know that you do not know. You know you can check a metric or the existence of a subsystem, but you haven’t yet observed it. For example, you know you could use profiling to check what is making the CPUs busy, but have yet to do so.
Unknown-unknowns: These are things you do not know that you do not know. For example, you may not know that device interrupts can become heavy CPU consumers, so you are not checking them.

Performance is a field where “the more you know, the more you don’t know.” The more you learn about systems, the more unknown-unknowns you become aware of, which are then known-unknowns that you can check on.

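Two approaches to performance analysis:

Resource based analysis (system admin approach): analysis at the resource level, focusing on resource utilization. It usually works with IOPS, throughput, utilization, and saturation; the main tools are vmstat, iostat, and mpstat.
Workload based analysis (developer approach): looks at system performance from the outside, usually through requests (request rate, QPS), latency, and completion (error rate). These metrics can all be observed with tools such as Prometheus, and the approach is especially useful for analyzing slow requests.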

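Methods for locating problems:

Streetlight Anti-Method: typing top out of habit, without knowing whether top will help or which tools could take the analysis further. The example given is amusing: "a drunk looks for his keys under a streetlight, not because he lost them there, but because that is where the light is best".
Random Change Anti-Method: analysis by guessing. Change one parameter, verify, change another parameter, verify, and so on; a typical brute-force search.
Blame-Someone-Else Anti-Method: passing the buck.
Ad Hoc Checklist Method: draw up a checklist for locating production problems and spell out what each metric on it means.
Problem Statement: investigate with specific questions in mind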
- What makes you think there is a performance problem?
- Has this system ever performed well?
- What changed recently? Software? Hardware? Load?
- Can the problem be expressed in terms of latency or runtime?
- Does the problem affect other people or applications (or is it just you)?
- What is the environment? What software and hardware are used? Versions? Configuration?
USE Method: for every resource, check utilization, saturation, and errors. This is also the method the book uses.
- Utilization: As a percent over a time interval (e.g., “One CPU is running at 90% utilization”)
- Saturation: As a wait-queue length (e.g., “The CPUs have an average run-queue length of four”)
- Errors: Number of errors reported (e.g., “This disk drive has had 50 errors”)

Some checks based on the USE method:

Resource	Type	Metric
CPU	Utilization	CPU utilization (either per CPU or a system-wide average)
CPU	Saturation	Run queue length, scheduler latency, CPU pressure (Linux PSI)
Memory	Utilization	Available free memory (system-wide)
Memory	Saturation	Swapping (anonymous paging), page scanning, out-of-memory events, memory pressure (Linux PSI)
Network interface	Utilization	Receive throughput/max bandwidth, transmit throughput/max bandwidth
Storage device I/O	Utilization	Device busy percent
Storage device I/O	Saturation	Wait queue length, I/O pressure (Linux PSI)
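
The USE checks above can be sketched as a small function that takes one observation interval for a resource and reports utilization, saturation, and errors. This is only an illustration: the `ResourceSample` type, its field names, and the 70% utilization threshold are assumptions of this sketch, and in practice the inputs would come from tools such as iostat or mpstat.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceSample:
    """One observation interval for a resource (hypothetical structure)."""
    name: str
    busy_seconds: float       # B: time the resource was busy
    interval_seconds: float   # T: length of the observation interval
    queue_lengths: list = field(default_factory=list)  # sampled wait-queue lengths
    errors: int = 0           # error events reported in the interval


def use_report(sample, util_threshold=0.7):
    """Apply the USE questions to a single sample.

    util_threshold is an arbitrary illustrative cutoff, not a value from the book.
    """
    if sample.interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    utilization = sample.busy_seconds / sample.interval_seconds  # U = B / T
    if sample.queue_lengths:
        saturation = sum(sample.queue_lengths) / len(sample.queue_lengths)
    else:
        saturation = 0.0

    findings = []
    if utilization >= util_threshold:
        findings.append("high utilization")
    if saturation > 0:
        findings.append("saturated")
    if sample.errors > 0:
        findings.append("errors")
    return {
        "resource": sample.name,
        "utilization": utilization,
        "saturation": saturation,
        "errors": sample.errors,
        "findings": findings,
    }


if __name__ == "__main__":
    disk = ResourceSample("disk0", busy_seconds=54, interval_seconds=60,
                          queue_lengths=[0, 2, 4])
    print(use_report(disk))
```

Saturation is modeled here as the average wait-queue length, matching the wait-queue definition of saturation given earlier; other saturation signals such as Linux PSI would need their own inputs.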

Besides the CPU, network, memory, and disk covered above, the following resources can be analyzed at the application level:

Mutex locks: Utilization may be defined as the time the lock was held, saturation by those threads queued waiting on the lock.
Thread pools: Utilization may be defined as the time threads were busy processing work, saturation by the number of requests waiting to be serviced by the thread pool.
Process/thread capacity: The system may have a limited number of processes or threads, whose current usage may be defined as utilization; waiting on allocation may be saturation; and errors are when the allocation failed (e.g., “cannot fork”).
File descriptor capacity: Similar to process/thread capacity, but for file descriptors.
RED Method: Request rate, Errors, Duration
Workload Characterization: analysis based on the workload (the "three Ws and one H" questions)
- Who is causing the load? Process ID, user ID, remote IP address?
- Why is the load being called? Code path, stack trace?
- What are the load characteristics? IOPS, throughput, direction (read/write), type? Include variance (standard deviation) where appropriate.
- How is the load changing over time? Is there a daily pattern?
Drill-Down Analysis: a layer-by-layer approach that starts from monitoring and works step by step toward the root cause.
- Monitoring: This is used for continually recording high-level statistics over time, and identifying or alerting if a problem may be present.
- Identification: Given a suspected problem, this narrows the investigation to particular resources or areas of interest, identifying possible bottlenecks.
- Analysis: Further examination of particular system areas is done to attempt to root cause and quantify the issue.
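
As a rough illustration of the RED method listed above, the sketch below derives request rate, error ratio, and duration percentiles from a batch of request records. The record layout (a `(duration_seconds, ok)` tuple), the function name, and the nearest-rank percentile rule are assumptions of this sketch; a monitoring system such as Prometheus would normally compute these from its own metric types.

```python
import math


def red_metrics(requests, window_seconds):
    """Compute RED metrics for requests observed in one time window.

    requests: list of (duration_seconds, ok) tuples (hypothetical layout).
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    total = len(requests)
    errors = sum(1 for _, ok in requests if not ok)
    durations = sorted(duration for duration, _ in requests)

    def percentile(p):
        # Nearest-rank percentile; returns None when there is no data.
        if not durations:
            return None
        rank = max(math.ceil(p / 100 * len(durations)), 1)
        return durations[rank - 1]

    return {
        "rate": total / window_seconds,                    # Request rate
        "error_ratio": errors / total if total else 0.0,   # Errors
        "p50": percentile(50),                             # Duration
        "p99": percentile(99),
    }


if __name__ == "__main__":
    sample = [(0.1, True), (0.2, True), (0.3, False), (0.4, True)]
    print(red_metrics(sample, window_seconds=2))
```

Reporting a high percentile alongside the median helps surface the slow requests that an average would hide.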