kunpengcompute / kunpengcompute.github.io

Kunpeng Tech Blog: https://kunpengcompute.github.io/
Apache License 2.0

Running a benchmark? Mind your cores #26

Open bzhaoopenstack opened 4 years ago

bzhaoopenstack commented 4 years ago

Translator: bzhaoopenstack  Author: Krunal Bauskar  Original post: https://mysqlonarm.github.io/Benchmarking-Mind-Your-Core/

Recently we noticed MySQL throughput fluctuating while running benchmarks. Even ordinary users run into this, but there are so many other things to watch out for (IO bottlenecks in particular) that the aspects we discuss today are easily overlooked. In this article we look at one cause that can affect MySQL performance.


Scheduling threads on a NUMA-enabled VM/machine

NUMA is usually looked at from a memory-allocation perspective, but this article explores how starting threads on different vCPUs can affect performance in a big way. In our experiments we have seen performance swings of up to 66%.

MySQL has an option named innodb_numa_interleave that, if enabled, tries to allocate the buffer pool uniformly across the NUMA nodes. That is good, but what about the worker threads? Are they distributed just as uniformly across the NUMA nodes? Cross-NUMA access is more expensive, so keeping a worker thread close to its data is preferable; yet given the generic nature of these worker threads, shouldn't they be spread evenly?
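For reference, a minimal sketch of how you might check this setting (the login parameters are placeholders for your own environment):

```bash
# Check whether the buffer pool is being interleaved across NUMA nodes.
# innodb_numa_interleave is an existing InnoDB option; adjust the
# connection parameters to match your setup.
mysql -uroot -p -e "SHOW GLOBAL VARIABLES LIKE 'innodb_numa_interleave';"

# To turn it on, add the following under [mysqld] in my.cnf and restart:
#   innodb_numa_interleave = ON
```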

Say I start 12 worker threads on a 24-vCPU machine with 2 NUMA nodes. A uniform distribution would bind 6 worker threads to vCPUs on NUMA node 0 and the remaining 6 to vCPUs on NUMA node 1.
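To see how such a machine's vCPUs map onto NUMA nodes, standard Linux tools can be used (a quick sketch):

```bash
# Show which vCPUs and how much memory belong to each NUMA node.
numactl --hardware

# A shorter summary of the NUMA layout.
lscpu | grep -i numa
```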

The OS (Linux) scheduler does not work that way. It tends to exhaust the vCPUs of one NUMA node before moving on to the other.

All of this can affect performance significantly as long as the number of worker threads (scalability) is below the number of cores. You may even see different results for the same test case, caused by the high cost of cross-NUMA access, unbalanced OS scheduling, and core switches.

Setting up the experiment:

Now let's see how MySQL throughput changes depending on where the worker threads are located.

I use the same machine to run both the client (sysbench) and the server, so the client also takes up a few cores. We therefore also consider the placement of the client threads, since it is an important aspect of running a benchmark (unless you plan to use a dedicated client machine).
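Since both processes share the machine, it is worth verifying which CPUs each one is actually allowed to run on; one way to do that (a sketch, assuming mysqld and sysbench are already running):

```bash
# Show the set of CPUs each process is currently allowed to run on.
taskset -cp "$(pidof mysqld)"
taskset -cp "$(pidof sysbench)"

# The same information straight from the kernel.
grep Cpus_allowed_list "/proc/$(pidof mysqld)/status"
```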

  • 24 vCPU / 48 GB VM with 2 NUMA nodes:
    • NUMA node 0: vCPUs 0-11, 24 GB
    • NUMA node 1: vCPUs 12-23, 24 GB
  • The x86 VM (Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz) has 2 threads per physical core, so 24 vCPUs = 12 physical cores. We therefore also explore cases where the two worker threads sit on different vCPUs of the same physical core.
  • Test case: oltp-point-select with 2 threads. It is deliberately limited to 2 threads so that other cores stay free and the OS is able to perform core switches (which has its own interesting effect). In addition, all test data is in memory and point-select performs no IO, so there are no IO bottlenecks and the background threads are mostly idle. Every iteration runs for 60 seconds.
  • vCPUs/cores are bound to sysbench and mysqld with numactl (rather than taskset) because of the flexibility it offers; see the sketch after this list.
  • For the server configuration see here. Data size: 34 GB, buffer pool: 36 GB. The test data is generated in memory and split evenly: 50% on NUMA node 0 and 50% on NUMA node 1. Sysbench uses rand-type=uniform, which makes it touch most of the different parts of the test tables.
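As a rough illustration of the bindings (not the exact commands from the post; socket path, credentials and table layout below are placeholders), the setup in the first table below could be reproduced roughly like this:

```bash
# Pin the server to vCPUs 2-11 and 14-23 (memory policy left unchanged).
numactl --physcpubind=2-11,14-23 mysqld --defaults-file=/etc/my.cnf &

# Pin the sysbench client to vCPUs 0, 1, 12, 13 and run the point-select
# test with 2 threads for 60 seconds.
numactl --physcpubind=0,1,12,13 sysbench oltp_point_select \
    --mysql-socket=/tmp/mysql.sock --mysql-user=sbtest --mysql-password=sbtest \
    --tables=8 --table-size=1000000 --rand-type=uniform \
    --threads=2 --time=60 run
```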
| client threads | server threads | tps |
| --- | --- | --- |
| Bound to vCPUs 0, 1, 12, 13 | Bound to vCPUs 2-11, 14-23 | 35188, 37426, 35140, 37640, 37625, 35574, 35709, 37680 |

Clearly the TPS is fluctuating. A closer look revealed that the OS keeps performing core switches, which makes the TPS fluctuate (a 7% range is far too high for a small test case like this). The OS also keeps switching the client threads between cores.
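One way to watch these core switches as they happen (a sketch; psr is the vCPU a thread was last scheduled on):

```bash
# Sample once per second which vCPU (psr) each mysqld thread is running on;
# threads whose psr value keeps changing are being moved between cores.
watch -n 1 "ps -L -o tid,psr,pcpu,comm -p $(pidof mysqld)"
```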

That prompted me to explore server core binding in more detail. For completeness I also looked at client thread placement.

Position of client and server threads

In the table below, the columns describe the server-thread binding and the rows describe the client-thread binding; each cell lists the TPS of three runs.

| Client threads \ Server threads | NUMA 0, physical cores 2-5, vCPUs 4-11 (core switches likely, OS-scheduler dependent) | NUMA 0, physical cores 2-3, vCPUs 4 and 6 (core switches unlikely; different physical cores) | NUMA 0, physical core 2, vCPUs 4 and 5 (core switches unlikely; same physical core) |
| --- | --- | --- | --- |
| NUMA 0, physical core 0, vCPUs 0,1 (same NUMA as server; client threads on the same physical core) | 39570, 38656, 39633 | 39395, 39481, 39814 | 39889, 40270, 40457 |
| NUMA 0, physical cores 0-1, vCPUs 0,2 (same NUMA; client threads on different physical cores) | 39890, 38698, 40005 | 40068, 40309, 39961 | 40680, 40571, 40481 |
| NUMA 0, physical core 0, vCPU 0 (same NUMA; both client threads on the same vCPU) | 37642, 39730, 35984 | 40426, 40063, 40200 | 40292, 40158, 40125 |
| NUMA 1, physical core 6, vCPUs 12,13 (different NUMA; client threads on the same physical core) | 34224, 34463, 34295 | 34518, 34418, 34436 | 34282, 34512, 34583 |
| NUMA 1, physical cores 6-7, vCPUs 12,14 (different NUMA; client threads on different physical cores) | 34462, 34127, 34620 | 34438, 34379, 34419 | 34804, 34453, 34729 |
| NUMA 1, physical core 6, vCPU 12 (different NUMA; both client threads on the same vCPU) | 34989, 35162, 35245 | 35503, 35455, 35632 | 35572, 35481, 35692 |

Observations:

  • Limiting the cores available to the server threads helps stabilize performance (less jitter). OS core switches are costly (with varying scalability this may not be feasible, but it is a good point to understand).
  • Moving the client threads to a different NUMA node affects performance in a big way (40K -> 34K). I did not expect this, since the real work is done by the server worker threads, so moving the client threads should not affect server performance to this extent (17%).

So from the experiment we learned that keeping client and server threads on the same NUMA node, and using some mechanism to reduce OS core switches (until you really need to scale out to more cores), helps achieve optimal performance.
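A minimal sketch of that "co-locate on one NUMA node" setup, using numactl's node-level binding (the memory policy is left untouched here so the 50/50 data split described above is not disturbed):

```bash
# Run both the server and the client on the CPUs of NUMA node 0.
numactl --cpunodebind=0 mysqld --defaults-file=/etc/my.cnf &
numactl --cpunodebind=0 sysbench oltp_point_select --threads=2 --time=60 run   # connection options omitted
```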

But wait! Our goal is a balanced distribution of client and server threads across the NUMA nodes that still delivers optimal performance.

Balancing client and server threads across NUMA

Let's apply the knowledge gained above to a balanced NUMA configuration.

| client threads | server threads | tps | remark |
| --- | --- | --- | --- |
| Bound to vCPUs 0, 1, 12, 13 | Bound to vCPUs 2-11, 14-23 | 35188, 37426, 35140, 37640, 37625, 35574, 35709, 37680 | Lots of core switches |
| Bound to one specific vCPU on each NUMA node (0, 12) | Bound to one specific vCPU on each NUMA node (4, 16) | 30001, 36160, 24403, 24354, 37708, 24478, 36323, 24579 | Core switches limited |

Oops, it turned out worse than expected: the fluctuation increased. Let's look at what went wrong.

  • 24K: the OS chose a skewed distribution, with NUMA-x running both client threads and NUMA-y running both server threads.
  • 37K: the OS chose a well-balanced distribution, with each NUMA node running 1 client thread and 1 server thread.

(All the other numbers come from mixtures of these combinations.)

Let's try a possible hint: NUMA balancing. You can read more about it here.

echo 0 > /proc/sys/kernel/numa_balancing
| client threads | server threads | tps | remark |
| --- | --- | --- | --- |
| Bound to one specific vCPU on each NUMA node (0, 12) | Bound to one specific vCPU on each NUMA node (4, 16) | 33628, 34190, 35380, 37572 | Core switches limited + NUMA balancing disabled. Jitter is still there, but clearly better than the 24K case above. |
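For completeness, the current state of automatic NUMA balancing can also be inspected, and the change made persistent, through the standard sysctl interface (a small sketch; verify the knob exists on your kernel):

```bash
# Show the current setting (1 = automatic NUMA balancing enabled).
cat /proc/sys/kernel/numa_balancing

# Equivalent to the echo command above; add kernel.numa_balancing=0 to
# /etc/sysctl.conf (or a drop-in file) to make it persist across reboots.
sysctl -w kernel.numa_balancing=0
```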
  • What if we bind the client threads to cores of one specific NUMA node and spread the server threads across NUMA nodes?
| client threads | server threads | tps | remark |
| --- | --- | --- | --- |
| Bound to a specific vCPU on NUMA node 0 (vCPU 0) | Bound to one specific vCPU on each NUMA node (4, 16) | 36742, 36326, 36701, 36570 | Core switches limited + NUMA balancing disabled. Looks well balanced now. |
| Bound to a specific vCPU on NUMA node 1 (vCPU 12) | Bound to one specific vCPU on each NUMA node (4, 16) | 35440, 35667, 35748, 35578 | Core switches limited + NUMA balancing disabled. Looks well balanced now. |

Conclusion

Through the experiments above we saw how the same test case can produce results ranging from 24K to 40K TPS depending on where and how the client and server threads run.

If your benchmark really targets lower scalability (a small number of threads), you should watch out for core allocation.

Common strategies for reducing noise are the average of N runs, the median of N runs, the best of N runs, and so on. But if the variance is this high, none of them works well. I tend to average N shorter runs, so that the numbers stabilize probabilistically. I am not sure this is the best approach, but it seems to reduce the noise to a reasonable level. Fewer samples (a smaller N) increase the noise, so I would recommend at least N = 9 with each run taking (60 + 10 warm-up) seconds; a 630-second test-case run is usually enough to reduce the jitter.
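A rough sketch of that "average of N runs" strategy (not from the original post; the sysbench invocation and output parsing are assumptions that depend on your setup and sysbench version):

```bash
#!/usr/bin/env bash
# Run the same point-select test N times and report the average TPS.
N=9
total=0
for i in $(seq 1 "$N"); do
    # Extract the "per sec." transactions figure from the sysbench report.
    tps=$(numactl --physcpubind=0,1,12,13 sysbench oltp_point_select \
              --mysql-socket=/tmp/mysql.sock --mysql-user=sbtest --mysql-password=sbtest \
              --threads=2 --time=60 run |
          awk '/transactions:/ {gsub(/[(]/, "", $3); print $3}')
    total=$(echo "$total + $tps" | bc)
done
echo "average tps over $N runs: $(echo "scale=2; $total / $N" | bc)"
```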

If you have a better approach, please share it with the community.

By the way, the story gets more complex as the number of NUMA nodes grows and with the higher core counts on ARM. That is a topic for the future; if anyone has studied it, I would love to understand it.

If you have more questions, let me know and I will try to answer them.

