kunpengcompute / kunpengcompute.github.io

Kunpeng Tech Blog: https://kunpengcompute.github.io/
Apache License 2.0

Running MySQL on selected NUMA nodes #36

Open bzhaoopenstack opened 4 years ago

bzhaoopenstack commented 4 years ago

Translator: bzhaoopenstack  Author: Krunal Bauskar  Original post: https://mysqlonarm.github.io/Running-MySQL-on-Selected-NUMA-nodes/

Running MySQL on selected NUMA nodes: a full hands-on walkthrough. Give it a try.


"Running MySQL on selected NUMA nodes" looks pretty straightforward, but unfortunately it isn't. Recently I ran into a situation that required running MySQL on 2 of the 4 NUMA nodes.

Naturally, the first thing I tried was to restrict the CPU/core set with numactl --physcpubind, selecting only the CPUs/cores of the chosen NUMA nodes. MySQL was configured with innodb_numa_interleave=1, so I expected it to allocate memory from those NUMA nodes only (since I had restricted which CPUs/cores it could use).
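As a rough sketch, that first attempt looked something like the commands below. The core list and the mysqld invocation are my assumptions for a box where nodes 0 and 1 own cores 0-31; check the real layout with numactl --hardware first.

numactl --hardware                                        # inspect the node/core topology
numactl --physcpubind=0-31 ./bin/mysqld --defaults-file=my.cnf &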

Surprise-1:

MySQL uses numa_all_nodes_ptr->maskp, which means all nodes are opted in for memory allocation even though the CPU task-set is limited to 2 NUMA nodes.

Some looking around pointed me to these two issues from Daniel Black:

The issues propose switching to the more logical numa_get_mems_allowed(). According to the documentation, it should return the mask of nodes from which the process is allowed to allocate memory.

ref-from-doc: numa_get_mems_allowed() returns the mask of nodes from which the process is allowed to allocate memory in it's current cpuset context.

So I decided to apply the patch and proceed with testing.
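Conceptually, the patch just swaps the source of the node mask in the InnoDB calls shown in the diff later in this post. A minimal sketch of the idea (not the exact upstream change) would be:

-  set_mempolicy(MPOL_INTERLEAVE, numa_all_nodes_ptr->maskp, numa_all_nodes_ptr->size);
+  struct bitmask *allowed = numa_get_mems_allowed();
+  set_mempolicy(MPOL_INTERLEAVE, allowed->maskp, allowed->size);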

Surprise-2:

Just applying the patch and relying on the restricted CPU/core set didn't help, so I thought of trying the membind option.

Surprise-3:

So now the command looks like:

numactl --physcpubind= --membind=0,1

This time I fully expected memory to be allocated only from the chosen NUMA nodes, but it still wasn't. Memory was allocated from all 4 nodes.
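To see where the memory actually lands, per-node usage can be checked with numastat (my own choice of tool; the post does not say how this was verified):

numastat -p $(pidof mysqld)    # per-NUMA-node memory breakdown for the mysqld process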

Some more searching through the documentation suggested that numa_all_nodes_ptr looks at the Mems_allowed field, as described below:

numa_all_nodes_ptr: The set of nodes to record is derived from /proc/self/status, field "Mems_allowed". The user should not alter this bitmask.
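That field can be inspected directly for any running process; for example (the pidof lookup assumes a single mysqld instance):

grep Mems_allowed /proc/self/status               # mask for the current shell
grep Mems_allowed /proc/$(pidof mysqld)/status    # mask for a running mysqld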

And as Alexey Kopytov pointed out in PR #138, numa_all_nodes_ptr and numa_get_mems_allowed() read the same NUMA mask.

This suggests that either numa_get_mems_allowed() is broken or the documentation needs to be updated.

Just for completeness, I also tried numactl --interleave, but that didn't help either.
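For reference, that attempt was along these lines (node list and mysqld invocation assumed):

numactl --interleave=0,1 ./bin/mysqld --defaults-file=my.cnf &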

Fact validation:

So I decided to validate the above behavior with a simple program (outside MySQL).

// Prints the raw node masks seen by libnuma; build with: g++ <this file> -lnuma
#include <iostream>
#include <numa.h>
#include <numaif.h>
using namespace std;
int main()
{
  // Mask libnuma records at startup from /proc/self/status "Mems_allowed".
  cout << *numa_all_nodes_ptr->maskp << endl;
  // Mask of nodes this process is allowed to allocate memory from, per the docs.
  cout << *numa_get_mems_allowed()->maskp << endl;
  return 0;
}

numactl --membind=0-1 ./a.out
15
15

It is pretty clear that both return the same mask value, even though numa_get_mems_allowed() should return only the nodes memory may be allocated from: 15 is binary 1111, i.e. all four nodes, despite --membind=0-1 restricting us to nodes 0 and 1.

Workaround:

I desperately needed a solution, so I tried a simple workaround: feeding the mask manually (I will keep following up on the numactl behavior with the OS vendor). This approach finally worked, and memory is now allocated only from the selected NUMA nodes.

+const unsigned long numa_mask = 0x3;

 struct set_numa_interleave_t {
   set_numa_interleave_t() {
     if (srv_numa_interleave) {
       ib::info(ER_IB_MSG_47) << "Setting NUMA memory policy to"
                                 " MPOL_INTERLEAVE";
-      if (set_mempolicy(MPOL_INTERLEAVE, numa_all_nodes_ptr->maskp,
+      if (set_mempolicy(MPOL_INTERLEAVE, &numa_mask,
                         numa_all_nodes_ptr->size) != 0) {
         ib::warn(ER_IB_MSG_48) << "Failed to set NUMA memory"
                                   " policy to MPOL_INTERLEAVE: "
@@ -1000,7 +1001,7 @@ static buf_chunk_t *buf_chunk_init(
 #ifdef HAVE_LIBNUMA
   if (srv_numa_interleave) {
     int st = mbind(chunk->mem, chunk->mem_size(), MPOL_INTERLEAVE,
-                   numa_all_nodes_ptr->maskp, numa_all_nodes_ptr->size,
+                   &numa_mask, numa_all_nodes_ptr->size,
                    MPOL_MF_MOVE);
     if (st != 0) {
       ib::warn(ER_IB_MSG_54) << "Failed to set NUMA memory policy of"

(Of course, this requires rebuilding from source and is not an option for binary/package users. Well, there is one; check the following section.)
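The same manual-mask idea can be exercised outside MySQL. Below is a minimal standalone sketch, not MySQL code: the 0x3 mask (nodes 0 and 1), the 1 MiB region size, and the error handling are illustrative assumptions.

// Standalone sketch of the workaround: interleave allocations across NUMA
// nodes 0 and 1 only, using a hand-built mask. Build with: g++ <file> -lnuma
#include <cerrno>
#include <cstring>
#include <iostream>
#include <numa.h>
#include <numaif.h>
#include <sys/mman.h>

int main() {
  const unsigned long numa_mask = 0x3;      // bits 0 and 1 => NUMA nodes 0 and 1
  // Width of our single-word mask in bits; the MySQL hunk above reuses
  // numa_all_nodes_ptr->size instead.
  const unsigned long maxnode = 8 * sizeof(numa_mask);

  // Process-wide default policy, as the set_mempolicy() hunk does.
  if (set_mempolicy(MPOL_INTERLEAVE, &numa_mask, maxnode) != 0)
    std::cerr << "set_mempolicy failed: " << std::strerror(errno) << std::endl;

  // Per-region policy, as the mbind() hunk does.
  const size_t len = 1 << 20;               // 1 MiB, assumed chunk size
  void *mem = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (mem == MAP_FAILED) {
    std::cerr << "mmap failed: " << std::strerror(errno) << std::endl;
    return 1;
  }
  if (mbind(mem, len, MPOL_INTERLEAVE, &numa_mask, maxnode, MPOL_MF_MOVE) != 0)
    std::cerr << "mbind failed: " << std::strerror(errno) << std::endl;

  std::memset(mem, 0, len);                 // touch the pages so they get placed
  return 0;
}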

But then why didn't you use ... ?

Naturally, most of you may suggest avoiding all this by turning innodb_numa_interleave back OFF and using membind. Of course that approach works, but it is slightly different: with membind, all memory the server allocates is bound by the restriction, whereas innodb_numa_interleave applies only while the buffer pool is allocated. It may serve a specific purpose, but the two are not directly comparable.
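For completeness, a sketch of that alternative (paths and node numbers are illustrative):

# my.cnf
[mysqld]
innodb_numa_interleave = OFF

# bind all of mysqld's memory to nodes 0 and 1
numactl --membind=0,1 ./bin/mysqld --defaults-file=my.cnf &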

Checking the effect of a complete interleave vs. innodb_numa_interleave is already on my to-do list.

Conclusion

Balanced distribution across NUMA nodes has multiple aspects: core selection, memory allocation, thread placement (equally across the selected NUMA nodes), and so on. Plenty of surprising and exciting things remain to explore.

If you have more questions or queries, reach out to me and I will try to answer them.
