kunpengcompute / kunpengcompute.github.io

Kunpeng Tech Blog: https://kunpengcompute.github.io/
Apache License 2.0

PostgreSQL performance face-off: ARM vs. x86 #31

Open bzhaoopenstack opened 4 years ago

bzhaoopenstack commented 4 years ago

Translator: bzhaoopenstack
Author: Amit Dattatray Khandekar
Original post: https://amitdkhan-pg.blogspot.com/2020/05/postgresql-on-arm.html

A performance comparison of PostgreSQL on ARM and x86, tested by Amit, our team's in-house PostgreSQL expert.




PostgreSQL on ARM

In my last blog post, I wrote that applications that have been running on x86 might need some adaptation if they are to run on a different architecture such as ARM. Let's see what that means exactly.

Recently I have been playing around with the PostgreSQL RDBMS on an ARM64 machine. A few months back, I didn't even know whether it could be compiled on ARM, being oblivious to the fact that we have had a regular build farm member for ARM64 for quite a while. And now even the PostgreSQL apt repository has started making PostgreSQL packages available for the ARM64 architecture. But my real confidence in the reliability of PostgreSQL-on-ARM came after I tested it with different kinds of scenarios.

I started with read-only pgbench tests and compared the results on the x86_64 and ARM64 VMs available to me. The aim was not to compare any specific CPU implementation. The idea was to find scenarios where PostgreSQL on ARM falls further behind x86 than it does in other scenarios.

Test Configuration

ARM64 VM: Ubuntu 18.04.3; 8 CPUs; CPU frequency: 2.6 GHz; available RAM: 11GB
x86_64 VM: Ubuntu 18.04.3; 8 CPUs; CPU frequency: 3.0 GHz; available RAM: 11GB

The following was common to all tests:

PostgreSQL parameters changed: shared_buffers = 8GB
pgbench scale factor: 30
pgbench command:

    for num in 2 4 6 8 10 12 14
    do
        pgbench [-S] -c $num -j $num -M prepared -T 40
    done

What this means is: pgbench is run with an increasing number of parallel clients, from 2 to 14.
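For readers who prefer to drive the sweep from a script, here is a minimal Python sketch of the same loop. It only builds and prints the command lines; the helper names are my own, and actually running each command (for example with subprocess.run) assumes pgbench is on the PATH and the target database has already been initialized with scale factor 30.

```python
import shlex

# Client counts used in the sweep described above.
CLIENT_COUNTS = [2, 4, 6, 8, 10, 12, 14]


def pgbench_command(clients, select_only, duration_s=40):
    """Build the argument list for one pgbench run (hypothetical helper)."""
    args = ["pgbench"]
    if select_only:
        args.append("-S")  # built-in select-only script
    args += [
        "-c", str(clients),   # number of client connections
        "-j", str(clients),   # number of pgbench worker threads
        "-M", "prepared",     # use prepared statements
        "-T", str(duration_s),  # run time in seconds
    ]
    return args


def sweep(select_only):
    """Return one command per client count, in increasing order."""
    return [pgbench_command(n, select_only) for n in CLIENT_COUNTS]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in sweep(select_only=True):
        print(shlex.join(cmd))
```

Passing select_only=False drops the -S flag, which corresponds to the default tpcb-like run.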

Select-only workload

The pgbench -S option is used for the read-only workload.

img

Between 2 and 4 threads, x86 performance is 30% higher than ARM, and the difference keeps growing. Between 4 and 6 the curves flatten a bit, and between 6 and 8 they suddenly become steep. After 8, I expected them to flatten out or dip, because the machines had 8 CPUs. But there is more to it. The pgbench clients were running on the same machines as the servers, and with the CPUs fully utilized, the clients took around 20% of the CPU time. So they start to interfere from 6 threads onward. In spite of that, there is a steep rise between 6 and 8 for both ARM and x86. I don't understand this yet, but possibly it has something to do with the Linux scheduler and the interaction between the pgbench clients and the servers.

Note that the curve shape is mostly similar on x86 and ARM, so this behaviour is not specific to either architecture. One difference, though, is that the ARM curve dips a bit more from 8 threads onward. Also, between 6 and 8, the sudden jump in transactions is not as steep for ARM as for x86. So the end result in this scenario is: as the CPUs become busier, PostgreSQL on ARM lags further behind x86. Let's see what happens if we remove the interference created by the pgbench clients.

select exec_query_in_loop(n)

So, to get rid of the noise caused by running both client and server on the same machine, I arranged to test exactly what I intended to test: query performance. For this, the pgbench clients could run on different machines, but that might introduce a different kind of noise: network latency. So instead, I wrote a PostgreSQL C-language user-defined function that keeps executing, in a loop, the exact same SQL query that this pgbench test runs. I executed this function using a pgbench custom script. Now the pgbench clients are mostly idle. Also, this leaves the commit/rollback time largely out of the picture, because most of the time is spent inside the C function.

pgbench custom script:

    select exec_query_in_loop(n);

where n is the number of times the pgbench query is executed on the server in a loop. The loop query is the one normally executed with the pgbench -S option:

    SELECT abalance FROM pgbench_accounts WHERE aid = $1

Check the details in exec_query_in_loop().
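As a rough illustration of how such a custom script can be wired up, the Python sketch below writes the one-line script to a file and builds a pgbench command that runs it through the -f (custom script file) option. The file name loop.sql and the helper names are hypothetical, and the exec_query_in_loop() function is assumed to be already installed on the server.

```python
from pathlib import Path


def write_custom_script(path, n):
    """Write the one-statement pgbench script and return its text."""
    script = f"select exec_query_in_loop({n});\n"
    Path(path).write_text(script)
    return script


def custom_script_command(script_path, clients, duration_s=40):
    """Build a pgbench command that runs the custom script file."""
    return [
        "pgbench",
        "-f", str(script_path),  # run this script instead of a built-in one
        "-c", str(clients),
        "-j", str(clients),
        "-T", str(duration_s),
    ]


if __name__ == "__main__":
    # loop.sql is a hypothetical file name chosen for this example.
    write_custom_script("loop.sql", 100000)
    print(" ".join(custom_script_command("loop.sql", clients=8)))
```

Because each pgbench transaction is now a single call that loops on the server, the client spends almost all of its time waiting rather than competing for CPU.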

img

Now you see a very different curve. For both curves, up to 8 threads, the transaction rate is linearly proportional to the number of threads. After 8, as expected, the transaction rate doesn't rise. And it has not dipped, even for ARM. PostgreSQL is consistently around 35% slower on ARM compared to x86. This doesn't sound that bad when we consider that the ARM CPU frequency is 2.6 GHz whereas the x86 one is 3.0 GHz. Note that the transaction rate is in single digits, because the function exec_query_in_loop(n) is executed with n=100000.

This experiment also shows that the previous results using the built-in pgbench script have to do with pgbench client interference, and that the dip in the ARM curve for contended threads is not caused by contention in the server. Note that the transaction rates are calculated on the client side. So even when the results of a query are ready, there may be some delay in the client requesting the results, calculating the timestamp, etc., especially in high-contention scenarios.
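One way to put the 35% figure next to the clock difference is to normalize throughput by frequency. The sketch below does that arithmetic under a simplifying assumption that is mine, not the post's: that throughput scales linearly with clock frequency. The function names and the sample rate of 5 transactions per second are hypothetical.

```python
def relative_throughput(slowdown_pct):
    """Throughput of the slower platform as a fraction of the faster one."""
    return 1.0 - slowdown_pct / 100.0


def per_ghz_ratio(throughput_ratio, freq_slow_ghz, freq_fast_ghz):
    """Throughput ratio after dividing out the clock-frequency ratio."""
    return throughput_ratio / (freq_slow_ghz / freq_fast_ghz)


def queries_per_second(tps, n):
    """Each transaction runs the query n times inside the server."""
    return tps * n


arm_vs_x86 = relative_throughput(35)              # ARM at 0.65 of x86
normalized = per_ghz_ratio(arm_vs_x86, 2.6, 3.0)  # ARM per GHz vs x86 per GHz

if __name__ == "__main__":
    print(f"raw ratio:     {arm_vs_x86:.2f}")   # prints 0.65
    print(f"per-GHz ratio: {normalized:.2f}")   # prints 0.75
    # Hypothetical single-digit rate, with n=100000 as in the test.
    print(queries_per_second(5, 100000))        # prints 500000
```

So under this assumption, roughly 10 of the 35 percentage points would be explained by the lower clock, leaving ARM at about three quarters of the x86 throughput per GHz.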

select exec_query_in_loop(n) - PL/pgSQL function

Before using the user-defined C function, I had earlier used a PL/pgSQL function to do the same work. There, I stumbled across a different kind of performance behaviour.

img

Here, PostgreSQL on ARM is around 65% slower than on x86, regardless of the number of threads. Compared with the previous results that used a C function, it is clear that PL/pgSQL execution is remarkably slower on ARM, for some reason. I checked the perf report, but more or less the same hotspot functions show up on both ARM and x86. Still, for some reason, anything executed inside a PL/pgSQL function becomes much slower on ARM than on x86.

I have yet to check the cache misses to see whether there are more of them on ARM. As of this writing, what I did was this (some PostgreSQL internals here): exec_stmt_foreach_a() calls exec_stmt(). I cloned exec_stmt() to exec_stmt_clone(), and made exec_stmt_foreach_a() call exec_stmt_clone() instead. This sped up the overall execution, but it sped up 20% more on ARM. Why just this change caused this behaviour is kind of a mystery to me as of now. Maybe it has to do with the location of a function in the program; I'm not sure.
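To see how much of the gap is specific to the PL/pgSQL path, the two slowdown figures (about 35% with the C function, about 65% with PL/pgSQL) can be compared as throughput ratios. This is only back-of-the-envelope arithmetic on the rounded percentages quoted in the post; the helper name is hypothetical.

```python
def arm_to_x86_ratio(slowdown_pct):
    """ARM throughput as a fraction of x86 throughput."""
    return 1.0 - slowdown_pct / 100.0


c_ratio = arm_to_x86_ratio(35)        # C function loop: ARM at 0.65 of x86
plpgsql_ratio = arm_to_x86_ratio(65)  # PL/pgSQL loop: ARM at 0.35 of x86

# How much further ARM falls behind when the loop runs in PL/pgSQL.
extra = plpgsql_ratio / c_ratio

if __name__ == "__main__":
    print(f"{extra:.2f}")  # prints 0.54
```

In other words, relative to what the C function results would suggest, ARM retains only a little over half of its throughput once the loop runs inside PL/pgSQL.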

Updates

With default options, pgbench runs the built-in tpcb-like script, which performs updates on multiple tables.

img

Here, the transaction rate is only around 1-10% lower on ARM compared to x86. This is probably because a major portion of the time goes into waiting for locks and into disk writes during commits. And the disks I used are non-SSD disks. But overall, it looks like updates in PostgreSQL work well on ARM.

Next, I am going to test with aggregate queries, partitions, a high number of CPUs (32/64/128), larger RAM and a higher scale factor, to see how PostgreSQL scales, in relative terms, on the two platforms with large resources.

Conclusion

We saw that the PostgreSQL RDBMS works quite robustly on ARM64. While it is tricky to compare performance across two different platforms, we could still identify the areas where it is not doing well by comparing patterns of behaviour in different scenarios on the two platforms.


code33 commented 3 years ago

Could the poster tell me which exact ARM and x86 models these are?

bzhaoopenstack commented 3 years ago

@code33 Hi,
ARM64 VM: Ubuntu 18.04.3; 8 CPUs; CPU frequency: 2.6 GHz; available RAM: 11GB
x86_64 VM: Ubuntu 18.04.3; 8 CPUs; CPU frequency: 3.0 GHz; available RAM: 11GB

The x86 CPU is an Intel Cascade Lake at 3.0 GHz. The ARM one is a Kunpeng 920.

All tests were run on VMs on Huawei Cloud. Thanks.