Performance fluctuates too much under the larson benchmark

maokelong commented 7 years ago

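larson is a benchmark proposed by Paul Larson that simulates a server application responding to client requests. Each thread in larson receives a set of objects, then randomly frees memory in it or allocates new memory and adds it to the set, and finally passes the set on to the next thread.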
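The access pattern described above can be sketched roughly as follows. This is a simplified, single-process illustration, not the real larson source: the names `worker_round` and `simulate`, the size bounds, and the sequential hand-off (instead of real threads) are all assumptions made for the example.

```python
import random


def worker_round(objects, rng, ops, min_size=10, max_size=1000):
    """One 'thread' turn: replace randomly chosen slots in the object set."""
    for _ in range(ops):
        slot = rng.randrange(len(objects))
        objects[slot] = None  # stands in for free() of the old object
        # stands in for malloc() of a new object of random size
        objects[slot] = bytearray(rng.randint(min_size, max_size))
    return objects


def simulate(num_threads=6, set_size=1000, ops_per_round=10000, seed=1):
    """Pass one object set through num_threads workers, one after another."""
    rng = random.Random(seed)
    objects = [bytearray(rng.randint(10, 1000)) for _ in range(set_size)]
    total_ops = 0
    for _ in range(num_threads):
        # Hand the whole set to the "next thread". In the real benchmark this
        # hand-off is why a thread may free memory that another one allocated.
        objects = worker_round(objects, rng, ops_per_round)
        total_ops += ops_per_round
    return objects, total_ops


if __name__ == "__main__":
    objs, total = simulate()
    print(len(objs), total)
```

Because the set is handed over between workers, an allocator sees frees of objects that were allocated by a different thread, which is the case the remote-free path exists for.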

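After preloading CMalloc (PRE_LOAD) and running larson, the results fluctuate heavily: with 6 threads, the gap between the best and worst results can reach 5x, and the total number of threads the benchmark reports spawning ranges from 11 to 52. (The more total threads, the more operations and the better the performance; the smaller the difference between runs, the more stable the performance.)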

maokelong commented 7 years ago

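Given larson's characteristics, cmalloc frequently performs heap allocation, heap reclamation, remote frees, and memory requests under this benchmark. Since cmalloc performs well under threadtest, a multithreaded benchmark that involves neither heap reclamation nor remote frees, I believe the heap reclamation or remote-free code is poorly optimized, or even has a bug.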

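The scalloc paper notes that only 1% of all free operations in larson are remote frees, so I decided to first comment out the remote-free path and judge from the effect whether it contains a problem (although I do not hold out much hope).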

maokelong commented 7 years ago

[root@localhost scalloc-artifact]# run/larson-single.sh allocators/libcmalloc.so 6 "10 7 8 1000 10000 1"
{ "threads": 6 , "ops":  46946140 , "rss": 2502 }
[root@localhost scalloc-artifact]# run/larson-single.sh allocators/libcmalloc.so 6 "10 7 8 1000 10000 1"
{ "threads": 6 , "ops":  38956482 , "rss": 2481 }
[root@localhost scalloc-artifact]# run/larson-single.sh allocators/libcmalloc.so 6 "10 7 8 1000 10000 1"
{ "threads": 6 , "ops":  51588028 , "rss": 2746 }
[root@localhost scalloc-artifact]# run/larson-single.sh allocators/libcmalloc.so 6 "10 7 8 1000 10000 1"
{ "threads": 6 , "ops":  40138770 , "rss": 2526 }

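With the remote-free path disabled, the results are still quite unstable. I am now focusing on whether the heap allocation and heap reclamation operations have a problem.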
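To put a number on "unstable", one option is to parse the JSON lines printed by the runs and compute the best/worst ratio and the coefficient of variation. The sketch below uses the four result lines from this comment as sample input; the helper names `parse_ops` and `spread` are made up for the example and are not part of the scalloc-artifact scripts.

```python
import json
import statistics

# Result lines copied from the four runs in this comment.
SAMPLE_OUTPUT = """\
{ "threads": 6 , "ops":  46946140 , "rss": 2502 }
{ "threads": 6 , "ops":  38956482 , "rss": 2481 }
{ "threads": 6 , "ops":  51588028 , "rss": 2746 }
{ "threads": 6 , "ops":  40138770 , "rss": 2526 }
"""


def parse_ops(text):
    """Collect the "ops" value from every JSON result line, skipping other lines."""
    ops = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("{"):
            ops.append(json.loads(line)["ops"])
    return ops


def spread(ops):
    """Summarize run-to-run variation of a list of ops counts (needs >= 2 runs)."""
    mean = statistics.mean(ops)
    return {
        "runs": len(ops),
        "best_over_worst": max(ops) / min(ops),
        # sample standard deviation relative to the mean
        "cv": statistics.stdev(ops) / mean,
    }


if __name__ == "__main__":
    print(spread(parse_ops(SAMPLE_OUTPUT)))
```

Applying the same summary to each batch of runs in this thread gives a consistent way to compare experiments, although with only four to eight runs per batch the figures are rough.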

maokelong commented 7 years ago

[root@localhost scalloc-artifact]# run/larson-single.sh allocators/libcmalloc.so 6 "10 7 8 1000 10000 1"
{ "threads": 6 , "ops":  31049264 , "rss": 2738 }
[root@localhost scalloc-artifact]# run/larson-single.sh allocators/libcmalloc.so 6 "10 7 8 1000 10000 1"
{ "threads": 6 , "ops":  43084911 , "rss": 2988 }
[root@localhost scalloc-artifact]# run/larson-single.sh allocators/libcmalloc.so 6 "10 7 8 1000 10000 1"
{ "threads": 6 , "ops":  35864027 , "rss": 2728 }
[root@localhost scalloc-artifact]# run/larson-single.sh allocators/libcmalloc.so 6 "10 7 8 1000 10000 1"
{ "threads": 6 , "ops":  41844723 , "rss": 3068 }

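After commenting out the code related to heap reuse, the results are still very unstable. This largely rules out errors in heap allocation/reclamation and remote free/reuse; the cause may be that some module inside cmalloc has unstable performance. Since the code has only a few complex places, I am now focusing on whether the state machine is at fault.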

maokelong commented 7 years ago

[root@localhost scalloc-artifact]# run/larson-single.sh allocators/libcmalloc.so 6 "10 7 8 1000 10000 1"
{ "threads": 6 , "ops":  34139138 , "rss": 2448 }
[root@localhost scalloc-artifact]# run/larson-single.sh allocators/libcmalloc.so 6 "10 7 8 1000 10000 1"
{ "threads": 6 , "ops":  26410154 , "rss": 2318 }
[root@localhost scalloc-artifact]# run/larson-single.sh allocators/libcmalloc.so 6 "10 7 8 1000 10000 1"
{ "threads": 6 , "ops":  48182009 , "rss": 2614 }
[root@localhost scalloc-artifact]# run/larson-single.sh allocators/libcmalloc.so 6 "10 7 8 1000 10000 1"
{ "threads": 6 , "ops":  51592579 , "rss": 2668 }
[root@localhost scalloc-artifact]# run/larson-single.sh allocators/libcmalloc.so 6 "10 7 8 1000 10000 1"
{ "threads": 6 , "ops":  34106185 , "rss": 2512 }

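Commented out the code that freezes a superblock and the code that returns partially frozen superblocks to the globalpool; the results are still unstable. Starting to feel like giving up.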

maokelong commented 7 years ago

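Disabled inlining for all functions and sampled the benchmark with Intel Amplxe. Even though the results differ from run to run, sometimes enormously, the execution frequency of each function inside cmalloc fluctuates very little in the samples! This indicates that the difference between runs is not mainly caused by a different distribution of branches taken in the benchmark. My bold guess is that it is caused by fluctuations in the cache hit rate. I modified the layout of the super metadata block to improve its cache behavior, and the results became slightly more stable!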

[root@localhost scalloc-artifact]# run/larson-single.sh allocators/libcmalloc.so 6 "10 7 8 1000 10000 1"
{ "threads": 6 , "ops":  39997728 , "rss": 2469 }
[root@localhost scalloc-artifact]# run/larson-single.sh allocators/libcmalloc.so 6 "10 7 8 1000 10000 1"
{ "threads": 6 , "ops":  56191736 , "rss": 2788 }
[root@localhost scalloc-artifact]# run/larson-single.sh allocators/libcmalloc.so 6 "10 7 8 1000 10000 1"
{ "threads": 6 , "ops":  57482730 , "rss": 2706 }
[root@localhost scalloc-artifact]# run/larson-single.sh allocators/libcmalloc.so 6 "10 7 8 1000 10000 1"
{ "threads": 6 , "ops":  35887992 , "rss": 2545 }
[root@localhost scalloc-artifact]# run/larson-single.sh allocators/libcmalloc.so 6 "10 7 8 1000 10000 1"
{ "threads": 6 , "ops":  38956931 , "rss": 2390 }
[root@localhost scalloc-artifact]# run/larson-single.sh allocators/libcmalloc.so 6 "10 7 8 1000 10000 1"
{ "threads": 6 , "ops":  60337100 , "rss": 2705 }
[root@localhost scalloc-artifact]# run/larson-single.sh allocators/libcmalloc.so 6 "10 7 8 1000 10000 1"
{ "threads": 6 , "ops":  53683998 , "rss": 2635 }
[root@localhost scalloc-artifact]# run/larson-single.sh allocators/libcmalloc.so 6 "10 7 8 1000 10000 1"
{ "threads": 6 , "ops":  57253854 , "rss": 2713 }
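This comment does not say what exactly changed in the layout. One common cache-oriented change is to keep fields written by the owning thread and fields written by other threads on separate cache lines. The sketch below models that idea with `ctypes` only to show the resulting offsets; every field name is hypothetical, the 64-byte line size is an assumption, and this is not cmalloc's actual structure.

```python
import ctypes

CACHE_LINE = 64  # assumed cache-line size in bytes (typical on x86-64)


class LocalFields(ctypes.Structure):
    # Hypothetical fields touched only by the owning thread.
    _fields_ = [
        ("free_list", ctypes.c_void_p),
        ("free_count", ctypes.c_uint32),
        ("state", ctypes.c_uint32),
    ]


class RemoteFields(ctypes.Structure):
    # Hypothetical fields written by other threads on a remote free.
    _fields_ = [
        ("remote_free_list", ctypes.c_void_p),
        ("remote_count", ctypes.c_uint32),
    ]


class PaddedLocal(ctypes.Structure):
    # Pad the owner-only group out to one full cache line.
    _fields_ = [
        ("body", LocalFields),
        ("_pad", ctypes.c_char * (CACHE_LINE - ctypes.sizeof(LocalFields))),
    ]


class PaddedRemote(ctypes.Structure):
    # Pad the remotely written group out to one full cache line.
    _fields_ = [
        ("body", RemoteFields),
        ("_pad", ctypes.c_char * (CACHE_LINE - ctypes.sizeof(RemoteFields))),
    ]


class SuperMeta(ctypes.Structure):
    # The two groups start 64 bytes apart, so they never share a line
    # provided the structure itself is allocated at a 64-byte-aligned address.
    _fields_ = [("local", PaddedLocal), ("remote", PaddedRemote)]


if __name__ == "__main__":
    print(ctypes.sizeof(SuperMeta), SuperMeta.local.offset, SuperMeta.remote.offset)
```

The padding costs memory per metadata block, so whether such a split pays off depends on how often the two groups are actually written concurrently.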

maokelong commented 7 years ago

Found that in the l2ran function, cmalloc's LLC (last-level cache) miss count is 4 times that of glibc, and it also consumes more memory bandwidth.

maokelong / cmalloc

Performance fluctuates too much under the larson benchmark #2