Closed wanghenshui closed 7 months ago
Option Soup: the subtle pitfalls of combining compiler flags https://hacks.mozilla.org/2024/01/option-soup-the-subtle-pitfalls-of-combining-compiler-flags/ 傻逼locale问题,虽然你是静态连接libstdcxx-static,但是locale并不static
https://discourse.llvm.org/t/rfc-upstreaming-clangir/76587/19
之前聊到的MLIR 在c/c++上的落地 CIR准备合入到LLVM
gcc 7.3 bug一例 class template argument deduction fails in new-expression https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85883
template <typename T1, typename T2>
struct Bar
{
Bar(T1, T2) { }
};
int main()
{
auto x = Bar(1, 2);
auto y = new Bar(3, 4);
auto z = new Bar{3, 4};
}
Unexpected Ways Memory Subsystem Interacts with Branch Prediction
https://johnnysswlab.com/unexpected-ways-memory-subsystem-interacts-with-branch-prediction/
int binary_search(int* array, int number_of_elements, int key) {
int low = 0, high = number_of_elements-1, mid;
while(low <= high) {
mid = (low + high)/2;
if (st == search_type::REGULAR) {
if(array[mid] < key)
low = mid + 1;
else if(array[mid] == key)
return mid;
else
high = mid-1;
}
if (st == search_type::CONDITIONAL_MOVE) {
int middle = array[mid];
if (middle == key) {
return mid;
}
int new_low = mid + 1;
int new_high = mid - 1;
__asm__ (
"cmp %[array_middle], %[key];"
"cmovae %[new_low], %[low];"
"cmovb %[new_high], %[high];"
: [low] "+&r"(low), [high] "+&r"(high)
: [new_low] "g"(new_low), [new_high] "g"(new_high), [array_middle] "g"(middle), [key] "g"(key)
: "cc"
);
}
if (st == search_type::ARITHMETIC) {
int middle = array[mid];
if (middle == key) {
return mid;
}
int new_low = mid + 1;
int new_high = mid - 1;
int condition = array[mid] < key;
int condition_true_mask = -condition;
int condition_false_mask = -(1 - condition);
low += condition_true_mask & (new_low - low);
high += condition_false_mask & (new_high - high);
}
}
return -1;
}
Array Size (in elements) | Original | Conditional Moves | Arithmetics |
---|---|---|---|
4 K | Runtime: 0.22 sInstr: 434 M CPI: 1.96Mem. Data Volume: 0.45 GB | Runtime: 0.14 sInstr: 785 MCPI: 0.728Mem. Data Volume: 0.25 GB | Runtime: 0.19 sInstr: 1.102 MCPI: 0.69Mem. Data Volume: 0.32 GB |
16 K | Runtime: 0.26 sInstr: 511 MCPI: 2.01Mem. Data Volume: 0.49 GB | Runtime: 0.19 sInstr: 928 MCPI: 0.77Mem. Data Volume: 0.39 GB | Runtime: 0.24 sInstr: 1.308 MCPI: 0.72Mem. Data Volume: 0.46 GB |
64 K | Runtime: 0.32 sInstr: 584 MCPI: 2.143Mem. Data Volume: 0.48 GB | Runtime: 0.24 sInstr: 1.064 MCPI: 0.90Mem. Data Volume: 0.25 GB | Runtime: 0.31Instr: 1.504CPI: 0.82Mem. Data Volume: 0.26 GB |
256 K | Runtime: 0.43 sInstr: 646 MCPI: 2.59Mem. Data Volume: 0.36 GB | Runtime: 0.39 sInstr: 1.199 MCPI: 1.28Mem. Data Volume: 0.32 GB | Runtime: 0.47 sInstr: 1.698 MCPI: 1.09Mem. Data Volume: 0.36 GB |
1 M | Runtime: 0.56 sInstr: 727 MCPI: 3.05Mem. Data Volume: 0.67 GB | Runtime: 0.59 sInstr: 1.333 MCPI: 1.72Mem. Data Volume: 0.59 GB | Runtime: 0.70 sInstr: 1.891 MCPI: 1.42Mem. Data Volume: 0.68 GB |
4 M | Runtime: 1.127 sInstr: 798 MCPI: 4.65Mem. Data Volume: 9.94 GB | Runtime: 1.48 sInstr: 1.467 MCPI: 3.1Mem. Data Volume: 3.75 GB | Runtime: 1.59 sInstr: 2.084 MCPI: 2.45Mem. Data Volume: 3.9 GB |
16 M | Runtime: 1.65 sInstr: 870 MCPI: 6.26Mem. Data Volume: 18.48 GB | Runtime: 2.75 sInstr: 1.601CPI: 4.16Mem. Data Volume: 6.95 GB | Runtime: 2.90 sInstr: 2.277 MCPI: 3.18Mem. Data Volume: 7.05 GB |
课外阅读
快排代码
static int partition(std::vector<float>& vector, int low, int high) {
float pivot = vector[high];
int i = (low - 1);
for (int j = low; j < high; j++) {
if (vector[j] <= pivot) {
i++;
std::swap(vector[i], vector[j]);
}
}
i = i + 1;
std::swap(vector[i], vector[high]);
return i;
}
static int partition(std::vector<float>& vector, int low, int high) {
float* vector_i = &vector[low];
float* vector_j = &vector[low];
float* vector_end = &vector[high];
__m128 pivot = _mm_load_ss(&vector[0] + high);
while(true) {
if (vector_j >= vector_end) break;
__m128 vec_i = _mm_load_ss(vector_i);
__m128 vec_j = _mm_load_ss(vector_j);
__m128 compare = _mm_cmplt_ss(vec_j, pivot); // if (vec_j < pivot)
__m128 new_vec_i = _mm_blendv_ps(vec_i, vec_j, compare);
__m128 new_vec_j = _mm_blendv_ps(vec_j, vec_i, compare);
int increment = _mm_extract_epi32(_mm_castps_si128(compare), 0) & 0x1;
_mm_store_ss(vector_i, new_vec_i);
_mm_store_ss(vector_j, new_vec_j);
vector_i += increment;
vector_j++;
}
std::swap(*vector_i, *vector_end);
return (vector_i - &vector[0]);
}
代码在这里 https://github.com/ibogosavljevic/johnysswlab/blob/master/2022-01-sort/
https://www.cppstories.com/2024/constexpr-number-parsing-cpp23/
有句讲句 from_chars接口有点难用
auto res = std::from_chars(str.data() + start, str.data() + str.size(), result);
if (res.ec == std::errc{}) {...}
可以用结构化绑定,更好看一点
c++26可以直接这么用
auto res = std::from_chars(str.data() + start, str.data() + str.size(), result);
if (res) { ... }
https://www.meetingcpp.com/blog/items/Converting-a-string-view-to-a-time-point-in-Cpp20.html
std::chrono::sys_seconds convertToTimePoint(std::string_view fmtstring)
{
std::chrono::sys_seconds syssec;
std::istringstream in{std::string{raw_data}};//raw_data is a string_view
in >> std::chrono::parse(fmtstring, syssec);
return syssec;
}
没有std::chrono::parse?
std::chrono::sys_seconds convertToTimePoint(std::string_view fmtstring)
{
std::chrono::sys_seconds syssec;
std::istringstream in{std::string{raw_data}};//raw_data is a string_view
std::tm tm = {};
in >> std::get_time(&tm, fmtstring.data());
std::time_t time = std::mktime(&tm);
if(in.good())
syssec = std::chrono::time_point_cast< std::chrono::sys_seconds::duration>(std::chrono::system_clock::from_time_t(time));
else throw std::runtime_error(std::string("Could not convert ") + raw_data + " to a date time value with format string " + fmtstring);
return syssec;
}
popcnt的前世今生?
最近群聊里传了一个面试题
实现统计1的个数(汉明权重 hammingWeight),使用popcnt的算法对硬件不友好,有无绕过的思路
显然这个哥们的第一个实现是
int hammingWeight_popcnt(uint64_t n) {
return __builtin_popcountll(n);
}
当然c++20页支持 https://en.cppreference.com/w/cpp/numeric/popcount 一行,觉得自己很帅,当然面试官不喜欢,提示不要用popcnt,所谓的对硬件不友好指的应该是部分硬件没有这个指令
又或者性能原因?难道GPU上的popcnt性能很差?按下不表
直接贴实现
int hammingWeight(uint64_t n) {
int ret = 0;
while (n) {
n &= n - 1;
ret++;
}
return ret;
}
其实开启 O2 加上 -march=native,大家都会生成相同的popcnt, 早在2016年Lemire大哥就发现了
https://lemire.me/blog/2016/05/23/the-surprising-cleverness-of-modern-compilers/
只开O2可能保守场景不会生成popcnt
如果不用popcnt,代码的性能和popcnt差距大吗?或者说,popcnt有危害吗?比如延迟高?
直接上llvm-mca分析 https://godbolt.org/z/odox8Wdr5
首先插入一个简单粗暴的教程,如何看懂llvm-mca https://llvm.org/docs/CommandGuide/llvm-mca.html
就是机器码分析器,模拟机器码执行效果,我们不用装llvm-mca,直接用godbolt内置的工具。代码已经生成好了
直接贴popcnt代码的结果吧
Iterations: 100
Instructions: 200
Total Cycles: 57
Total uOps: 200
Dispatch Width: 6
uOps Per Cycle: 3.51
IPC: 3.51
Block RThroughput: 0.5
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
1 1 0.25 popcnt rax, rdi
1 5 0.50 U ret
Resources:
[0] - Zn3AGU0
[1] - Zn3AGU1
[2] - Zn3AGU2
[3] - Zn3ALU0
[4] - Zn3ALU1
[5] - Zn3ALU2
[6] - Zn3ALU3
[7] - Zn3BRU1
[8] - Zn3FPP0
[9] - Zn3FPP1
[10] - Zn3FPP2
[11] - Zn3FPP3
[12.0] - Zn3FPP45
[12.1] - Zn3FPP45
[13] - Zn3FPSt
[14.0] - Zn3LSU
[14.1] - Zn3LSU
[14.2] - Zn3LSU
[15.0] - Zn3Load
[15.1] - Zn3Load
[15.2] - Zn3Load
[16.0] - Zn3Store
[16.1] - Zn3Store
Resource pressure per iteration:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12.0] [12.1] [13] [14.0] [14.1] [14.2] [15.0] [15.1] [15.2] [16.0] [16.1]
0.33 0.33 0.34 0.50 0.33 0.33 0.34 0.50 - - - - - - - 0.33 0.33 0.34 0.33 0.33 0.34 - -
Resource pressure by instruction:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12.0] [12.1] [13] [14.0] [14.1] [14.2] [15.0] [15.1] [15.2] [16.0] [16.1] Instructions:
- - - - 0.33 0.33 0.34 - - - - - - - - - - - - - - - - popcnt rax, rdi
0.33 0.33 0.34 0.50 - - - 0.50 - - - - - - - 0.33 0.33 0.34 0.33 0.33 0.34 - - ret
warning: found a return instruction in the input assembly sequence.
note: program counter updates are ignored.
先记下这几个数字
Iterations: 100 Instructions: 200 Total Cycles: 57 Total uOps: 200
Dispatch Width: 6 uOps Per Cycle: 3.51 IPC: 3.51 Block RThroughput: 0.5
重点参数 是IPC, uOps Per Cycle, 和 Block RThroughput (Block Reciprocal Throughput).
IPC就是模拟的总指令数字除以总cycle数 一般这个就表示吞吐了 Instructions / Total Cycles 显然这个值越高越好
Block RThroughput (Block Reciprocal Throughput) 就是Block Throughput(每个cycle能运行几次block)的倒数 就是 Total Cycles / Iterations 每次运行能用几个cycle的意思,显然,这个值越低越好
Instructions / Iterations 代表每次迭代能执行几次指令 显然 Instructions / Iterations / Block RThroughput = IPC 这个数比直接算IPC大点(有误差。。。。)你就当他是无影响的最大IPC吧 (循环数据引用问题可能导致影响。假设循环展开无影响)
uOps Per Cycle,就是模拟的微指令数总和除以总cycle数字 Total uOps/ Total Cycles ,这个和IPC含义差不多,显然这个值越高越好,但是要关注Dispatch Width - uOps Per Cycle 差值, Dispatch Width 表示最大发射指令的并行带宽,相当于资源限制了,uOps Per Cycle表示实际模拟使用的带宽,显然越接近Dispatch Width越说明资源受限制,利用率太高了,相当于CPU高了,需要找到瓶颈来源
剩下的是执行模拟以及在哪里卡了,具体分析可以用-bottleneck-analysis,得本地搞了godbolt貌似不能玩
好了,根据上面的godbolt结果,直接把数据差异对比一下
另外 网上搜到了google的两个实现,把数据补充上 https://godbolt.org/z/9nsczeT5c
int hammingWeightV2(uint64_t n) {
n -= (n >> 1) & 0x5555555555555555ULL;
n = ((n >> 2) & 0x3333333333333333ULL) + (n & 0x3333333333333333ULL);
return (((n + (n >> 4)) & 0xF0F0F0F0F0F0F0FULL)
* 0x101010101010101ULL) >> 56;
}
int hammingWeight_popcntV2(uint64_t n) {
int64_t count = 0;
asm("popcnt %1,%0" : "=r"(count) : "rm"(n) : "cc");
return count;
}
实现 | 编译器版本 | Dispatch Width | uOps/Cycle | IPC | Block RThroughput |
---|---|---|---|---|---|
popcnt | gcc 13.2 | 6 | 3.67 | 1.83 | 1.0 |
普通实现 | gcc 13.2 | 4 | 3.94 | 3.54 | 2.5 |
popcnt | clang 17.0.1 | 6 | 4.59 | 2.75 | 1.0 |
普通实现 | clang 17.0.1 | 6 | 4.78 | 3.83 | 1.7 |
popcnt v2 | gcc 13.2 | 6 | 2.14 | 2.14 | 1.3 |
手写SWAR | gcc 13.2 | 6 | 4.09 | 3.72 | 3.8 |
popcnt v2 | clang 17.0.1 | 6 | 3.67 | 1.83 | 1.0 |
手写SWAR | clang 17.0.1 | 6 | 4.09 | 3.72 | 3.8 |
能看出popcnt的Block RThroughput 低,这显然说明性能更好
然后看IPC和uOps/Cycle clang的明显比gcc的要高,但汇编说实话一个两行一个一行,这个没啥比较的意义了
重点和普通实现比,clang生成的汇编要比gcc好一点,Block RThroughput 低 IPC高,且没有特别接近Dispatch Width瓶颈
但说实话就差一个汇编这点差距根本比不出什么。只能大概说一下popcnt的汇编更少,性能更好而已
感觉SWAR这种看起来很屌, 但看mca分析感觉不太行 我跑了个qb压测,但是网站挂了,还需要本地跑一下
https://github.com/wanghenshui/little_bm/blob/dev/hamming_weight/hamming_weight.cc
qb测试来看,即使Block RThroughput 很高,但IPC也很高,性能反而非常好
不过还需要其他机器测试,我的nuc是zen3+,性能不错
github CI机器,SWAR和popcnt就差不多了。
Running ./bm_hamming_weight
Run on (4 X 2868.73 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x2)
L1 Instruction 32 KiB (x2)
L2 Unified 512 KiB (x2)
L3 Unified 32768 KiB (x1)
Load Average: 1.98, 0.54, 0.19
-----------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------
BM_hammingWeight_popcnt/0 17.5 ns 17.5 ns 39446954
BM_hammingWeight_popcnt/128 173 ns 173 ns 4056917
BM_hammingWeight_popcnt/256 342 ns 342 ns 2051152
BM_hammingWeight_popcnt/512 679 ns 679 ns 1032384
BM_hammingWeight_popcnt/1024 1354 ns 1354 ns 517551
BM_hammingWeight/0 124 ns 124 ns 5638895
BM_hammingWeight/128 2394 ns 2394 ns 280377
BM_hammingWeight/256 5511 ns 5511 ns 123293
BM_hammingWeight/512 12036 ns 12036 ns 58336
BM_hammingWeight/1024 25711 ns 25710 ns 27149
BM_hammingWeightV2/0/0 17.8 ns 17.8 ns 39494382
BM_hammingWeightV2/128 182 ns 182 ns 3848768
BM_hammingWeightV2/256 360 ns 360 ns 1943778
BM_hammingWeightV2/512 709 ns 709 ns 988825
BM_hammingWeightV2/1024 1418 ns 1418 ns 493319
家人们,需要你们的补充测试,各种机器, 复现代码https://github.com/wanghenshui/little_bm
话说回来,数1到底能干嘛?这里要引入汉明距离 编辑距离相关的概念
简单理解就是查diff 纠错码之类的效果
popcnt的来源 http://www.talkchess.com/forum3/viewtopic.php?t=38521
上个世纪60年代,计算机还属于大型机百花齐放的年代,Control Data Corporation公司的CDC 机器卖的不错,国际象棋也在用这个软件。他们的场景就是棋盘格确认位置,所以实现了popcnt类似的能力,算位置坐标,美国国家安全局(NSA)发现了他们有这个能力,他们的新机器CDC 6000,政府采购并要求加上这个功能,主要是为了类似汉明距离之类的信息统计,相当于变相hash,用来实现校对diff之类的能力,所以也被叫做NSA Instruction (NSA指令)
这个指令也是那个时代的特殊产物把,算力不行并没有高级的hash能力,只能通过数1模拟,后来CPU性能提升渐渐的都不支持了,然后后来部分CPU支持部分CPU不支持,到现代全都捡回来
现在的CPU也有很多不支持popcnt指令,以至于游戏客户端领域会有popcnt patch之类的玩意,给玩家打patch绕过popcnt
https://github.com/ogurets/popcnt_emulator
还有什么能用到数1?
指纹?安全领域,这种更多是汉明距离场景的推广
能用到bitmap的地方,不过使用bitmap不一定非得算总数
比如 Hash Array Mapped Tries 结合tries的压缩优点 + bitmap定位槽,
bitmap浪费所以要压缩一下,位运算躲不了数1场景
再比如 Succinct Data Structures terarkdb的memtable用的就这玩意,压缩率高
关于popcnt的信息我就收集到这么多的,大家感兴趣还可以补充一下
参考
boost charconv https://lists.boost.org/Archives/boost/2024/02/255821.php
把from_chars搬到c++11
我建议放弃c++11 ,2024了bro 文档 https://develop.charconv.cpp.al/
errno and libc https://dxuuu.xyz/errno.html
errno是内核设置还是libc设置?当然是libc
怎么验证? 简单来说就是同一个系统调用,调用syscall/通过汇编调用,观察errno变化
asm