148 - Githubissues

wanghenshui commented 7 months ago

errno and libc https://dxuuu.xyz/errno.html

errno是内核设置还是libc设置？当然是libc

怎么验证？简单来说就是同一个系统调用，调用syscall/通过汇编调用，观察errno变化


static int use_wrapper(int cmd, union bpf_attr *attr, unsigned int size) {
    long ret;
    errno = 0;
    ret = syscall(__NR_bpf, cmd, attr, size);
    if (ret < 0)
        printf("wrapped syscall failed, ret=%d, errno=%d\n", ret, errno);
    else
        printf("wrapped syscall succeeded\n");
}

asm

static int use_raw(int cmd, union bpf_attr *attr, unsigned int size) {
    long ret;
    errno = 0;
    __asm__(
        "movq %1, %%rax\n"        /* syscall number */
        "movq %2, %%rdi\n"        /* arg1 */
        "movq %3, %%rsi\n"        /* arg2 */
        "movq %4, %%rdx\n"        /* arg3 */
        "syscall\n"
        "movq %%rax, %0\n"

        /* retval */
        : "=r"(ret)

        /* input operands */
        : "r"((long)__NR_bpf), "r"((long)cmd), "r"((long)attr), "r"((long)size)

        /* clobbers */
        : "rax", "rdi", "rsi", "rdx"
       );

    /* Check return value */
    if (ret < 0)
        printf("raw syscall failed, ret=%d, errno=%d\n", ret, errno);
    else
        printf("raw syscall succeeded\n");
}

wanghenshui commented 7 months ago

Option Soup: the subtle pitfalls of combining compiler flags https://hacks.mozilla.org/2024/01/option-soup-the-subtle-pitfalls-of-combining-compiler-flags/ 傻逼locale问题，虽然你是静态连接libstdcxx-static，但是locale并不static

wanghenshui commented 7 months ago

https://discourse.llvm.org/t/rfc-upstreaming-clangir/76587/19

之前聊到的MLIR 在c/c++上的落地 CIR准备合入到LLVM

wanghenshui commented 7 months ago

gcc 7.3 bug一例 class template argument deduction fails in new-expression https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85883


template <typename T1, typename T2>
struct Bar
{
    Bar(T1, T2) { }
};

int main()
{
    auto x = Bar(1, 2);
    auto y = new Bar(3, 4);
    auto z = new Bar{3, 4};
}

wanghenshui commented 7 months ago

Unexpected Ways Memory Subsystem Interacts with Branch Prediction

https://johnnysswlab.com/unexpected-ways-memory-subsystem-interacts-with-branch-prediction/

https://github.com/ibogosavljevic/johnysswlab/blob/master/2023-12-branches-memory/binary_search.cpp#L53

int binary_search(int* array, int number_of_elements, int key) {
    int low = 0, high = number_of_elements-1, mid;
    while(low <= high) {
        mid = (low + high)/2;

        if (st == search_type::REGULAR) {
            if(array[mid] < key)
                low = mid + 1; 
            else if(array[mid] == key)
                return mid;
            else
                high = mid-1;
        }

        if (st == search_type::CONDITIONAL_MOVE) {
            int middle = array[mid];
            if (middle == key) {
                    return mid;
                }

            int new_low = mid + 1;
            int new_high = mid - 1;
            __asm__ (
                "cmp %[array_middle], %[key];"
                "cmovae %[new_low], %[low];"
                "cmovb %[new_high], %[high];"
                : [low] "+&r"(low), [high] "+&r"(high)
                : [new_low] "g"(new_low), [new_high] "g"(new_high), [array_middle] "g"(middle), [key] "g"(key)
                : "cc"
            );
        }

        if (st == search_type::ARITHMETIC) {
            int middle = array[mid];
            if (middle == key) {
                return mid;
            }

            int new_low = mid + 1;
            int new_high = mid - 1;
            int condition = array[mid] < key;
            int condition_true_mask = -condition;
            int condition_false_mask = -(1 - condition);

            low += condition_true_mask & (new_low - low);
            high += condition_false_mask & (new_high - high); 

        }
    }
    return -1;
}

Array Size (in elements)	Original	Conditional Moves	Arithmetics
4 K	Runtime: 0.22 sInstr: 434 M CPI: 1.96Mem. Data Volume: 0.45 GB	Runtime: 0.14 sInstr: 785 MCPI: 0.728Mem. Data Volume: 0.25 GB	Runtime: 0.19 sInstr: 1.102 MCPI: 0.69Mem. Data Volume: 0.32 GB
16 K	Runtime: 0.26 sInstr: 511 MCPI: 2.01Mem. Data Volume: 0.49 GB	Runtime: 0.19 sInstr: 928 MCPI: 0.77Mem. Data Volume: 0.39 GB	Runtime: 0.24 sInstr: 1.308 MCPI: 0.72Mem. Data Volume: 0.46 GB
64 K	Runtime: 0.32 sInstr: 584 MCPI: 2.143Mem. Data Volume: 0.48 GB	Runtime: 0.24 sInstr: 1.064 MCPI: 0.90Mem. Data Volume: 0.25 GB	Runtime: 0.31Instr: 1.504CPI: 0.82Mem. Data Volume: 0.26 GB
256 K	Runtime: 0.43 sInstr: 646 MCPI: 2.59Mem. Data Volume: 0.36 GB	Runtime: 0.39 sInstr: 1.199 MCPI: 1.28Mem. Data Volume: 0.32 GB	Runtime: 0.47 sInstr: 1.698 MCPI: 1.09Mem. Data Volume: 0.36 GB
1 M	Runtime: 0.56 sInstr: 727 MCPI: 3.05Mem. Data Volume: 0.67 GB	Runtime: 0.59 sInstr: 1.333 MCPI: 1.72Mem. Data Volume: 0.59 GB	Runtime: 0.70 sInstr: 1.891 MCPI: 1.42Mem. Data Volume: 0.68 GB
4 M	Runtime: 1.127 sInstr: 798 MCPI: 4.65Mem. Data Volume: 9.94 GB	Runtime: 1.48 sInstr: 1.467 MCPI: 3.1Mem. Data Volume: 3.75 GB	Runtime: 1.59 sInstr: 2.084 MCPI: 2.45Mem. Data Volume: 3.9 GB
16 M	Runtime: 1.65 sInstr: 870 MCPI: 6.26Mem. Data Volume: 18.48 GB	Runtime: 2.75 sInstr: 1.601CPI: 4.16Mem. Data Volume: 6.95 GB	Runtime: 2.90 sInstr: 2.277 MCPI: 3.18Mem. Data Volume: 7.05 GB

课外阅读

快排代码

static int partition(std::vector<float>& vector, int low, int high) {
    float pivot = vector[high];
    int i = (low - 1);
    for (int j = low; j < high; j++) {
        if (vector[j] <= pivot) {
            i++;
            std::swap(vector[i], vector[j]);
        }
    }
    i = i + 1;
    std::swap(vector[i], vector[high]);
    return i;
}

    static int partition(std::vector<float>& vector, int low, int high) {
        float* vector_i = &vector[low];
        float* vector_j = &vector[low];
        float* vector_end = &vector[high];

        __m128 pivot = _mm_load_ss(&vector[0] + high);
        while(true) {
            if (vector_j >= vector_end) break;
            __m128 vec_i = _mm_load_ss(vector_i);
            __m128 vec_j = _mm_load_ss(vector_j);

            __m128 compare = _mm_cmplt_ss(vec_j, pivot); // if (vec_j < pivot)
            __m128 new_vec_i = _mm_blendv_ps(vec_i, vec_j, compare);
            __m128 new_vec_j = _mm_blendv_ps(vec_j, vec_i, compare);

            int increment = _mm_extract_epi32(_mm_castps_si128(compare), 0) & 0x1;

            _mm_store_ss(vector_i, new_vec_i);
            _mm_store_ss(vector_j, new_vec_j);

            vector_i += increment;

            vector_j++;
        }

        std::swap(*vector_i, *vector_end);
        return (vector_i - &vector[0]);
    }

代码在这里 https://github.com/ibogosavljevic/johnysswlab/blob/master/2022-01-sort/

wanghenshui commented 7 months ago

https://www.cppstories.com/2024/constexpr-number-parsing-cpp23/

有句讲句 from_chars接口有点难用

auto res = std::from_chars(str.data() + start, str.data() + str.size(), result);    
if (res.ec == std::errc{}) {...}

可以用结构化绑定，更好看一点

c++26可以直接这么用

auto res = std::from_chars(str.data() + start, str.data() + str.size(), result);    
if (res) { ... }

wanghenshui commented 7 months ago

https://github.com/gaujay/indivi_collection/blob/main/bench/sparque/README.md

wanghenshui commented 7 months ago

https://www.meetingcpp.com/blog/items/Converting-a-string-view-to-a-time-point-in-Cpp20.html

std::chrono::sys_seconds convertToTimePoint(std::string_view fmtstring)
{
    std::chrono::sys_seconds syssec;
    std::istringstream in{std::string{raw_data}};//raw_data is a string_view
    in >> std::chrono::parse(fmtstring, syssec);
    return syssec;
}

没有std::chrono::parse？

std::chrono::sys_seconds convertToTimePoint(std::string_view fmtstring)
{
    std::chrono::sys_seconds syssec;
    std::istringstream in{std::string{raw_data}};//raw_data is a string_view
    std::tm tm = {};
    in >> std::get_time(&tm, fmtstring.data());
    std::time_t time = std::mktime(&tm);
    if(in.good())
      syssec =  std::chrono::time_point_cast< std::chrono::sys_seconds::duration>(std::chrono::system_clock::from_time_t(time));
    else throw std::runtime_error(std::string("Could not convert ") + raw_data + " to a date time value with format string " + fmtstring);
    return syssec;
}

wanghenshui commented 7 months ago

popcnt的前世今生？

最近群聊里传了一个面试题

实现统计1的个数(汉明权重 hammingWeight)，使用popcnt的算法对硬件不友好，有无绕过的思路

显然这个哥们的第一个实现是

int hammingWeight_popcnt(uint64_t n) {
  return __builtin_popcountll(n);
}

当然c++20页支持 https://en.cppreference.com/w/cpp/numeric/popcount 一行，觉得自己很帅，当然面试官不喜欢，提示不要用popcnt，所谓的对硬件不友好指的应该是部分硬件没有这个指令

又或者性能原因？难道GPU上的popcnt性能很差？按下不表

直接贴实现

int hammingWeight(uint64_t n) {
    int ret = 0;
    while (n) {
        n &= n - 1;
        ret++;
    }
    return ret;
}

其实开启 O2 加上 -march=native，大家都会生成相同的popcnt, 早在2016年Lemire大哥就发现了

https://lemire.me/blog/2016/05/23/the-surprising-cleverness-of-modern-compilers/

附上llvm检测的代码 https://github.sheincorp.cn/llvm-mirror/llvm/blob/f36485f7ac2a8d72ad0e0f2134c17fd365272285/lib/Transforms/Scalar/LoopIdiomRecognize.cpp#L960

只开O2可能保守场景不会生成popcnt

如果不用popcnt，代码的性能和popcnt差距大吗？或者说，popcnt有危害吗？比如延迟高？

直接上llvm-mca分析 https://godbolt.org/z/odox8Wdr5

首先插入一个简单粗暴的教程，如何看懂llvm-mca https://llvm.org/docs/CommandGuide/llvm-mca.html

就是机器码分析器，模拟机器码执行效果，我们不用装llvm-mca，直接用godbolt内置的工具。代码已经生成好了

直接贴popcnt代码的结果吧

Iterations:        100
Instructions:      200
Total Cycles:      57
Total uOps:        200

Dispatch Width:    6
uOps Per Cycle:    3.51
IPC:               3.51
Block RThroughput: 0.5

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.25                        popcnt    rax, rdi
 1      5     0.50                  U     ret

Resources:
[0]   - Zn3AGU0
[1]   - Zn3AGU1
[2]   - Zn3AGU2
[3]   - Zn3ALU0
[4]   - Zn3ALU1
[5]   - Zn3ALU2
[6]   - Zn3ALU3
[7]   - Zn3BRU1
[8]   - Zn3FPP0
[9]   - Zn3FPP1
[10]  - Zn3FPP2
[11]  - Zn3FPP3
[12.0] - Zn3FPP45
[12.1] - Zn3FPP45
[13]  - Zn3FPSt
[14.0] - Zn3LSU
[14.1] - Zn3LSU
[14.2] - Zn3LSU
[15.0] - Zn3Load
[15.1] - Zn3Load
[15.2] - Zn3Load
[16.0] - Zn3Store
[16.1] - Zn3Store

Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12.0] [12.1] [13]   [14.0] [14.1] [14.2] [15.0] [15.1] [15.2] [16.0] [16.1] 
0.33   0.33   0.34   0.50   0.33   0.33   0.34   0.50    -      -      -      -      -      -      -     0.33   0.33   0.34   0.33   0.33   0.34    -      -     

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12.0] [12.1] [13]   [14.0] [14.1] [14.2] [15.0] [15.1] [15.2] [16.0] [16.1] Instructions:
 -      -      -      -     0.33   0.33   0.34    -      -      -      -      -      -      -      -      -      -      -      -      -      -      -      -     popcnt rax, rdi
0.33   0.33   0.34   0.50    -      -      -     0.50    -      -      -      -      -      -      -     0.33   0.33   0.34   0.33   0.33   0.34    -      -     ret
warning: found a return instruction in the input assembly sequence.
note: program counter updates are ignored.

先记下这几个数字

Iterations: 100 Instructions: 200 Total Cycles: 57 Total uOps: 200

Dispatch Width: 6 uOps Per Cycle: 3.51 IPC: 3.51 Block RThroughput: 0.5

重点参数是IPC, uOps Per Cycle, 和 Block RThroughput (Block Reciprocal Throughput).

IPC就是模拟的总指令数字除以总cycle数一般这个就表示吞吐了 Instructions / Total Cycles 显然这个值越高越好

Block RThroughput (Block Reciprocal Throughput) 就是Block Throughput(每个cycle能运行几次block)的倒数就是 Total Cycles / Iterations 每次运行能用几个cycle的意思，显然，这个值越低越好

Instructions / Iterations 代表每次迭代能执行几次指令显然 Instructions / Iterations / Block RThroughput = IPC 这个数比直接算IPC大点（有误差。。。。）你就当他是无影响的最大IPC吧 (循环数据引用问题可能导致影响。假设循环展开无影响)

uOps Per Cycle,就是模拟的微指令数总和除以总cycle数字 Total uOps/ Total Cycles ，这个和IPC含义差不多，显然这个值越高越好，但是要关注Dispatch Width - uOps Per Cycle 差值， Dispatch Width 表示最大发射指令的并行带宽，相当于资源限制了，uOps Per Cycle表示实际模拟使用的带宽，显然越接近Dispatch Width越说明资源受限制，利用率太高了，相当于CPU高了，需要找到瓶颈来源

剩下的是执行模拟以及在哪里卡了，具体分析可以用-bottleneck-analysis，得本地搞了godbolt貌似不能玩

好了，根据上面的godbolt结果，直接把数据差异对比一下

另外网上搜到了google的两个实现，把数据补充上 https://godbolt.org/z/9nsczeT5c

int hammingWeightV2(uint64_t n) {
    n -= (n >> 1) & 0x5555555555555555ULL;
    n = ((n >> 2) & 0x3333333333333333ULL) + (n & 0x3333333333333333ULL);
    return (((n + (n >> 4)) & 0xF0F0F0F0F0F0F0FULL)
            * 0x101010101010101ULL) >> 56;
}

int hammingWeight_popcntV2(uint64_t n) {
    int64_t count = 0;
    asm("popcnt %1,%0" : "=r"(count) : "rm"(n) : "cc");
    return count;
}

实现	编译器版本	Dispatch Width	uOps/Cycle	IPC	Block RThroughput
popcnt	gcc 13.2	6	3.67	1.83	1.0
普通实现	gcc 13.2	4	3.94	3.54	2.5
popcnt	clang 17.0.1	6	4.59	2.75	1.0
普通实现	clang 17.0.1	6	4.78	3.83	1.7
popcnt v2	gcc 13.2	6	2.14	2.14	1.3
手写SWAR	gcc 13.2	6	4.09	3.72	3.8
popcnt v2	clang 17.0.1	6	3.67	1.83	1.0
手写SWAR	clang 17.0.1	6	4.09	3.72	3.8

能看出popcnt的Block RThroughput 低，这显然说明性能更好

然后看IPC和uOps/Cycle clang的明显比gcc的要高，但汇编说实话一个两行一个一行，这个没啥比较的意义了

重点和普通实现比，clang生成的汇编要比gcc好一点，Block RThroughput 低 IPC高，且没有特别接近Dispatch Width瓶颈

但说实话就差一个汇编这点差距根本比不出什么。只能大概说一下popcnt的汇编更少，性能更好而已

感觉SWAR这种看起来很屌, 但看mca分析感觉不太行我跑了个qb压测，但是网站挂了，还需要本地跑一下

https://github.com/wanghenshui/little_bm/blob/dev/hamming_weight/hamming_weight.cc

qb测试来看，即使Block RThroughput 很高，但IPC也很高，性能反而非常好

不过还需要其他机器测试，我的nuc是zen3+，性能不错

github CI机器，SWAR和popcnt就差不多了。

Running ./bm_hamming_weight
Run on (4 X 2868.73 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 512 KiB (x2)
  L3 Unified 32768 KiB (x1)
Load Average: 1.98, 0.54, 0.19
-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
BM_hammingWeight_popcnt/0          17.5 ns         17.5 ns     39446954
BM_hammingWeight_popcnt/128         173 ns          173 ns      4056917
BM_hammingWeight_popcnt/256         342 ns          342 ns      2051152
BM_hammingWeight_popcnt/512         679 ns          679 ns      1032384
BM_hammingWeight_popcnt/1024       1354 ns         1354 ns       517551
BM_hammingWeight/0                  124 ns          124 ns      5638895
BM_hammingWeight/128               2394 ns         2394 ns       280377
BM_hammingWeight/256               5511 ns         5511 ns       123293
BM_hammingWeight/512              12036 ns        12036 ns        58336
BM_hammingWeight/1024             25711 ns        25710 ns        27149
BM_hammingWeightV2/0/0             17.8 ns         17.8 ns     39494382
BM_hammingWeightV2/128              182 ns          182 ns      3848768
BM_hammingWeightV2/256              360 ns          360 ns      1943778
BM_hammingWeightV2/512              709 ns          709 ns       988825
BM_hammingWeightV2/1024            1418 ns         1418 ns       493319

家人们，需要你们的补充测试，各种机器, 复现代码https://github.com/wanghenshui/little_bm

话说回来，数1到底能干嘛？这里要引入汉明距离编辑距离相关的概念

简单理解就是查diff 纠错码之类的效果

popcnt的来源 http://www.talkchess.com/forum3/viewtopic.php?t=38521

上个世纪60年代，计算机还属于大型机百花齐放的年代，Control Data Corporation公司的CDC 机器卖的不错，国际象棋也在用这个软件。他们的场景就是棋盘格确认位置，所以实现了popcnt类似的能力，算位置坐标，美国国家安全局(NSA)发现了他们有这个能力，他们的新机器CDC 6000，政府采购并要求加上这个功能，主要是为了类似汉明距离之类的信息统计，相当于变相hash，用来实现校对diff之类的能力，所以也被叫做NSA Instruction (NSA指令)

这个指令也是那个时代的特殊产物把，算力不行并没有高级的hash能力，只能通过数1模拟，后来CPU性能提升渐渐的都不支持了，然后后来部分CPU支持部分CPU不支持，到现代全都捡回来

现在的CPU也有很多不支持popcnt指令，以至于游戏客户端领域会有popcnt patch之类的玩意，给玩家打patch绕过popcnt

https://github.com/ogurets/popcnt_emulator

还有什么能用到数1？

指纹？安全领域，这种更多是汉明距离场景的推广

能用到bitmap的地方，不过使用bitmap不一定非得算总数

比如 Hash Array Mapped Tries 结合tries的压缩优点 + bitmap定位槽，

bitmap浪费所以要压缩一下，位运算躲不了数1场景

再比如 Succinct Data Structures terarkdb的memtable用的就这玩意，压缩率高

关于popcnt的信息我就收集到这么多的，大家感兴趣还可以补充一下

参考

https://vaibhavsagar.com/blog/2019/09/08/popcount/ 一些资料汇总在这里搜到的。我一开始是根据群友聊的和关键字搜到hackernews上这篇文章的分享，介绍了背景和部分应用
https://abseil.io/fast/9 这里说的话我很赞同，性能测试是个周期性的工作，可能旧的代码有时候快，后面时代/硬件进步，又慢了
https://github.com/google/supersonic/blob/master/supersonic/utils/bits.h
https://stackoverflow.com/questions/28802692/how-is-popcnt-implemented-in-hardware

wanghenshui commented 7 months ago

boost charconv https://lists.boost.org/Archives/boost/2024/02/255821.php

把from_chars搬到c++11

我建议放弃c++11 ，2024了bro 文档 https://develop.charconv.cpp.al/

wanghenshui / cppweeklynews

148 #93