mklarqvist / positional-popcount

Fast C functions for the computing the positional popcount (pospopcnt).
Apache License 2.0
53 stars 5 forks source link

Horizontal sum of 16-bit counters #11

Closed WojciechMula closed 5 years ago

WojciechMula commented 5 years ago

This is slightly faster[1] code, but requires the AVX512VBMI extension. Can we assume this extension? I also checked four-fold sum, but couldn't measure any further improvement.

[1] Our tool instrumented_benchmark shows significant variations of measurements. On CNL I got 0.134 to 0.151 cycles per word, thus I'm not sure if it's faster. Current baseline is 0.148 cycles.

lemire commented 5 years ago

For correct analysis, I recommend running the instrumented benchmark with the -v flag for verbose outputs, you something like this...

pospopcnt_u16_avx512_csa                    instructions per cycle 1.93, cycles per 16-bit word:  0.154, instructions per 16-bit word 0.298
min:     7970 cycles,    15417 instructions,           1 branch mis.,      540 cache ref.,        0 cache mis.
avg:   8155.0 cycles,  15417.0 instructions,         1.2 branch mis.,    644.1 cache ref.,      1.6 cache mis.

See that the average and min. cycles are different by about 2%? This gives you an estimate of your incertitude of that measure.

You can also run the test multiple times...

$ ./instrumented_benchmark | grep pospopcnt_u16_avx512_csa
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.152
$ ./instrumented_benchmark | grep pospopcnt_u16_avx512_csa
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.151
$ ./instrumented_benchmark | grep pospopcnt_u16_avx512_csa
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.152
$ ./instrumented_benchmark | grep pospopcnt_u16_avx512_csa
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.152
$ ./instrumented_benchmark | grep pospopcnt_u16_avx512_csa
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.155
$ ./instrumented_benchmark | grep pospopcnt_u16_avx512_csa
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.153
$ ./instrumented_benchmark | grep pospopcnt_u16_avx512_csa
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.156
$ ./instrumented_benchmark | grep pospopcnt_u16_avx512_csa
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.151

So the values range between 0.151 and 0.156... so about a 3% incertitude.

You definitively do not want to get something that goes from 0.134 to 0.151 cycles per word for the same algorithm. Let me know if this happens and it is one of my test machines. Sometimes they get misconfigured.

mklarqvist commented 5 years ago

@WojciechMula I think you put the AVX-512 code in the AVX-2 subroutine?

mklarqvist commented 5 years ago

If moved to the correct AVX-512 subroutine I get the following results (10 runs):

pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.144 
avx512popcnt                                cycles per 16-bit word:  0.096 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.139 
avx512popcnt                                cycles per 16-bit word:  0.087 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.142 
avx512popcnt                                cycles per 16-bit word:  0.089 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.152 
avx512popcnt                                cycles per 16-bit word:  0.101 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.145 
avx512popcnt                                cycles per 16-bit word:  0.097 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.145 
avx512popcnt                                cycles per 16-bit word:  0.094 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.142 
avx512popcnt                                cycles per 16-bit word:  0.090 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.150 
avx512popcnt                                cycles per 16-bit word:  0.103 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.142 
avx512popcnt                                cycles per 16-bit word:  0.092 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.150 
avx512popcnt                                cycles per 16-bit word:  0.097
WojciechMula commented 5 years ago

@mklarqvist in this MR indeed I made a mistak, but on the AVX512 machine I have got the correct code :)

WojciechMula commented 5 years ago

Seems we use exactly the same machine.

mklarqvist commented 5 years ago

@WojciechMula I think I used @lemire i3-8121U (Sky Lake) machine for these results. This is a good pull request though! Shaves off a few instructions and moves us very, very, close to the magic 1.5-fold ratio compared to popcnt.

mklarqvist commented 5 years ago

Hmmmmmmmmmm. If we have a large number of data points we get close to parity with popcnt.

$ for i in {1..25}; do ./instrumented_benchmark -n 10000000000 | tail -n 3 | head -n 2; done
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.106 
avx512popcnt                                cycles per 16-bit word:  0.087 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.118 
avx512popcnt                                cycles per 16-bit word:  0.101 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.116 
avx512popcnt                                cycles per 16-bit word:  0.096 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.116 
avx512popcnt                                cycles per 16-bit word:  0.095 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.114 
avx512popcnt                                cycles per 16-bit word:  0.098 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.120 
avx512popcnt                                cycles per 16-bit word:  0.101 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.110 
avx512popcnt                                cycles per 16-bit word:  0.087 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.122 
avx512popcnt                                cycles per 16-bit word:  0.103 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.115 
avx512popcnt                                cycles per 16-bit word:  0.097 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.113 
avx512popcnt                                cycles per 16-bit word:  0.090 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.119 
avx512popcnt                                cycles per 16-bit word:  0.098 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.115 
avx512popcnt                                cycles per 16-bit word:  0.092 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.122 
avx512popcnt                                cycles per 16-bit word:  0.104 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.119 
avx512popcnt                                cycles per 16-bit word:  0.092 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.114 
avx512popcnt                                cycles per 16-bit word:  0.090 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.118 
avx512popcnt                                cycles per 16-bit word:  0.095 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.119 
avx512popcnt                                cycles per 16-bit word:  0.098 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.113 
avx512popcnt                                cycles per 16-bit word:  0.090 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.120 
avx512popcnt                                cycles per 16-bit word:  0.098 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.126 
avx512popcnt                                cycles per 16-bit word:  0.105 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.116 
avx512popcnt                                cycles per 16-bit word:  0.093 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.110 
avx512popcnt                                cycles per 16-bit word:  0.091 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.123 
avx512popcnt                                cycles per 16-bit word:  0.101 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.127 
avx512popcnt                                cycles per 16-bit word:  0.109 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.113 
avx512popcnt                                cycles per 16-bit word:  0.090
lemire commented 5 years ago

@mklarqvist i3-8121U is Cannon Lake.

I also have a Skylake-X... The Cannon Lake is different.

Skylake-X has downclocking whereas this Cannon Lake does not... Please read this blog post with care...

https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/

lemire commented 5 years ago

@mklarqvist It is normal that the cost per unit of work goes down when you have large inputs because the fixed overhead goes to zero (on a per unit of work basis), but I am not sure I understand why it would keep speeding up once you reach a certain point.

From first principles, the speed per unit of work should go up when you go from tiny inputs to more sizeable ones, but then, afterward, the speed should either remain flat or go up with memory access bottlenecks... It should not keep on going fast and faster.

lemire commented 5 years ago

Something is off if our speed keeps on improving the larger the input gets. There should be limits.

lemire commented 5 years ago

Could the machine start caching partial results when repeatedly computing on similar-sized datasets?

Where would these results get cached to? These are distinct processes.

lemire commented 5 years ago

I just synced with master and ran it...

$ for i in {1..25}; do ./instrumented_benchmark -n 1000000000 | tail -n 3 | head -n 2; done
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.130
avx512popcnt                                cycles per 16-bit word:  0.085
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.145
avx512popcnt                                cycles per 16-bit word:  0.096
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.130
avx512popcnt                                cycles per 16-bit word:  0.088
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.140
avx512popcnt                                cycles per 16-bit word:  0.099
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.130
avx512popcnt                                cycles per 16-bit word:  0.089
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.141
avx512popcnt                                cycles per 16-bit word:  0.095
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.133
avx512popcnt                                cycles per 16-bit word:  0.090
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.149
avx512popcnt                                cycles per 16-bit word:  0.101
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.135
avx512popcnt                                cycles per 16-bit word:  0.097
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.148
avx512popcnt                                cycles per 16-bit word:  0.109
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.142
avx512popcnt                                cycles per 16-bit word:  0.094
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.133
avx512popcnt                                cycles per 16-bit word:  0.086
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.147
avx512popcnt                                cycles per 16-bit word:  0.096
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.140
avx512popcnt                                cycles per 16-bit word:  0.103
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.134
avx512popcnt                                cycles per 16-bit word:  0.094
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.138
avx512popcnt                                cycles per 16-bit word:  0.101
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.134
avx512popcnt                                cycles per 16-bit word:  0.090
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.143
avx512popcnt                                cycles per 16-bit word:  0.092
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.144
avx512popcnt                                cycles per 16-bit word:  0.104
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.141
avx512popcnt                                cycles per 16-bit word:  0.089
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.133
avx512popcnt                                cycles per 16-bit word:  0.094
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.139
avx512popcnt                                cycles per 16-bit word:  0.103
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.136
avx512popcnt                                cycles per 16-bit word:  0.099
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.131
avx512popcnt                                cycles per 16-bit word:  0.093
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.146
avx512popcnt                                cycles per 16-bit word:  0.091

There are run-to-run variations, sometimes higher than I'd like, but there is no clear trend toward greater speed.

lemire commented 5 years ago

Note that I did not run my tests with this PR merged.

If this PR brings about magical "caching" of the results in a process-to-process manner... I'd hurry to patent it. It is maybe some really cool new AI technology.

lemire commented 5 years ago

@mklarqvist

I don't think 1 billion should overflow 32-bit counters... should it?

mklarqvist commented 5 years ago

This code has been merge during this conversation. I will close this PR. I do still get reproducible (deflated) throughputs with large number of input values though:

$ for i in {250000000,500000000,1000000000}; do for j in {1..25}; do ./instrumented_benchmark -n $i -i 100 | tail -n 3 | head -n 2; done; done
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.141 
avx512popcnt                                cycles per 16-bit word:  0.093 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.133 
avx512popcnt                                cycles per 16-bit word:  0.090 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.140 
avx512popcnt                                cycles per 16-bit word:  0.097 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.133 
avx512popcnt                                cycles per 16-bit word:  0.091 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.130 
avx512popcnt                                cycles per 16-bit word:  0.086 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.130 
avx512popcnt                                cycles per 16-bit word:  0.087 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.137 
avx512popcnt                                cycles per 16-bit word:  0.094 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.144 
avx512popcnt                                cycles per 16-bit word:  0.103 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.134 
avx512popcnt                                cycles per 16-bit word:  0.092 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.136 
avx512popcnt                                cycles per 16-bit word:  0.092 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.139 
avx512popcnt                                cycles per 16-bit word:  0.098 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.132 
avx512popcnt                                cycles per 16-bit word:  0.087 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.143 
avx512popcnt                                cycles per 16-bit word:  0.099 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.137 
avx512popcnt                                cycles per 16-bit word:  0.092 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.134 
avx512popcnt                                cycles per 16-bit word:  0.093 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.139 
avx512popcnt                                cycles per 16-bit word:  0.095 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.144 
avx512popcnt                                cycles per 16-bit word:  0.100 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.135 
avx512popcnt                                cycles per 16-bit word:  0.092 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.131 
avx512popcnt                                cycles per 16-bit word:  0.086 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.140 
avx512popcnt                                cycles per 16-bit word:  0.093 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.139 
avx512popcnt                                cycles per 16-bit word:  0.096 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.135 
avx512popcnt                                cycles per 16-bit word:  0.089 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.135 
avx512popcnt                                cycles per 16-bit word:  0.094 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.145 
avx512popcnt                                cycles per 16-bit word:  0.099 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.137 
avx512popcnt                                cycles per 16-bit word:  0.092 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.175 
avx512popcnt                                cycles per 16-bit word:  0.095 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.176 
avx512popcnt                                cycles per 16-bit word:  0.095 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.181 
avx512popcnt                                cycles per 16-bit word:  0.101 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.176 
avx512popcnt                                cycles per 16-bit word:  0.092 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.176 
avx512popcnt                                cycles per 16-bit word:  0.094 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.173 
avx512popcnt                                cycles per 16-bit word:  0.091 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.176 
avx512popcnt                                cycles per 16-bit word:  0.095 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.171 
avx512popcnt                                cycles per 16-bit word:  0.091 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.176 
avx512popcnt                                cycles per 16-bit word:  0.097 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.173 
avx512popcnt                                cycles per 16-bit word:  0.097 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.179 
avx512popcnt                                cycles per 16-bit word:  0.097 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.176 
avx512popcnt                                cycles per 16-bit word:  0.096 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.174 
avx512popcnt                                cycles per 16-bit word:  0.096 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.175 
avx512popcnt                                cycles per 16-bit word:  0.091 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.172 
avx512popcnt                                cycles per 16-bit word:  0.090 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.174 
avx512popcnt                                cycles per 16-bit word:  0.094 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.175 
avx512popcnt                                cycles per 16-bit word:  0.097 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.183 
avx512popcnt                                cycles per 16-bit word:  0.096 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.175 
avx512popcnt                                cycles per 16-bit word:  0.095 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.174 
avx512popcnt                                cycles per 16-bit word:  0.096 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.171 
avx512popcnt                                cycles per 16-bit word:  0.093 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.179 
avx512popcnt                                cycles per 16-bit word:  0.095 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.184 
avx512popcnt                                cycles per 16-bit word:  0.102 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.174 
avx512popcnt                                cycles per 16-bit word:  0.091 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.176 
avx512popcnt                                cycles per 16-bit word:  0.092 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.120 
avx512popcnt                                cycles per 16-bit word:  0.097 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.116 
avx512popcnt                                cycles per 16-bit word:  0.094 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.119 
avx512popcnt                                cycles per 16-bit word:  0.097 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.117 
avx512popcnt                                cycles per 16-bit word:  0.093 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.117 
avx512popcnt                                cycles per 16-bit word:  0.095 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.117 
avx512popcnt                                cycles per 16-bit word:  0.095 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.118 
avx512popcnt                                cycles per 16-bit word:  0.095 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.118 
avx512popcnt                                cycles per 16-bit word:  0.092 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.115 
avx512popcnt                                cycles per 16-bit word:  0.094 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.114 
avx512popcnt                                cycles per 16-bit word:  0.091 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.114 
avx512popcnt                                cycles per 16-bit word:  0.091 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.118 
avx512popcnt                                cycles per 16-bit word:  0.093 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.114 
avx512popcnt                                cycles per 16-bit word:  0.087 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.114 
avx512popcnt                                cycles per 16-bit word:  0.090 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.108 
avx512popcnt                                cycles per 16-bit word:  0.085 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.113 
avx512popcnt                                cycles per 16-bit word:  0.091 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.117 
avx512popcnt                                cycles per 16-bit word:  0.089 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.117 
avx512popcnt                                cycles per 16-bit word:  0.094 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.120 
avx512popcnt                                cycles per 16-bit word:  0.100 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.118 
avx512popcnt                                cycles per 16-bit word:  0.093 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.119 
avx512popcnt                                cycles per 16-bit word:  0.096 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.122 
avx512popcnt                                cycles per 16-bit word:  0.099 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.121 
avx512popcnt                                cycles per 16-bit word:  0.100 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.116 
avx512popcnt                                cycles per 16-bit word:  0.094 
pospopcnt_u16_avx512_csa                    cycles per 16-bit word:  0.124 
avx512popcnt                                cycles per 16-bit word:  0.098

I will investigate.