Closed WojciechMula closed 5 years ago
For correct analysis, I recommend running the instrumented benchmark with the -v
flag for verbose outputs, you something like this...
pospopcnt_u16_avx512_csa instructions per cycle 1.93, cycles per 16-bit word: 0.154, instructions per 16-bit word 0.298
min: 7970 cycles, 15417 instructions, 1 branch mis., 540 cache ref., 0 cache mis.
avg: 8155.0 cycles, 15417.0 instructions, 1.2 branch mis., 644.1 cache ref., 1.6 cache mis.
See that the average and min. cycles are different by about 2%? This gives you an estimate of your incertitude of that measure.
You can also run the test multiple times...
$ ./instrumented_benchmark | grep pospopcnt_u16_avx512_csa
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.152
$ ./instrumented_benchmark | grep pospopcnt_u16_avx512_csa
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.151
$ ./instrumented_benchmark | grep pospopcnt_u16_avx512_csa
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.152
$ ./instrumented_benchmark | grep pospopcnt_u16_avx512_csa
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.152
$ ./instrumented_benchmark | grep pospopcnt_u16_avx512_csa
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.155
$ ./instrumented_benchmark | grep pospopcnt_u16_avx512_csa
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.153
$ ./instrumented_benchmark | grep pospopcnt_u16_avx512_csa
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.156
$ ./instrumented_benchmark | grep pospopcnt_u16_avx512_csa
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.151
So the values range between 0.151 and 0.156... so about a 3% incertitude.
You definitively do not want to get something that goes from 0.134 to 0.151 cycles per word for the same algorithm. Let me know if this happens and it is one of my test machines. Sometimes they get misconfigured.
@WojciechMula I think you put the AVX-512 code in the AVX-2 subroutine?
If moved to the correct AVX-512 subroutine I get the following results (10 runs):
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.144
avx512popcnt cycles per 16-bit word: 0.096
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.139
avx512popcnt cycles per 16-bit word: 0.087
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.142
avx512popcnt cycles per 16-bit word: 0.089
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.152
avx512popcnt cycles per 16-bit word: 0.101
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.145
avx512popcnt cycles per 16-bit word: 0.097
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.145
avx512popcnt cycles per 16-bit word: 0.094
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.142
avx512popcnt cycles per 16-bit word: 0.090
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.150
avx512popcnt cycles per 16-bit word: 0.103
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.142
avx512popcnt cycles per 16-bit word: 0.092
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.150
avx512popcnt cycles per 16-bit word: 0.097
@mklarqvist in this MR indeed I made a mistak, but on the AVX512 machine I have got the correct code :)
Seems we use exactly the same machine.
@WojciechMula I think I used @lemire i3-8121U (Sky Lake) machine for these results. This is a good pull request though! Shaves off a few instructions and moves us very, very, close to the magic 1.5-fold ratio compared to popcnt.
Hmmmmmmmmmm. If we have a large number of data points we get close to parity with popcnt.
$ for i in {1..25}; do ./instrumented_benchmark -n 10000000000 | tail -n 3 | head -n 2; done
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.106
avx512popcnt cycles per 16-bit word: 0.087
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.118
avx512popcnt cycles per 16-bit word: 0.101
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.116
avx512popcnt cycles per 16-bit word: 0.096
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.116
avx512popcnt cycles per 16-bit word: 0.095
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.114
avx512popcnt cycles per 16-bit word: 0.098
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.120
avx512popcnt cycles per 16-bit word: 0.101
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.110
avx512popcnt cycles per 16-bit word: 0.087
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.122
avx512popcnt cycles per 16-bit word: 0.103
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.115
avx512popcnt cycles per 16-bit word: 0.097
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.113
avx512popcnt cycles per 16-bit word: 0.090
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.119
avx512popcnt cycles per 16-bit word: 0.098
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.115
avx512popcnt cycles per 16-bit word: 0.092
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.122
avx512popcnt cycles per 16-bit word: 0.104
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.119
avx512popcnt cycles per 16-bit word: 0.092
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.114
avx512popcnt cycles per 16-bit word: 0.090
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.118
avx512popcnt cycles per 16-bit word: 0.095
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.119
avx512popcnt cycles per 16-bit word: 0.098
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.113
avx512popcnt cycles per 16-bit word: 0.090
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.120
avx512popcnt cycles per 16-bit word: 0.098
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.126
avx512popcnt cycles per 16-bit word: 0.105
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.116
avx512popcnt cycles per 16-bit word: 0.093
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.110
avx512popcnt cycles per 16-bit word: 0.091
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.123
avx512popcnt cycles per 16-bit word: 0.101
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.127
avx512popcnt cycles per 16-bit word: 0.109
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.113
avx512popcnt cycles per 16-bit word: 0.090
@mklarqvist i3-8121U is Cannon Lake.
I also have a Skylake-X... The Cannon Lake is different.
Skylake-X has downclocking whereas this Cannon Lake does not... Please read this blog post with care...
https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/
@mklarqvist It is normal that the cost per unit of work goes down when you have large inputs because the fixed overhead goes to zero (on a per unit of work basis), but I am not sure I understand why it would keep speeding up once you reach a certain point.
From first principles, the speed per unit of work should go up when you go from tiny inputs to more sizeable ones, but then, afterward, the speed should either remain flat or go up with memory access bottlenecks... It should not keep on going fast and faster.
Something is off if our speed keeps on improving the larger the input gets. There should be limits.
Could the machine start caching partial results when repeatedly computing on similar-sized datasets?
Where would these results get cached to? These are distinct processes.
I just synced with master and ran it...
$ for i in {1..25}; do ./instrumented_benchmark -n 1000000000 | tail -n 3 | head -n 2; done
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.130
avx512popcnt cycles per 16-bit word: 0.085
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.145
avx512popcnt cycles per 16-bit word: 0.096
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.130
avx512popcnt cycles per 16-bit word: 0.088
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.140
avx512popcnt cycles per 16-bit word: 0.099
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.130
avx512popcnt cycles per 16-bit word: 0.089
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.141
avx512popcnt cycles per 16-bit word: 0.095
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.133
avx512popcnt cycles per 16-bit word: 0.090
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.149
avx512popcnt cycles per 16-bit word: 0.101
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.135
avx512popcnt cycles per 16-bit word: 0.097
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.148
avx512popcnt cycles per 16-bit word: 0.109
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.142
avx512popcnt cycles per 16-bit word: 0.094
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.133
avx512popcnt cycles per 16-bit word: 0.086
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.147
avx512popcnt cycles per 16-bit word: 0.096
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.140
avx512popcnt cycles per 16-bit word: 0.103
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.134
avx512popcnt cycles per 16-bit word: 0.094
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.138
avx512popcnt cycles per 16-bit word: 0.101
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.134
avx512popcnt cycles per 16-bit word: 0.090
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.143
avx512popcnt cycles per 16-bit word: 0.092
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.144
avx512popcnt cycles per 16-bit word: 0.104
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.141
avx512popcnt cycles per 16-bit word: 0.089
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.133
avx512popcnt cycles per 16-bit word: 0.094
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.139
avx512popcnt cycles per 16-bit word: 0.103
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.136
avx512popcnt cycles per 16-bit word: 0.099
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.131
avx512popcnt cycles per 16-bit word: 0.093
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.146
avx512popcnt cycles per 16-bit word: 0.091
There are run-to-run variations, sometimes higher than I'd like, but there is no clear trend toward greater speed.
Note that I did not run my tests with this PR merged.
If this PR brings about magical "caching" of the results in a process-to-process manner... I'd hurry to patent it. It is maybe some really cool new AI technology.
@mklarqvist
I don't think 1 billion should overflow 32-bit counters... should it?
This code has been merge during this conversation. I will close this PR. I do still get reproducible (deflated) throughputs with large number of input values though:
$ for i in {250000000,500000000,1000000000}; do for j in {1..25}; do ./instrumented_benchmark -n $i -i 100 | tail -n 3 | head -n 2; done; done
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.141
avx512popcnt cycles per 16-bit word: 0.093
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.133
avx512popcnt cycles per 16-bit word: 0.090
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.140
avx512popcnt cycles per 16-bit word: 0.097
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.133
avx512popcnt cycles per 16-bit word: 0.091
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.130
avx512popcnt cycles per 16-bit word: 0.086
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.130
avx512popcnt cycles per 16-bit word: 0.087
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.137
avx512popcnt cycles per 16-bit word: 0.094
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.144
avx512popcnt cycles per 16-bit word: 0.103
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.134
avx512popcnt cycles per 16-bit word: 0.092
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.136
avx512popcnt cycles per 16-bit word: 0.092
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.139
avx512popcnt cycles per 16-bit word: 0.098
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.132
avx512popcnt cycles per 16-bit word: 0.087
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.143
avx512popcnt cycles per 16-bit word: 0.099
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.137
avx512popcnt cycles per 16-bit word: 0.092
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.134
avx512popcnt cycles per 16-bit word: 0.093
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.139
avx512popcnt cycles per 16-bit word: 0.095
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.144
avx512popcnt cycles per 16-bit word: 0.100
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.135
avx512popcnt cycles per 16-bit word: 0.092
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.131
avx512popcnt cycles per 16-bit word: 0.086
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.140
avx512popcnt cycles per 16-bit word: 0.093
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.139
avx512popcnt cycles per 16-bit word: 0.096
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.135
avx512popcnt cycles per 16-bit word: 0.089
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.135
avx512popcnt cycles per 16-bit word: 0.094
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.145
avx512popcnt cycles per 16-bit word: 0.099
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.137
avx512popcnt cycles per 16-bit word: 0.092
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.175
avx512popcnt cycles per 16-bit word: 0.095
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.176
avx512popcnt cycles per 16-bit word: 0.095
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.181
avx512popcnt cycles per 16-bit word: 0.101
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.176
avx512popcnt cycles per 16-bit word: 0.092
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.176
avx512popcnt cycles per 16-bit word: 0.094
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.173
avx512popcnt cycles per 16-bit word: 0.091
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.176
avx512popcnt cycles per 16-bit word: 0.095
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.171
avx512popcnt cycles per 16-bit word: 0.091
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.176
avx512popcnt cycles per 16-bit word: 0.097
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.173
avx512popcnt cycles per 16-bit word: 0.097
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.179
avx512popcnt cycles per 16-bit word: 0.097
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.176
avx512popcnt cycles per 16-bit word: 0.096
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.174
avx512popcnt cycles per 16-bit word: 0.096
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.175
avx512popcnt cycles per 16-bit word: 0.091
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.172
avx512popcnt cycles per 16-bit word: 0.090
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.174
avx512popcnt cycles per 16-bit word: 0.094
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.175
avx512popcnt cycles per 16-bit word: 0.097
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.183
avx512popcnt cycles per 16-bit word: 0.096
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.175
avx512popcnt cycles per 16-bit word: 0.095
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.174
avx512popcnt cycles per 16-bit word: 0.096
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.171
avx512popcnt cycles per 16-bit word: 0.093
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.179
avx512popcnt cycles per 16-bit word: 0.095
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.184
avx512popcnt cycles per 16-bit word: 0.102
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.174
avx512popcnt cycles per 16-bit word: 0.091
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.176
avx512popcnt cycles per 16-bit word: 0.092
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.120
avx512popcnt cycles per 16-bit word: 0.097
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.116
avx512popcnt cycles per 16-bit word: 0.094
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.119
avx512popcnt cycles per 16-bit word: 0.097
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.117
avx512popcnt cycles per 16-bit word: 0.093
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.117
avx512popcnt cycles per 16-bit word: 0.095
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.117
avx512popcnt cycles per 16-bit word: 0.095
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.118
avx512popcnt cycles per 16-bit word: 0.095
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.118
avx512popcnt cycles per 16-bit word: 0.092
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.115
avx512popcnt cycles per 16-bit word: 0.094
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.114
avx512popcnt cycles per 16-bit word: 0.091
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.114
avx512popcnt cycles per 16-bit word: 0.091
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.118
avx512popcnt cycles per 16-bit word: 0.093
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.114
avx512popcnt cycles per 16-bit word: 0.087
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.114
avx512popcnt cycles per 16-bit word: 0.090
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.108
avx512popcnt cycles per 16-bit word: 0.085
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.113
avx512popcnt cycles per 16-bit word: 0.091
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.117
avx512popcnt cycles per 16-bit word: 0.089
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.117
avx512popcnt cycles per 16-bit word: 0.094
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.120
avx512popcnt cycles per 16-bit word: 0.100
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.118
avx512popcnt cycles per 16-bit word: 0.093
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.119
avx512popcnt cycles per 16-bit word: 0.096
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.122
avx512popcnt cycles per 16-bit word: 0.099
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.121
avx512popcnt cycles per 16-bit word: 0.100
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.116
avx512popcnt cycles per 16-bit word: 0.094
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.124
avx512popcnt cycles per 16-bit word: 0.098
I will investigate.
This is slightly faster[1] code, but requires the AVX512VBMI extension. Can we assume this extension? I also checked four-fold sum, but couldn't measure any further improvement.
[1] Our tool
instrumented_benchmark
shows significant variations of measurements. On CNL I got 0.134 to 0.151 cycles per word, thus I'm not sure if it's faster. Current baseline is 0.148 cycles.