mpark / variant

C++17 `std::variant` for C++11/14/17
https://mpark.github.io/variant
Boost Software License 1.0
659 stars 88 forks source link

Replaced the visitation mechanism to be `switch`-based. #56

Closed mpark closed 5 years ago

mpark commented 5 years ago

Follow-up to #52 CC @quicknir

mpark commented 5 years ago

The current C++11 implementation using a chain of ternary operators gets optimized nicely under Clang >= 4.0, but does not work well for any GCC.

Clang 3.9: https://travis-ci.org/mpark/variant/jobs/474776453

5: Test command: /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark "--benchmark_color=true"
5: Test timeout computed to be: 10000000
5: 2019-01-03 10:15:54
5: Running /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark
5: Run on (2 X 2300 MHz CPU s)
5: CPU Caches:
5:   L1 Data 32K (x1)
5:   L1 Instruction 32K (x1)
5:   L2 Unified 256K (x1)
5:   L3 Unified 46080K (x1)
5: Load Average: 1.30, 1.00, 0.46
5: ---------------------------------------------------
5: Benchmark            Time           CPU Iterations
5: ---------------------------------------------------
5: visit1<1>         2801 ns       2801 ns     249965
5: visit1<2>         9519 ns       9519 ns      73370
5: visit1<3>        11382 ns      11382 ns      61655
5: visit1<5>        11409 ns      11409 ns      61178
5: visit1<8>        10320 ns      10320 ns      67497
5: visit1<15>       10483 ns      10483 ns      66962
5: visit1<32>       12492 ns      12492 ns      55995
5: visit1<65>       11254 ns      11253 ns      62465
5: visit1<128>      10673 ns      10673 ns      65601
5: 
1/2 Test #5: execute.visit.1.mpark ............   Passed    7.60 sec
test 6
    Start 6: execute.visit.1.mpark-dev
6: Test command: /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark-dev "--benchmark_color=true"
6: Test timeout computed to be: 10000000
6: 2019-01-03 10:16:02
6: Running /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark-dev
6: Run on (2 X 2300 MHz CPU s)
6: CPU Caches:
6:   L1 Data 32K (x1)
6:   L1 Instruction 32K (x1)
6:   L2 Unified 256K (x1)
6:   L3 Unified 46080K (x1)
6: Load Average: 1.25, 1.00, 0.47
6: ---------------------------------------------------
6: Benchmark            Time           CPU Iterations
6: ---------------------------------------------------
6: visit1<1>         2843 ns       2842 ns     248350
6: visit1<2>         4347 ns       4347 ns     161527
6: visit1<3>         5755 ns       5755 ns     121301
6: visit1<5>         7205 ns       7205 ns      96040
6: visit1<8>         7921 ns       7921 ns      89186
6: visit1<15>       11592 ns      11592 ns      60326
6: visit1<32>       19671 ns      19609 ns      36296
6: visit1<65>       40742 ns      40741 ns      17328
6: visit1<128>      72147 ns      72147 ns       9872

Clang 4.0: https://travis-ci.org/mpark/variant/jobs/474776455

5: Test command: /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark "--benchmark_color=true"
5: Test timeout computed to be: 10000000
5: 2019-01-03 10:18:31
5: Running /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark
5: Run on (2 X 2300 MHz CPU s)
5: CPU Caches:
5:   L1 Data 32K (x1)
5:   L1 Instruction 32K (x1)
5:   L2 Unified 256K (x1)
5:   L3 Unified 46080K (x1)
5: Load Average: 1.22, 1.08, 0.52
5: ---------------------------------------------------
5: Benchmark            Time           CPU Iterations
5: ---------------------------------------------------
5: visit1<1>         2868 ns       2868 ns     241164
5: visit1<2>        10869 ns      10869 ns      64441
5: visit1<3>        10247 ns      10247 ns      69007
5: visit1<5>        11963 ns      11963 ns      58924
5: visit1<8>        11442 ns      11442 ns      60372
5: visit1<15>       10584 ns      10583 ns      64373
5: visit1<32>       10750 ns      10750 ns      66075
5: visit1<65>       12604 ns      12604 ns      56448
5: visit1<128>      19759 ns      19759 ns      34593
5: 
1/2 Test #5: execute.visit.1.mpark ............   Passed    7.72 sec
test 6
    Start 6: execute.visit.1.mpark-dev
6: Test command: /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark-dev "--benchmark_color=true"
6: Test timeout computed to be: 10000000
6: 2019-01-03 10:18:38
6: Running /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark-dev
6: Run on (2 X 2300 MHz CPU s)
6: CPU Caches:
6:   L1 Data 32K (x1)
6:   L1 Instruction 32K (x1)
6:   L2 Unified 256K (x1)
6:   L3 Unified 46080K (x1)
6: Load Average: 1.21, 1.07, 0.52
6: ---------------------------------------------------
6: Benchmark            Time           CPU Iterations
6: ---------------------------------------------------
6: visit1<1>         2896 ns       2896 ns     242707
6: visit1<2>         4413 ns       4412 ns     159670
6: visit1<3>         3852 ns       3851 ns     179790
6: visit1<5>         2977 ns       2977 ns     235361
6: visit1<8>         2954 ns       2954 ns     234164
6: visit1<15>        3022 ns       3021 ns     232849
6: visit1<32>        3833 ns       3833 ns     181819
6: visit1<65>        4925 ns       4924 ns     143037
6: visit1<128>       3833 ns       3833 ns     180677

GCC 4.9: https://travis-ci.org/mpark/variant/jobs/474776437

5: Test command: /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark "--benchmark_color=true"
5: Test timeout computed to be: 10000000
5: 2019-01-03 09:56:50
5: Running /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark
5: Run on (2 X 2300 MHz CPU s)
5: CPU Caches:
5:   L1 Data 32K (x1)
5:   L1 Instruction 32K (x1)
5:   L2 Unified 256K (x1)
5:   L3 Unified 46080K (x1)
5: Load Average: 1.09, 0.87, 0.41
5: ---------------------------------------------------
5: Benchmark            Time           CPU Iterations
5: ---------------------------------------------------
5: visit1<1>         3790 ns       3789 ns     183728
5: visit1<2>        11373 ns      11373 ns      61747
5: visit1<3>        12018 ns      12018 ns      58302
5: visit1<5>        10370 ns      10370 ns      68172
5: visit1<8>        10569 ns      10569 ns      65518
5: visit1<15>       11400 ns      11400 ns      61299
5: visit1<32>       12453 ns      12436 ns      55540
5: visit1<65>       10548 ns      10545 ns      66160
5: visit1<128>      10768 ns      10767 ns      65310
5: 
1/2 Test #5: execute.visit.1.mpark ............   Passed    7.73 sec
test 6
    Start 6: execute.visit.1.mpark-dev
6: Test command: /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark-dev "--benchmark_color=true"
6: Test timeout computed to be: 10000000
6: 2019-01-03 09:56:58
6: Running /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark-dev
6: Run on (2 X 2300 MHz CPU s)
6: CPU Caches:
6:   L1 Data 32K (x1)
6:   L1 Instruction 32K (x1)
6:   L2 Unified 256K (x1)
6:   L3 Unified 46080K (x1)
6: Load Average: 1.08, 0.87, 0.42
6: ---------------------------------------------------
6: Benchmark            Time           CPU Iterations
6: ---------------------------------------------------
6: visit1<1>         6873 ns       6873 ns     101490
6: visit1<2>         4174 ns       4174 ns     168032
6: visit1<3>         6520 ns       6520 ns     107322
6: visit1<5>         8068 ns       8068 ns      86855
6: visit1<8>         9532 ns       9532 ns      74410
6: visit1<15>       11307 ns      11307 ns      61789
6: visit1<32>       20581 ns      20581 ns      34392
6: visit1<65>       38014 ns      38015 ns      18733
6: visit1<128>      69736 ns      69735 ns       9869

GCC 8: https://travis-ci.org/mpark/variant/jobs/474776448

5: Test command: /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark "--benchmark_color=true"
5: Test timeout computed to be: 10000000
5: 2019-01-03 10:09:28
5: Running /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark
5: Run on (2 X 2300 MHz CPU s)
5: CPU Caches:
5:   L1 Data 32K (x1)
5:   L1 Instruction 32K (x1)
5:   L2 Unified 256K (x1)
5:   L3 Unified 46080K (x1)
5: Load Average: 1.15, 1.04, 0.50
5: ---------------------------------------------------
5: Benchmark            Time           CPU Iterations
5: ---------------------------------------------------
5: visit1<1>         3805 ns       3805 ns     182961
5: visit1<2>        10565 ns      10565 ns      66086
5: visit1<3>        11357 ns      11357 ns      61588
5: visit1<5>        11736 ns      11736 ns      59659
5: visit1<8>        12278 ns      12278 ns      57865
5: visit1<15>       12271 ns      12270 ns      57095
5: visit1<32>       11354 ns      11354 ns      61476
5: visit1<65>       12317 ns      12316 ns      56952
5: visit1<128>      10754 ns      10754 ns      64382
5: 
1/2 Test #5: execute.visit.1.mpark ............   Passed    7.76 sec
test 6
    Start 6: execute.visit.1.mpark-dev
6: Test command: /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark-dev "--benchmark_color=true"
6: Test timeout computed to be: 10000000
6: 2019-01-03 10:09:36
6: Running /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark-dev
6: Run on (2 X 2300 MHz CPU s)
6: CPU Caches:
6:   L1 Data 32K (x1)
6:   L1 Instruction 32K (x1)
6:   L2 Unified 256K (x1)
6:   L3 Unified 46080K (x1)
6: Load Average: 1.14, 1.04, 0.50
6: ---------------------------------------------------
6: Benchmark            Time           CPU Iterations
6: ---------------------------------------------------
6: visit1<1>         5757 ns       5757 ns     121563
6: visit1<2>         4699 ns       4699 ns     149398
6: visit1<3>         6353 ns       6353 ns     109737
6: visit1<5>         7424 ns       7424 ns      93428
6: visit1<8>         8623 ns       8623 ns      81464
6: visit1<15>       13569 ns      13568 ns      50929
6: visit1<32>       20769 ns      20769 ns      33469
6: visit1<65>       38117 ns      38116 ns      18302
6: visit1<128>      69499 ns      69499 ns       9904
mpark commented 5 years ago

The current C++14/17 implementation of recursive switch works well for all of Clang, but has some issues under GCC. Clang 7: https://travis-ci.org/mpark/variant/jobs/474776465

5: Test command: /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark "--benchmark_color=true"
5: Test timeout computed to be: 10000000
5: 2019-01-03 10:31:18
5: Running /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark
5: Run on (2 X 2500 MHz CPU s)
5: CPU Caches:
5:   L1 Data 32K (x1)
5:   L1 Instruction 32K (x1)
5:   L2 Unified 256K (x1)
5:   L3 Unified 30720K (x1)
5: Load Average: 1.32, 1.07, 0.51
5: ---------------------------------------------------
5: Benchmark            Time           CPU Iterations
5: ---------------------------------------------------
5: visit1<1>         2864 ns       2862 ns     239638
5: visit1<2>        11464 ns      11455 ns      61048
5: visit1<3>        10244 ns      10239 ns      68836
5: visit1<5>         9989 ns       9981 ns      70500
5: visit1<8>        11452 ns      11444 ns      60595
5: visit1<15>       12330 ns      12321 ns      56714
5: visit1<32>       10617 ns      10605 ns      66692
5: visit1<65>       10673 ns      10667 ns      65805
5: visit1<128>      14185 ns      14174 ns      48735
5: 
1/2 Test #5: execute.visit.1.mpark ............   Passed    7.64 sec
test 6
    Start 6: execute.visit.1.mpark-dev
6: Test command: /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark-dev "--benchmark_color=true"
6: Test timeout computed to be: 10000000
6: 2019-01-03 10:31:25
6: Running /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark-dev
6: Run on (2 X 2500 MHz CPU s)
6: CPU Caches:
6:   L1 Data 32K (x1)
6:   L1 Instruction 32K (x1)
6:   L2 Unified 256K (x1)
6:   L3 Unified 30720K (x1)
6: Load Average: 1.30, 1.07, 0.52
6: ---------------------------------------------------
6: Benchmark            Time           CPU Iterations
6: ---------------------------------------------------
6: visit1<1>         2850 ns       2846 ns     245817
6: visit1<2>         3436 ns       3434 ns     201585
6: visit1<3>         2940 ns       2939 ns     236889
6: visit1<5>         3838 ns       3836 ns     182712
6: visit1<8>         2973 ns       2972 ns     236992
6: visit1<15>        2957 ns       2956 ns     237050
6: visit1<32>        2951 ns       2949 ns     237763
6: visit1<65>        2981 ns       2980 ns     235619
6: visit1<128>       9646 ns       9641 ns      72540

GCC 8: https://travis-ci.org/mpark/variant/jobs/474776450

5: Test command: /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark "--benchmark_color=true"
5: Test timeout computed to be: 10000000
5: 2019-01-03 10:11:32
5: Running /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark
5: Run on (2 X 2500 MHz CPU s)
5: CPU Caches:
5:   L1 Data 32K (x1)
5:   L1 Instruction 32K (x1)
5:   L2 Unified 256K (x1)
5:   L3 Unified 30720K (x1)
5: Load Average: 1.14, 0.89, 0.41
5: ---------------------------------------------------
5: Benchmark            Time           CPU Iterations
5: ---------------------------------------------------
5: visit1<1>         2799 ns       2798 ns     250365
5: visit1<2>         9468 ns       9463 ns      73890
5: visit1<3>        11481 ns      11475 ns      61645
5: visit1<5>        11572 ns      11566 ns      60407
5: visit1<8>        11592 ns      11586 ns      60486
5: visit1<15>       11585 ns      11579 ns      58922
5: visit1<32>       10668 ns      10663 ns      65947
5: visit1<65>       10632 ns      10626 ns      64057
5: visit1<128>      11657 ns      11651 ns      54142
5: 
1/2 Test #5: execute.visit.1.mpark ............   Passed    7.54 sec
test 6
    Start 6: execute.visit.1.mpark-dev
6: Test command: /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark-dev "--benchmark_color=true"
6: Test timeout computed to be: 10000000
6: 2019-01-03 10:11:40
6: Running /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark-dev
6: Run on (2 X 2500 MHz CPU s)
6: CPU Caches:
6:   L1 Data 32K (x1)
6:   L1 Instruction 32K (x1)
6:   L2 Unified 256K (x1)
6:   L3 Unified 30720K (x1)
6: Load Average: 1.11, 0.90, 0.41
6: ---------------------------------------------------
6: Benchmark            Time           CPU Iterations
6: ---------------------------------------------------
6: visit1<1>         2827 ns       2825 ns     248203
6: visit1<2>         4179 ns       4176 ns     166883
6: visit1<3>         2858 ns       2856 ns     241306
6: visit1<5>         2858 ns       2856 ns     241945
6: visit1<8>         3826 ns       3823 ns     181854
6: visit1<15>        2850 ns       2848 ns     244742
6: visit1<32>        3830 ns       3827 ns     182111
6: visit1<65>        6870 ns       6865 ns     102572
6: visit1<128>      12199 ns      12164 ns      58194

GCC >= 7 perform well compared to the manual jump table, but it does not inline/collapse the recursive switch. This is reflected in our numbers in the progression from visit<32>, visit<65> to visit<128>. It can also be seen concretely in the generated assembly: https://godbolt.org/z/Rfcv9C

GCC 6: https://travis-ci.org/mpark/variant/jobs/474776444

5: Test command: /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark "--benchmark_color=true"
5: Test timeout computed to be: 10000000
5: 2019-01-03 10:04:26
5: Running /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark
5: Run on (2 X 2500 MHz CPU s)
5: CPU Caches:
5:   L1 Data 32K (x1)
5:   L1 Instruction 32K (x1)
5:   L2 Unified 256K (x1)
5:   L3 Unified 30720K (x1)
5: Load Average: 1.25, 0.94, 0.43
5: ---------------------------------------------------
5: Benchmark            Time           CPU Iterations
5: ---------------------------------------------------
5: visit1<1>         2807 ns       2805 ns     249207
5: visit1<2>         9514 ns       9514 ns      72992
5: visit1<3>        11995 ns      11995 ns      58128
5: visit1<5>        10296 ns      10296 ns      68163
5: visit1<8>        10375 ns      10375 ns      67930
5: visit1<15>       10541 ns      10541 ns      64657
5: visit1<32>       10666 ns      10666 ns      63728
5: visit1<65>       10516 ns      10516 ns      66144
5: visit1<128>      12532 ns      12532 ns      57244
5: 
1/2 Test #5: execute.visit.1.mpark ............   Passed    7.58 sec
test 6
    Start 6: execute.visit.1.mpark-dev
6: Test command: /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark-dev "--benchmark_color=true"
6: Test timeout computed to be: 10000000
6: 2019-01-03 10:04:34
6: Running /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark-dev
6: Run on (2 X 2500 MHz CPU s)
6: CPU Caches:
6:   L1 Data 32K (x1)
6:   L1 Instruction 32K (x1)
6:   L2 Unified 256K (x1)
6:   L3 Unified 30720K (x1)
6: Load Average: 1.21, 0.95, 0.44
6: ---------------------------------------------------
6: Benchmark            Time           CPU Iterations
6: ---------------------------------------------------
6: visit1<1>         7626 ns       7615 ns      90840
6: visit1<2>         7466 ns       7462 ns      94088
6: visit1<3>        10818 ns      10818 ns      64757
6: visit1<5>        11162 ns      11160 ns      63228
6: visit1<8>        10955 ns      10953 ns      63846
6: visit1<15>       11100 ns      11098 ns      63626
6: visit1<32>       11341 ns      11338 ns      61223
6: visit1<65>       12561 ns      12561 ns      55673
6: visit1<128>      16291 ns      16290 ns      42286

GCC < 7 produces a jump table of size 32 (our switch size) even when unneeded. This can be seen for example, in the generated assembly for visit<8>: https://godbolt.org/z/8XKQGD. Although it seems like it's not actually worse than the manual jump table aside from the visit<1> case 🤔

mpark commented 5 years ago

Follow-up on the current C++14/17 implementation with regards to GCC >= 7.

I was concerned that the non-inlined recursive switch would grow faster than the manual jump table approach, but having tested for visit<200> and visit<250>, the manual jump table performance also grows, at approximately the same rate it seems.

GCC 8: https://travis-ci.org/mpark/variant/jobs/475164795

5: Test command: /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark "--benchmark_color=true"
5: Test timeout computed to be: 10000000
5: 2019-01-04 05:49:01
5: Running /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark
5: Run on (2 X 2600 MHz CPU s)
5: CPU Caches:
5:   L1 Data 32K (x1)
5:   L1 Instruction 32K (x1)
5:   L2 Unified 256K (x1)
5:   L3 Unified 20480K (x1)
5: Load Average: 1.00, 1.00, 0.70
5: ---------------------------------------------------
5: Benchmark            Time           CPU Iterations
5: ---------------------------------------------------
5: visit1<200>      17584 ns      17574 ns      40196
5: visit1<250>      20115 ns      20105 ns      34640
5: 
1/2 Test #5: execute.visit.1.mpark ............   Passed    1.84 sec
test 6
    Start 6: execute.visit.1.mpark-dev
6: Test command: /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark-dev "--benchmark_color=true"
6: Test timeout computed to be: 10000000
6: 2019-01-04 05:49:03
6: Running /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark-dev
6: Run on (2 X 2600 MHz CPU s)
6: CPU Caches:
6:   L1 Data 32K (x1)
6:   L1 Instruction 32K (x1)
6:   L2 Unified 256K (x1)
6:   L3 Unified 20480K (x1)
6: Load Average: 1.00, 1.00, 0.70
6: ---------------------------------------------------
6: Benchmark            Time           CPU Iterations
6: ---------------------------------------------------
6: visit1<200>      16236 ns      16227 ns      43488
6: visit1<250>      18135 ns      18126 ns      38061

This is also roughly the case for GCC < 7.

GCC 6: https://travis-ci.org/mpark/variant/jobs/475164789

5: Test command: /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark "--benchmark_color=true"
5: Test timeout computed to be: 10000000
5: 2019-01-04 05:34:19
5: Running /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark
5: Run on (2 X 2600 MHz CPU s)
5: CPU Caches:
5:   L1 Data 32K (x1)
5:   L1 Instruction 32K (x1)
5:   L2 Unified 256K (x1)
5:   L3 Unified 20480K (x1)
5: Load Average: 1.40, 1.13, 0.78
5: ---------------------------------------------------
5: Benchmark            Time           CPU Iterations
5: ---------------------------------------------------
5: visit1<200>      15825 ns      15825 ns      43707
5: visit1<250>      20641 ns      20641 ns      32384
5: 
1/2 Test #5: execute.visit.1.mpark ............   Passed    1.80 sec
test 6
    Start 6: execute.visit.1.mpark-dev
6: Test command: /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark-dev "--benchmark_color=true"
6: Test timeout computed to be: 10000000
6: 2019-01-04 05:34:20
6: Running /home/travis/build/mpark/variant/build/visit.1/execute.visit.1.mpark-dev
6: Run on (2 X 2600 MHz CPU s)
6: CPU Caches:
6:   L1 Data 32K (x1)
6:   L1 Instruction 32K (x1)
6:   L2 Unified 256K (x1)
6:   L3 Unified 20480K (x1)
6: Load Average: 1.40, 1.13, 0.78
6: ---------------------------------------------------
6: Benchmark            Time           CPU Iterations
6: ---------------------------------------------------
6: visit1<200>      18900 ns      18889 ns      37390
6: visit1<250>      20114 ns      20104 ns      34828

It seems safe to me to proceed with the current implementation, for C++14/17.

mpark commented 5 years ago

I've reverted the C++11 implementation to be the manual jump table approach, and only updated C++14/17 implementation to be recursive switch.

The following is the full matrix of the benchmarks for visit.1 https://travis-ci.org/mpark/variant/builds/476107115