tayloraswift / swift-noise

Generate and compose commonly-used procedural noises and distributions, in pure Swift
https://swiftinit.org/docs/swift-noise/noise
Apache License 2.0

replacing tuples with SIMD - DONT MERGE - seems to be ~40+% SLOWER #18

Open heckj opened 3 months ago

heckj commented 3 months ago

Since we were talking about this, I took the time to set it up - but after all the conversions, it turns out that it has only HURT performance (based on the benchmark comparison).

swift package benchmark baseline compare bdb4ef08 --format markdown:

Comparing results between 'bdb4ef08' and 'Current_run'

Host 'Sparrow.local' with 8 'arm64' processors with 16 GB memory, running:
Darwin Kernel Version 23.5.0: Wed May  1 20:16:51 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T8103

ExternalBenchmarks

cell2d metrics

Time (wall clock): results within specified thresholds, fold down for details.

| Time (wall clock) (ns) * | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08 | 458 | 583 | 583 | 584 | 625 | 625 | 38292 | 1048576 |
| Current_run | 708 | 792 | 792 | 833 | 834 | 875 | 54416 | 923477 |
| Δ | 250 | 209 | 209 | 249 | 209 | 250 | 16124 | -125099 |
| Improvement % | -55 | -36 | -36 | -43 | -33 | -40 | -42 | -125099 |

Throughput (# / s): results within specified thresholds, fold down for details.

| Throughput (# / s) (K) | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08 | 2183 | 1716 | 1716 | 1713 | 1601 | 1601 | 26 | 1048576 |
| Current_run | 1412 | 1264 | 1264 | 1201 | 1199 | 1144 | 18 | 923477 |
| Δ | -771 | -452 | -452 | -512 | -402 | -457 | -8 | -125099 |
| Improvement % | -35 | -26 | -26 | -30 | -25 | -29 | -31 | -125099 |

cell3d metrics

Time (wall clock): results within specified thresholds, fold down for details.

| Time (wall clock) (ns) * | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08 | 458 | 542 | 583 | 583 | 584 | 625 | 32667 | 1048576 |
| Current_run | 708 | 792 | 792 | 833 | 834 | 875 | 51083 | 922510 |
| Δ | 250 | 250 | 209 | 250 | 250 | 250 | 18416 | -126066 |
| Improvement % | -55 | -46 | -36 | -43 | -43 | -40 | -56 | -126066 |

Throughput (# / s): results within specified thresholds, fold down for details.

| Throughput (# / s) (K) | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08 | 2183 | 1845 | 1716 | 1716 | 1713 | 1601 | 31 | 1048576 |
| Current_run | 1412 | 1264 | 1264 | 1201 | 1199 | 1144 | 20 | 922510 |
| Δ | -771 | -581 | -452 | -515 | -514 | -457 | -11 | -126066 |
| Improvement % | -35 | -31 | -26 | -30 | -30 | -29 | -35 | -126066 |

cell_tiling3d metrics

Time (wall clock): results within specified thresholds, fold down for details.

| Time (wall clock) (ns) * | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08 | 458 | 583 | 583 | 584 | 625 | 666 | 66667 | 1048576 |
| Current_run | 708 | 792 | 792 | 833 | 834 | 875 | 51625 | 916598 |
| Δ | 250 | 209 | 209 | 249 | 209 | 209 | -15042 | -131978 |
| Improvement % | -55 | -36 | -36 | -43 | -33 | -31 | 23 | -131978 |

Throughput (# / s): results within specified thresholds, fold down for details.

| Throughput (# / s) (K) | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08 | 2183 | 1716 | 1716 | 1713 | 1601 | 1502 | 15 | 1048576 |
| Current_run | 1412 | 1264 | 1264 | 1201 | 1199 | 1144 | 19 | 916598 |
| Δ | -771 | -452 | -452 | -512 | -402 | -358 | 4 | -131978 |
| Improvement % | -35 | -26 | -26 | -30 | -25 | -24 | 27 | -131978 |

classic3d metrics

Time (wall clock): results within specified thresholds, fold down for details.

| Time (wall clock) (μs) * | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08 | 6750 | 6875 | 6919 | 7127 | 7211 | 7543 | 72959 | 138634 |
| Current_run | 9875 | 10047 | 10087 | 10087 | 10127 | 10295 | 60750 | 95852 |
| Δ | 3125 | 3172 | 3168 | 2960 | 2916 | 2752 | -12209 | -42782 |
| Improvement % | -46 | -46 | -46 | -42 | -40 | -36 | 17 | -42782 |

Throughput (# / s): results within specified thresholds, fold down for details.

| Throughput (# / s) (K) | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08 | 148 | 146 | 145 | 140 | 139 | 133 | 14 | 138634 |
| Current_run | 101 | 100 | 99 | 99 | 99 | 97 | 16 | 95852 |
| Δ | -47 | -46 | -46 | -41 | -40 | -36 | 2 | -42782 |
| Improvement % | -32 | -32 | -32 | -29 | -29 | -27 | 14 | -42782 |

classic_tiling3d metrics

Time (wall clock): results within specified thresholds, fold down for details.

| Time (wall clock) (ns) * | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08 | 708 | 792 | 833 | 833 | 834 | 916 | 35958 | 1009758 |
| Current_run | 1083 | 1208 | 1208 | 1209 | 1250 | 1292 | 47791 | 669209 |
| Δ | 375 | 416 | 375 | 376 | 416 | 376 | 11833 | -340549 |
| Improvement % | -53 | -53 | -45 | -45 | -50 | -41 | -33 | -340549 |

Throughput (# / s): results within specified thresholds, fold down for details.

| Throughput (# / s) (K) | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08 | 1412 | 1264 | 1201 | 1201 | 1199 | 1093 | 28 | 1009758 |
| Current_run | 923 | 828 | 828 | 827 | 800 | 774 | 21 | 669209 |
| Δ | -489 | -436 | -373 | -374 | -399 | -319 | -7 | -340549 |
| Improvement % | -35 | -34 | -31 | -31 | -33 | -29 | -25 | -340549 |

classic_tiling_fbm3d metrics

Time (wall clock): results within specified thresholds, fold down for details.

| Time (wall clock) (μs) * | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08 | 6833 | 6959 | 7003 | 7003 | 7087 | 7583 | 47125 | 138651 |
| Current_run | 9958 | 10127 | 10167 | 10167 | 10215 | 10335 | 57084 | 95148 |
| Δ | 3125 | 3168 | 3164 | 3164 | 3128 | 2752 | 9959 | -43503 |
| Improvement % | -46 | -46 | -45 | -45 | -44 | -36 | -21 | -43503 |

Throughput (# / s): results within specified thresholds, fold down for details.

| Throughput (# / s) (K) | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08 | 146 | 144 | 143 | 143 | 141 | 132 | 21 | 138651 |
| Current_run | 100 | 99 | 98 | 98 | 98 | 97 | 18 | 95148 |
| Δ | -46 | -45 | -45 | -45 | -43 | -35 | -3 | -43503 |
| Improvement % | -32 | -31 | -31 | -31 | -30 | -27 | -14 | -43503 |

disk2d metrics

Time (wall clock): results within specified thresholds, fold down for details.

| Time (wall clock) (ms) * | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08 | 9314 | 9339 | 9347 | 9388 | 9486 | 9609 | 9630 | 107 |
| Current_run | 24508 | 24576 | 24707 | 24969 | 30228 | 43058 | 43058 | 39 |
| Δ | 15194 | 15237 | 15360 | 15581 | 20742 | 33449 | 33428 | -68 |
| Improvement % | -163 | -163 | -164 | -166 | -219 | -348 | -347 | -68 |

Throughput (# / s): results within specified thresholds, fold down for details.

| Throughput (# / s) (#) | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08 | 107 | 107 | 107 | 107 | 105 | 104 | 104 | 107 |
| Current_run | 41 | 41 | 40 | 40 | 33 | 23 | 23 | 39 |
| Δ | -66 | -66 | -67 | -67 | -72 | -81 | -81 | -68 |
| Improvement % | -62 | -62 | -63 | -63 | -69 | -78 | -78 | -68 |

gradient2d metrics

Time (wall clock): results within specified thresholds, fold down for details.

| Time (wall clock) (μs) * | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08 | 6750 | 6875 | 6919 | 6959 | 7003 | 7503 | 42417 | 140146 |
| Current_run | 9916 | 10047 | 10087 | 10127 | 10167 | 10295 | 96958 | 95770 |
| Δ | 3166 | 3172 | 3168 | 3168 | 3164 | 2792 | 54541 | -44376 |
| Improvement % | -47 | -46 | -46 | -46 | -45 | -37 | -129 | -44376 |

Throughput (# / s): results within specified thresholds, fold down for details.

| Throughput (# / s) (K) | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08 | 148 | 146 | 145 | 144 | 143 | 133 | 24 | 140146 |
| Current_run | 101 | 100 | 99 | 99 | 98 | 97 | 10 | 95770 |
| Δ | -47 | -46 | -46 | -45 | -45 | -36 | -14 | -44376 |
| Improvement % | -32 | -32 | -32 | -31 | -31 | -27 | -58 | -44376 |

gradient3d metrics

Time (wall clock): results within specified thresholds, fold down for details.

| Time (wall clock) (μs) * | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08 | 6791 | 6919 | 6919 | 6959 | 7003 | 7503 | 44709 | 139935 |
| Current_run | 9916 | 10047 | 10087 | 10127 | 10167 | 10295 | 75500 | 95903 |
| Δ | 3125 | 3128 | 3168 | 3168 | 3164 | 2792 | 30791 | -44032 |
| Improvement % | -46 | -45 | -46 | -46 | -45 | -37 | -69 | -44032 |

Throughput (# / s): results within specified thresholds, fold down for details.

| Throughput (# / s) (K) | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08 | 147 | 145 | 145 | 144 | 143 | 133 | 22 | 139935 |
| Current_run | 101 | 100 | 99 | 99 | 98 | 97 | 13 | 95903 |
| Δ | -46 | -45 | -46 | -45 | -45 | -36 | -9 | -44032 |
| Improvement % | -31 | -31 | -32 | -31 | -31 | -27 | -41 | -44032 |
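
As a reading aid for the tables above: the `Improvement %` rows appear consistent with a relative change against the baseline, with the sign flipped for lower-is-better metrics so that a negative number always means a regression. The sketch below reproduces two of the reported values under that assumption. The function names are hypothetical, and the rounding rule (round to nearest) is inferred from the numbers rather than taken from the benchmark tool.

```swift
// Assumed formula for a lower-is-better metric such as wall-clock time:
// positive when the current run is faster than the baseline.
func timeImprovement(baseline: Double, current: Double) -> Int {
    Int(((baseline - current) / baseline * 100).rounded())
}

// Assumed formula for a higher-is-better metric such as throughput:
// positive when the current run processes more per second.
func throughputImprovement(baseline: Double, current: Double) -> Int {
    Int(((current - baseline) / baseline * 100).rounded())
}

// cell2d, p0 column, arm64 run above.
print(timeImprovement(baseline: 458, current: 708))         // -55
print(throughputImprovement(baseline: 2183, current: 1412)) // -35
```

Because the two metrics use different denominators, a time regression of about 55% corresponds to a throughput regression of about 35%, which is why the same benchmark shows two different percentages.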

tayloraswift commented 3 months ago

huh, lemme try running this on x86_64 when i get a chance

tayloraswift commented 3 months ago

my results are quite different. with the exception of disk2d, most of the benchmarks show a modest improvement. i don’t know what’s going on with the 99th percentiles though. maybe it was run in a noisier environment.

Comparing results between 'bdb4ef08.x86_64' and 'Current_run'

Host '832f7bfa3820' with 12 'x86_64' processors with 30 GB memory, running:
#35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May  7 09:00:52 UTC 2

ExternalBenchmarks

cell2d metrics

Time (wall clock): results within specified thresholds, fold down for details.

| Time (wall clock) (ns) * | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08.x86_64 | 1207 | 1258 | 1261 | 1273 | 1309 | 1559 | 40876 | 608656 |
| Current_run | 1084 | 1135 | 1142 | 1156 | 1255 | 4287 | 51909 | 588363 |
| Δ | -123 | -123 | -119 | -117 | -54 | 2728 | 11033 | -20293 |
| Improvement % | 10 | 10 | 9 | 9 | 4 | -175 | -27 | -20293 |

Throughput (# / s): results within specified thresholds, fold down for details.

| Throughput (# / s) (K) | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08.x86_64 | 828 | 795 | 793 | 786 | 764 | 642 | 24 | 608656 |
| Current_run | 923 | 881 | 876 | 865 | 797 | 233 | 19 | 588363 |
| Δ | 95 | 86 | 83 | 79 | 33 | -409 | -5 | -20293 |
| Improvement % | 11 | 11 | 10 | 10 | 4 | -64 | -21 | -20293 |

cell3d metrics

Time (wall clock): results within specified thresholds, fold down for details.

| Time (wall clock) (ns) * | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08.x86_64 | 1237 | 1257 | 1261 | 1274 | 1337 | 4057 | 53729 | 576524 |
| Current_run | 1092 | 1136 | 1143 | 1153 | 1223 | 3831 | 44478 | 626635 |
| Δ | -145 | -121 | -118 | -121 | -114 | -226 | -9251 | 50111 |
| Improvement % | 12 | 10 | 9 | 9 | 9 | 6 | 17 | 50111 |

Throughput (# / s): results within specified thresholds, fold down for details.

| Throughput (# / s) (K) | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08.x86_64 | 808 | 796 | 793 | 785 | 748 | 247 | 19 | 576524 |
| Current_run | 916 | 881 | 875 | 867 | 818 | 261 | 22 | 626635 |
| Δ | 108 | 85 | 82 | 82 | 70 | 14 | 3 | 50111 |
| Improvement % | 13 | 11 | 10 | 10 | 9 | 6 | 16 | 50111 |

cell_tiling3d metrics

Time (wall clock): results within specified thresholds, fold down for details.

| Time (wall clock) (ns) * | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08.x86_64 | 1237 | 1260 | 1264 | 1281 | 1314 | 4017 | 74913 | 589676 |
| Current_run | 1110 | 1140 | 1148 | 1169 | 1251 | 4219 | 95476 | 602537 |
| Δ | -127 | -120 | -116 | -112 | -63 | 202 | 20563 | 12861 |
| Improvement % | 10 | 10 | 9 | 9 | 5 | -5 | -27 | 12861 |

Throughput (# / s): results within specified thresholds, fold down for details.

| Throughput (# / s) (K) | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08.x86_64 | 808 | 794 | 792 | 781 | 761 | 249 | 13 | 589676 |
| Current_run | 901 | 878 | 871 | 856 | 800 | 237 | 10 | 602537 |
| Δ | 93 | 84 | 79 | 75 | 39 | -12 | -3 | 12861 |
| Improvement % | 12 | 11 | 10 | 10 | 5 | -5 | -23 | 12861 |

classic3d metrics

Time (wall clock): results within specified thresholds, fold down for details.

| Time (wall clock) (μs) * | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08.x86_64 | 10 | 10 | 10 | 10 | 10 | 14 | 78 | 94960 |
| Current_run | 10 | 10 | 10 | 10 | 10 | 38 | 125 | 91262 |
| Δ | 0 | 0 | 0 | 0 | 0 | 24 | 47 | -3698 |
| Improvement % | 0 | 0 | 0 | 0 | 0 | -171 | -60 | -3698 |

Throughput (# / s): results within specified thresholds, fold down for details.

| Throughput (# / s) (K) | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08.x86_64 | 102 | 102 | 101 | 100 | 98 | 70 | 13 | 94960 |
| Current_run | 104 | 103 | 103 | 101 | 98 | 26 | 8 | 91262 |
| Δ | 2 | 1 | 2 | 1 | 0 | -44 | -5 | -3698 |
| Improvement % | 2 | 1 | 2 | 1 | 0 | -63 | -38 | -3698 |

classic_tiling3d metrics

Time (wall clock): results within specified thresholds, fold down for details.

| Time (wall clock) (ns) * | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08.x86_64 | 1239 | 1261 | 1264 | 1275 | 1304 | 1489 | 40901 | 614952 |
| Current_run | 1110 | 1142 | 1149 | 1160 | 1230 | 3821 | 41111 | 638164 |
| Δ | -129 | -119 | -115 | -115 | -74 | 2332 | 210 | 23212 |
| Improvement % | 10 | 9 | 9 | 9 | 6 | -157 | -1 | 23212 |

Throughput (# / s): results within specified thresholds, fold down for details.

| Throughput (# / s) (K) | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08.x86_64 | 807 | 793 | 792 | 784 | 767 | 672 | 24 | 614952 |
| Current_run | 901 | 876 | 870 | 862 | 813 | 262 | 24 | 638164 |
| Δ | 94 | 83 | 78 | 78 | 46 | -410 | 0 | 23212 |
| Improvement % | 12 | 10 | 10 | 10 | 6 | -61 | 0 | 23212 |

classic_tiling_fbm3d metrics

Time (wall clock): results within specified thresholds, fold down for details.

| Time (wall clock) (μs) * | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08.x86_64 | 10 | 10 | 10 | 10 | 11 | 29 | 128 | 90948 |
| Current_run | 10 | 10 | 10 | 10 | 10 | 37 | 115 | 89509 |
| Δ | 0 | 0 | 0 | 0 | -1 | 8 | -13 | -1439 |
| Improvement % | 0 | 0 | 0 | 0 | 9 | -28 | 10 | -1439 |

Throughput (# / s): results within specified thresholds, fold down for details.

| Throughput (# / s) (K) | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08.x86_64 | 100 | 99 | 99 | 98 | 95 | 35 | 8 | 90948 |
| Current_run | 102 | 101 | 101 | 99 | 96 | 27 | 9 | 89509 |
| Δ | 2 | 2 | 2 | 1 | 1 | -8 | 1 | -1439 |
| Improvement % | 2 | 2 | 2 | 1 | 1 | -23 | 12 | -1439 |

disk2d metrics

Time (wall clock): results within specified thresholds, fold down for details.

| Time (wall clock) (ms) * | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08.x86_64 | 11 | 11 | 11 | 12 | 12 | 17 | 17 | 88 |
| Current_run | 16 | 16 | 16 | 16 | 16 | 18 | 18 | 63 |
| Δ | 5 | 5 | 5 | 4 | 4 | 1 | 1 | -25 |
| Improvement % | -45 | -45 | -45 | -33 | -33 | -6 | -6 | -25 |

Throughput (# / s): results within specified thresholds, fold down for details.

| Throughput (# / s) (#) | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08.x86_64 | 94 | 91 | 90 | 87 | 84 | 59 | 59 | 88 |
| Current_run | 64 | 63 | 62 | 62 | 61 | 56 | 56 | 63 |
| Δ | -30 | -28 | -28 | -25 | -23 | -3 | -3 | -25 |
| Improvement % | -32 | -31 | -31 | -29 | -27 | -5 | -5 | -25 |

gradient2d metrics

Time (wall clock): results within specified thresholds, fold down for details.

| Time (wall clock) (μs) * | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08.x86_64 | 10 | 10 | 10 | 10 | 10 | 38 | 102 | 88539 |
| Current_run | 10 | 10 | 10 | 10 | 11 | 37 | 104 | 86683 |
| Δ | 0 | 0 | 0 | 0 | 1 | -1 | 2 | -1856 |
| Improvement % | 0 | 0 | 0 | 0 | -10 | 3 | -2 | -1856 |

Throughput (# / s): results within specified thresholds, fold down for details.

| Throughput (# / s) (K) | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08.x86_64 | 102 | 101 | 101 | 100 | 97 | 27 | 10 | 88539 |
| Current_run | 103 | 100 | 99 | 97 | 93 | 27 | 10 | 86683 |
| Δ | 1 | -1 | -2 | -3 | -4 | 0 | 0 | -1856 |
| Improvement % | 1 | -1 | -2 | -3 | -4 | 0 | 0 | -1856 |

gradient3d metrics

Time (wall clock): results within specified thresholds, fold down for details.

| Time (wall clock) (μs) * | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08.x86_64 | 10 | 10 | 10 | 10 | 10 | 28 | 103 | 93008 |
| Current_run | 10 | 10 | 10 | 10 | 10 | 30 | 95 | 93813 |
| Δ | 0 | 0 | 0 | 0 | 0 | 2 | -8 | 805 |
| Improvement % | 0 | 0 | 0 | 0 | 0 | -7 | 8 | 805 |

Throughput (# / s): results within specified thresholds, fold down for details.

| Throughput (# / s) (K) | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|:----------------------------------------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| bdb4ef08.x86_64 | 102 | 102 | 101 | 100 | 98 | 36 | 10 | 93008 |
| Current_run | 103 | 103 | 102 | 101 | 99 | 33 | 11 | 93813 |
| Δ | 1 | 1 | 1 | 1 | 1 | -3 | 1 | 805 |
| Improvement % | 1 | 1 | 1 | 1 | 1 | -8 | 10 | 805 |

heckj commented 3 months ago

Well that's positive at least! Just to check - did you make your own baseline to compare against, or did you compare against the built-in one, because I generated that on an M1, so I wouldn't expect it to be terribly accurate for Intel?

In any case, I thought I'd do a little profiling using the generate-noise (what had previously been in TestNoise) executable and see what hot spots existed, comparing the two from that perspective. I'm not experienced at optimizing at this level, so this'll be a good learning experience for me.

tayloraswift commented 3 months ago

yes, i had created a second baseline locally named bdb4ef08.x86_64

heckj commented 3 months ago

Well, that's reassuring then. It shows the SIMD stuff is actually adding value (at least a little) on Intel, even if it seems to be a pretty annoying regression on Arm. I'll keep looking.

heckj commented 2 months ago

Image showing a comparison of generate-noise - SIMD version (the one that's slightly slower) on top, original code below.

11 vs 15 ms for Math.mult vs SIMD3 &* is the biggest thing that stands out to me, but it seems almost counterintuitive. And since this is with Instruments, I've no explanation (or path to understanding at the moment) for why the same code path is notably faster on Intel hardware but slower on ARM64.

tayloraswift commented 2 months ago

hmm. maybe extract one of the functions that uses Math.mult and look at its Godbolt to see if anything stands out? if i remember correctly during the Swift 3 days, SIMD frequently performed worse than scalar multiplication because LLVM was constantly packing and unpacking the scalars to and from the SIMD registers. but then again, these are not the Swift 3 days anymore so maybe it’s a completely different issue.
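
A minimal sketch of what such an extraction might look like, for pasting into Compiler Explorer with `-O` and comparing the arm64 and x86_64 output where both targets are available. The two functions are hypothetical stand-ins, not the library's actual `Math.mult` or its SIMD replacement. The element type (`Int`) and the use of wrapping multiplication in the tuple version are assumptions made so that both functions compute the same result.

```swift
// Hypothetical tuple-based multiply, standing in for the pre-SIMD code path.
// @inline(never) keeps the function body visible as its own symbol in the
// generated assembly instead of being folded into the caller.
@inline(never)
func multTuple(_ a: (Int, Int, Int), _ b: (Int, Int, Int)) -> (Int, Int, Int) {
    (a.0 &* b.0, a.1 &* b.1, a.2 &* b.2)
}

// Hypothetical SIMD-based multiply, standing in for the converted code path.
// `&*` is the standard library's wrapping element-wise multiply for integer
// SIMD vectors.
@inline(never)
func multSIMD(_ a: SIMD3<Int>, _ b: SIMD3<Int>) -> SIMD3<Int> {
    a &* b
}

// Both versions should agree element by element.
let t = multTuple((1, 2, 3), (4, 5, 6))
let v = multSIMD(SIMD3(1, 2, 3), SIMD3(4, 5, 6))
print(t, v)
```

Things that might be worth looking for in the output are extra moves between general-purpose and vector registers around the multiply, and whether the SIMD version is lowered to vector instructions at all or is scalarized on one of the targets.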

barring that, we could ask for help from a numerics expert but i’m not sure if calling in the cavalry is warranted yet

heckj commented 2 months ago

I'll give it a look after a bit. Mostly I'm just confounded by this unexpected result. Godbolt seems like a really good idea, so I'll try that.