heshpdx closed 1 month ago
I've ported the use-mul-instead-of-div changes to h3o because the 30% speedup was very attractive, but I haven't seen any measurable performance improvement.
Maybe M1 CPUs already have fast division, or LLVM is already doing this optimization under the hood for Rust.
Edit: I cannot reproduce it with this repo's benchmark either, so it must be hardware dependent.
I wasn't able to fully reproduce the reported performance improvements on Linux x64 with GCC, but I'm happy to retest on ARM later.
Edit: I do see performance improving, but by more like 10~15%.
Before
build-master-jul14$ make benchmarks
[ 0%] Formatting sources
[ 0%] Built target format
[ 27%] Built target h3
[ 36%] Built target benchmarkPolygon
-- pointInsideGeoLoopSmall: 0.165765 microseconds per iteration (100000 iterations)
-- pointInsideGeoLoopLarge: 1.832082 microseconds per iteration (100000 iterations)
-- bboxFromGeoLoopSmall: 0.128193 microseconds per iteration (100000 iterations)
-- bboxFromGeoLoopLarge: 1.945774 microseconds per iteration (100000 iterations)
[ 36%] Built target bench_benchmarkPolygon
[ 45%] Built target benchmarkH3Api
-- latLngToCell: 2.400742 microseconds per iteration (10000 iterations)
-- cellToLatLng: 1.018848 microseconds per iteration (10000 iterations)
-- cellToBoundary: 5.000979 microseconds per iteration (10000 iterations)
[ 45%] Built target bench_benchmarkH3Api
[ 54%] Built target benchmarkGridDiskCells
-- gridDisk10: 30.648170 microseconds per iteration (10000 iterations)
-- gridDisk20: 116.188511 microseconds per iteration (10000 iterations)
-- gridDisk30: 274.647540 microseconds per iteration (10000 iterations)
-- gridDisk40: 441.203441 microseconds per iteration (10000 iterations)
-- gridDiskPentagon10: 613.105132 microseconds per iteration (500 iterations)
-- gridDiskPentagon20: 5084.334198 microseconds per iteration (500 iterations)
-- gridDiskPentagon30: 17323.867540 microseconds per iteration (50 iterations)
-- gridDiskPentagon40: 40797.638900 microseconds per iteration (10 iterations)
[ 54%] Built target bench_benchmarkGridDiskCells
[ 54%] Built target benchmarkGridPathCells
-- gridPathCellsNear: 58.487380 microseconds per iteration (10000 iterations)
-- gridPathCellsFar: 2616.719411 microseconds per iteration (1000 iterations)
[ 54%] Built target bench_benchmarkGridPathCells
[ 63%] Built target benchmarkDirectedEdge
-- directedEdgeToBoundary: 14.005060 microseconds per iteration (10000 iterations)
[ 63%] Built target bench_benchmarkDirectedEdge
[ 72%] Built target benchmarkVertex
-- cellToVertexes: 10.162646 microseconds per iteration (10000 iterations)
-- cellToVertexesPent: 0.217632 microseconds per iteration (10000 iterations)
-- cellToVertexesRing: 157.010829 microseconds per iteration (10000 iterations)
-- cellToVertexesRingPent: 154.470410 microseconds per iteration (10000 iterations)
[ 72%] Built target bench_benchmarkVertex
[ 81%] Built target benchmarkIsValidCell
-- pentagonChildren_2_8: 7074.462316 microseconds per iteration (1000 iterations)
-- pentagonChildren_8_14: 8923.350511 microseconds per iteration (1000 iterations)
-- pentagonChildren_8_14_null_2: 5023.494634 microseconds per iteration (1000 iterations)
-- pentagonChildren_8_14_null_10: 8218.255006 microseconds per iteration (1000 iterations)
-- pentagonChildren_8_14_null_100: 8942.472348 microseconds per iteration (1000 iterations)
[ 81%] Built target bench_benchmarkIsValidCell
[ 90%] Built target benchmarkCellsToLinkedMultiPolygon
-- cellsToLinkedMultiPolygonRing2: 108.960790 microseconds per iteration (10000 iterations)
-- cellsToLinkedMultiPolygonDonut: 38.634417 microseconds per iteration (10000 iterations)
-- cellsToLinkedMultiPolygonNestedDonuts: 158.458785 microseconds per iteration (10000 iterations)
[ 90%] Built target bench_benchmarkCellsToLinkedMultiPolygon
[ 90%] Built target benchmarkCellToChildren
-- cellToChildren1: 0.241202 microseconds per iteration (10000 iterations)
-- cellToChildren2: 1.332053 microseconds per iteration (10000 iterations)
-- cellToChildren3: 7.849704 microseconds per iteration (10000 iterations)
-- cellToChildren4: 52.471268 microseconds per iteration (10000 iterations)
-- cellToChildren5: 369.739713 microseconds per iteration (10000 iterations)
[ 90%] Built target bench_benchmarkCellToChildren
[100%] Built target benchmarkPolygonToCells
-- polygonToCellsSF: 4029.634296 microseconds per iteration (500 iterations)
-- polygonToCellsAlameda: 6255.191586 microseconds per iteration (500 iterations)
-- polygonToCellsSouthernExpansion: 188593.924100 microseconds per iteration (10 iterations)
[100%] Built target bench_benchmarkPolygonToCells
[100%] Built target benchmarkPolygonToCellsExperimental
-- polygonToCellsSF_Center: 2265.643132 microseconds per iteration (500 iterations)
-- polygonToCellsSF_Full: 7476.944652 microseconds per iteration (500 iterations)
-- polygonToCellsSF_Overlapping: 8589.903528 microseconds per iteration (500 iterations)
-- polygonToCellsAlameda_Center: 5523.648154 microseconds per iteration (500 iterations)
-- polygonToCellsAlameda_Full: 15981.319740 microseconds per iteration (500 iterations)
-- polygonToCellsAlameda_Overlapping: 20323.545974 microseconds per iteration (500 iterations)
-- polygonToCellsSouthernExpansion_Center: 116890.366500 microseconds per iteration (10 iterations)
-- polygonToCellsSouthernExpansion_Full: 379016.690500 microseconds per iteration (10 iterations)
-- polygonToCellsSouthernExpansion_Overlapping: 590245.006200 microseconds per iteration (10 iterations)
[100%] Built target bench_benchmarkPolygonToCellsExperimental
[100%] Built target benchmarks
After
build-branch-jul14$ make benchmarks
[ 0%] Formatting sources
[ 0%] Built target format
[ 27%] Built target h3
[ 36%] Built target benchmarkPolygon
-- pointInsideGeoLoopSmall: 0.174684 microseconds per iteration (100000 iterations)
-- pointInsideGeoLoopLarge: 1.706215 microseconds per iteration (100000 iterations)
-- bboxFromGeoLoopSmall: 0.113044 microseconds per iteration (100000 iterations)
-- bboxFromGeoLoopLarge: 1.853511 microseconds per iteration (100000 iterations)
[ 36%] Built target bench_benchmarkPolygon
[ 45%] Built target benchmarkH3Api
-- latLngToCell: 2.095765 microseconds per iteration (10000 iterations)
-- cellToLatLng: 1.015881 microseconds per iteration (10000 iterations)
-- cellToBoundary: 4.406268 microseconds per iteration (10000 iterations)
[ 45%] Built target bench_benchmarkH3Api
[ 54%] Built target benchmarkGridDiskCells
-- gridDisk10: 31.002723 microseconds per iteration (10000 iterations)
-- gridDisk20: 115.963878 microseconds per iteration (10000 iterations)
-- gridDisk30: 255.184783 microseconds per iteration (10000 iterations)
-- gridDisk40: 446.646353 microseconds per iteration (10000 iterations)
-- gridDiskPentagon10: 620.174954 microseconds per iteration (500 iterations)
-- gridDiskPentagon20: 5127.692764 microseconds per iteration (500 iterations)
-- gridDiskPentagon30: 17360.673460 microseconds per iteration (50 iterations)
-- gridDiskPentagon40: 41154.405900 microseconds per iteration (10 iterations)
[ 54%] Built target bench_benchmarkGridDiskCells
[ 54%] Built target benchmarkGridPathCells
-- gridPathCellsNear: 59.351578 microseconds per iteration (10000 iterations)
-- gridPathCellsFar: 2677.547189 microseconds per iteration (1000 iterations)
[ 54%] Built target bench_benchmarkGridPathCells
[ 63%] Built target benchmarkDirectedEdge
-- directedEdgeToBoundary: 14.106074 microseconds per iteration (10000 iterations)
[ 63%] Built target bench_benchmarkDirectedEdge
[ 72%] Built target benchmarkVertex
-- cellToVertexes: 9.734607 microseconds per iteration (10000 iterations)
-- cellToVertexesPent: 0.215882 microseconds per iteration (10000 iterations)
-- cellToVertexesRing: 160.913600 microseconds per iteration (10000 iterations)
-- cellToVertexesRingPent: 156.779922 microseconds per iteration (10000 iterations)
[ 72%] Built target bench_benchmarkVertex
[ 81%] Built target benchmarkIsValidCell
-- pentagonChildren_2_8: 7027.019166 microseconds per iteration (1000 iterations)
-- pentagonChildren_8_14: 8806.731603 microseconds per iteration (1000 iterations)
-- pentagonChildren_8_14_null_2: 4965.449012 microseconds per iteration (1000 iterations)
-- pentagonChildren_8_14_null_10: 8126.078029 microseconds per iteration (1000 iterations)
-- pentagonChildren_8_14_null_100: 8706.736355 microseconds per iteration (1000 iterations)
[ 81%] Built target bench_benchmarkIsValidCell
[ 90%] Built target benchmarkCellsToLinkedMultiPolygon
-- cellsToLinkedMultiPolygonRing2: 110.695771 microseconds per iteration (10000 iterations)
-- cellsToLinkedMultiPolygonDonut: 39.187226 microseconds per iteration (10000 iterations)
-- cellsToLinkedMultiPolygonNestedDonuts: 160.627655 microseconds per iteration (10000 iterations)
[ 90%] Built target bench_benchmarkCellsToLinkedMultiPolygon
[ 90%] Built target benchmarkCellToChildren
-- cellToChildren1: 0.211110 microseconds per iteration (10000 iterations)
-- cellToChildren2: 1.388388 microseconds per iteration (10000 iterations)
-- cellToChildren3: 8.871911 microseconds per iteration (10000 iterations)
-- cellToChildren4: 56.922808 microseconds per iteration (10000 iterations)
-- cellToChildren5: 391.073105 microseconds per iteration (10000 iterations)
[ 90%] Built target bench_benchmarkCellToChildren
[100%] Built target benchmarkPolygonToCells
-- polygonToCellsSF: 3899.409916 microseconds per iteration (500 iterations)
-- polygonToCellsAlameda: 6277.127410 microseconds per iteration (500 iterations)
-- polygonToCellsSouthernExpansion: 188710.784900 microseconds per iteration (10 iterations)
[100%] Built target bench_benchmarkPolygonToCells
[100%] Built target benchmarkPolygonToCellsExperimental
-- polygonToCellsSF_Center: 2175.312946 microseconds per iteration (500 iterations)
-- polygonToCellsSF_Full: 7408.483802 microseconds per iteration (500 iterations)
-- polygonToCellsSF_Overlapping: 8448.251498 microseconds per iteration (500 iterations)
-- polygonToCellsAlameda_Center: 5296.558980 microseconds per iteration (500 iterations)
-- polygonToCellsAlameda_Full: 15343.415832 microseconds per iteration (500 iterations)
-- polygonToCellsAlameda_Overlapping: 19566.347054 microseconds per iteration (500 iterations)
-- polygonToCellsSouthernExpansion_Center: 113208.269200 microseconds per iteration (10 iterations)
-- polygonToCellsSouthernExpansion_Full: 363013.989700 microseconds per iteration (10 iterations)
-- polygonToCellsSouthernExpansion_Overlapping: 559297.645200 microseconds per iteration (10 iterations)
[100%] Built target bench_benchmarkPolygonToCellsExperimental
[100%] Built target benchmarks
@isaacbrodsky your benchmark does show an improvement on latLngToCell, from 2.4us to 2.1us. Assuming that's significant and reproducible, it's a 14% perf boost.
Sorry, I was imprecise. I did see performance improvements in many benchmarks, but more on the order of 10~15% rather than the 30% reported.
The benefit is definitely microarchitecture specific: it depends on how the FPU is implemented and on the latency and throughput of individual operations. Also, most CPUs implement "early-out" divides, so if the computation is like {N/1, 0/N, N/N, N<<2, etc.} it doesn't incur the full latency (e.g. if unit tests have a zero dividend, there will be no perf benefit). I just ran "make benchmarks" and pulled a few results that looked significant:
old -- latLngToCell: 2.366658 microseconds per iteration (10000 iterations)
new -- latLngToCell: 1.635445 microseconds per iteration (10000 iterations)
old -- cellToChildren1: 0.404193 microseconds per iteration (10000 iterations)
new -- cellToChildren1: 0.147156 microseconds per iteration (10000 iterations)
old -- cellToChildren2: 1.099871 microseconds per iteration (10000 iterations)
new -- cellToChildren2: 0.750266 microseconds per iteration (10000 iterations)
That's {1.4x, 2.7x, 1.5x}, as measured on my Ampere AltraMax. The 1.3x I cited was from our SPEC CPU input. Thanks for considering this PR.
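For anyone comparing the "x" factors here with the percentages quoted elsewhere in this thread, this is a minimal sketch of the arithmetic. The helper names `speedupFactor` and `percentTimeSaved` are made up for illustration and are not part of H3 or its benchmark harness.

```c
#include <assert.h>

/* Hypothetical helper: how many times faster the new build is (old / new). */
double speedupFactor(double oldMicros, double newMicros) {
    assert(newMicros > 0.0);
    return oldMicros / newMicros;
}

/* Hypothetical helper: share of the old runtime removed, in percent. */
double percentTimeSaved(double oldMicros, double newMicros) {
    assert(oldMicros > 0.0);
    return 100.0 * (oldMicros - newMicros) / oldMicros;
}
```

With the latLngToCell timings above, `speedupFactor(2.366658, 1.635445)` is about 1.45, while `percentTimeSaved` for the same pair is about 31, so a "1.4x" result and a "30% less time" result describe roughly the same change.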
I get similar or even better (40% on cellToLatLng) performance improvements when I test on Linux ARM:
Before
~/oss/h3/build $ make benchmarks
[ 0%] Formatting sources
[ 0%] Built target format
[ 27%] Built target h3
[ 36%] Built target benchmarkPolygon
-- pointInsideGeoLoopSmall: 0.237791 microseconds per iteration (100000 iterations)
-- pointInsideGeoLoopLarge: 1.953805 microseconds per iteration (100000 iterations)
-- bboxFromGeoLoopSmall: 0.221790 microseconds per iteration (100000 iterations)
-- bboxFromGeoLoopLarge: 2.608292 microseconds per iteration (100000 iterations)
[ 36%] Built target bench_benchmarkPolygon
[ 45%] Built target benchmarkH3Api
-- latLngToCell: 6.158289 microseconds per iteration (10000 iterations)
-- cellToLatLng: 3.538159 microseconds per iteration (10000 iterations)
-- cellToBoundary: 16.000204 microseconds per iteration (10000 iterations)
[ 45%] Built target bench_benchmarkH3Api
[ 54%] Built target benchmarkGridDiskCells
-- gridDisk10: 46.712590 microseconds per iteration (10000 iterations)
-- gridDisk20: 172.776119 microseconds per iteration (10000 iterations)
-- gridDisk30: 379.537284 microseconds per iteration (10000 iterations)
-- gridDisk40: 665.536855 microseconds per iteration (10000 iterations)
-- gridDiskPentagon10: 974.917548 microseconds per iteration (500 iterations)
-- gridDiskPentagon20: 7932.902812 microseconds per iteration (500 iterations)
-- gridDiskPentagon30: 27031.574120 microseconds per iteration (50 iterations)
-- gridDiskPentagon40: 65397.877600 microseconds per iteration (10 iterations)
[ 54%] Built target bench_benchmarkGridDiskCells
[ 54%] Built target benchmarkGridPathCells
-- gridPathCellsNear: 67.016416 microseconds per iteration (10000 iterations)
-- gridPathCellsFar: 3043.141366 microseconds per iteration (1000 iterations)
[ 54%] Built target bench_benchmarkGridPathCells
[ 63%] Built target benchmarkDirectedEdge
-- directedEdgeToBoundary: 40.614495 microseconds per iteration (10000 iterations)
[ 63%] Built target bench_benchmarkDirectedEdge
[ 72%] Built target benchmarkVertex
-- cellToVertexes: 13.928412 microseconds per iteration (10000 iterations)
-- cellToVertexesPent: 0.383176 microseconds per iteration (10000 iterations)
-- cellToVertexesRing: 216.126529 microseconds per iteration (10000 iterations)
-- cellToVertexesRingPent: 224.302782 microseconds per iteration (10000 iterations)
[ 72%] Built target bench_benchmarkVertex
[ 81%] Built target benchmarkIsValidCell
-- pentagonChildren_2_8: 13482.154379 microseconds per iteration (1000 iterations)
-- pentagonChildren_8_14: 13888.525799 microseconds per iteration (1000 iterations)
-- pentagonChildren_8_14_null_2: 7786.916335 microseconds per iteration (1000 iterations)
-- pentagonChildren_8_14_null_10: 12766.925168 microseconds per iteration (1000 iterations)
-- pentagonChildren_8_14_null_100: 13777.683675 microseconds per iteration (1000 iterations)
[ 81%] Built target bench_benchmarkIsValidCell
[ 90%] Built target benchmarkCellsToLinkedMultiPolygon
-- cellsToLinkedMultiPolygonRing2: 423.303284 microseconds per iteration (10000 iterations)
-- cellsToLinkedMultiPolygonDonut: 157.237177 microseconds per iteration (10000 iterations)
-- cellsToLinkedMultiPolygonNestedDonuts: 625.338030 microseconds per iteration (10000 iterations)
[ 90%] Built target bench_benchmarkCellsToLinkedMultiPolygon
[ 90%] Built target benchmarkCellToChildren
-- cellToChildren1: 0.244395 microseconds per iteration (10000 iterations)
-- cellToChildren2: 1.357393 microseconds per iteration (10000 iterations)
-- cellToChildren3: 9.080074 microseconds per iteration (10000 iterations)
-- cellToChildren4: 63.147554 microseconds per iteration (10000 iterations)
-- cellToChildren5: 441.493719 microseconds per iteration (10000 iterations)
[ 90%] Built target bench_benchmarkCellToChildren
[100%] Built target benchmarkPolygonToCells
-- polygonToCellsSF: 10539.029034 microseconds per iteration (500 iterations)
-- polygonToCellsAlameda: 14892.152532 microseconds per iteration (500 iterations)
-- polygonToCellsSouthernExpansion: 455600.007400 microseconds per iteration (10 iterations)
[100%] Built target bench_benchmarkPolygonToCells
[100%] Built target benchmarkPolygonToCellsExperimental
-- polygonToCellsSF_Center: 7021.455078 microseconds per iteration (500 iterations)
-- polygonToCellsSF_Full: 26996.973598 microseconds per iteration (500 iterations)
-- polygonToCellsSF_Overlapping: 28265.139666 microseconds per iteration (500 iterations)
-- polygonToCellsAlameda_Center: 13734.053836 microseconds per iteration (500 iterations)
-- polygonToCellsAlameda_Full: 51138.265554 microseconds per iteration (500 iterations)
-- polygonToCellsAlameda_Overlapping: 58866.366632 microseconds per iteration (500 iterations)
-- polygonToCellsSouthernExpansion_Center: 304419.850000 microseconds per iteration (10 iterations)
-- polygonToCellsSouthernExpansion_Full: 1275601.226200 microseconds per iteration (10 iterations)
-- polygonToCellsSouthernExpansion_Overlapping: 1790633.328600 microseconds per iteration (10 iterations)
[100%] Built target bench_benchmarkPolygonToCellsExperimental
[100%] Built target benchmarks
After
~/oss/h3-copy/build $ make benchmarks
[ 0%] Formatting sources
[ 0%] Built target format
[ 27%] Built target h3
[ 36%] Built target benchmarkPolygon
-- pointInsideGeoLoopSmall: 0.242731 microseconds per iteration (100000 iterations)
-- pointInsideGeoLoopLarge: 1.989570 microseconds per iteration (100000 iterations)
-- bboxFromGeoLoopSmall: 0.223000 microseconds per iteration (100000 iterations)
-- bboxFromGeoLoopLarge: 2.658519 microseconds per iteration (100000 iterations)
[ 36%] Built target bench_benchmarkPolygon
[ 45%] Built target benchmarkH3Api
-- latLngToCell: 3.780628 microseconds per iteration (10000 iterations)
-- cellToLatLng: 2.141569 microseconds per iteration (10000 iterations)
-- cellToBoundary: 10.879162 microseconds per iteration (10000 iterations)
[ 45%] Built target bench_benchmarkH3Api
[ 54%] Built target benchmarkGridDiskCells
-- gridDisk10: 46.536392 microseconds per iteration (10000 iterations)
-- gridDisk20: 173.230969 microseconds per iteration (10000 iterations)
-- gridDisk30: 380.076526 microseconds per iteration (10000 iterations)
-- gridDisk40: 666.374863 microseconds per iteration (10000 iterations)
-- gridDiskPentagon10: 980.303592 microseconds per iteration (500 iterations)
-- gridDiskPentagon20: 7948.988960 microseconds per iteration (500 iterations)
-- gridDiskPentagon30: 27231.112900 microseconds per iteration (50 iterations)
-- gridDiskPentagon40: 66191.866500 microseconds per iteration (10 iterations)
[ 54%] Built target bench_benchmarkGridDiskCells
[ 54%] Built target benchmarkGridPathCells
-- gridPathCellsNear: 67.183286 microseconds per iteration (10000 iterations)
-- gridPathCellsFar: 3054.412760 microseconds per iteration (1000 iterations)
[ 54%] Built target bench_benchmarkGridPathCells
[ 63%] Built target benchmarkDirectedEdge
-- directedEdgeToBoundary: 30.176533 microseconds per iteration (10000 iterations)
[ 63%] Built target bench_benchmarkDirectedEdge
[ 72%] Built target benchmarkVertex
-- cellToVertexes: 13.611636 microseconds per iteration (10000 iterations)
-- cellToVertexesPent: 0.385624 microseconds per iteration (10000 iterations)
-- cellToVertexesRing: 212.934427 microseconds per iteration (10000 iterations)
-- cellToVertexesRingPent: 224.648723 microseconds per iteration (10000 iterations)
[ 72%] Built target bench_benchmarkVertex
[ 81%] Built target benchmarkIsValidCell
-- pentagonChildren_2_8: 13472.980062 microseconds per iteration (1000 iterations)
-- pentagonChildren_8_14: 13887.771011 microseconds per iteration (1000 iterations)
-- pentagonChildren_8_14_null_2: 7781.522597 microseconds per iteration (1000 iterations)
-- pentagonChildren_8_14_null_10: 12761.149156 microseconds per iteration (1000 iterations)
-- pentagonChildren_8_14_null_100: 13773.922437 microseconds per iteration (1000 iterations)
[ 81%] Built target bench_benchmarkIsValidCell
[ 90%] Built target benchmarkCellsToLinkedMultiPolygon
-- cellsToLinkedMultiPolygonRing2: 320.794363 microseconds per iteration (10000 iterations)
-- cellsToLinkedMultiPolygonDonut: 124.114011 microseconds per iteration (10000 iterations)
-- cellsToLinkedMultiPolygonNestedDonuts: 492.473339 microseconds per iteration (10000 iterations)
[ 90%] Built target bench_benchmarkCellsToLinkedMultiPolygon
[ 90%] Built target benchmarkCellToChildren
-- cellToChildren1: 0.255307 microseconds per iteration (10000 iterations)
-- cellToChildren2: 1.386753 microseconds per iteration (10000 iterations)
-- cellToChildren3: 9.292348 microseconds per iteration (10000 iterations)
-- cellToChildren4: 64.225439 microseconds per iteration (10000 iterations)
-- cellToChildren5: 443.989882 microseconds per iteration (10000 iterations)
[ 90%] Built target bench_benchmarkCellToChildren
[100%] Built target benchmarkPolygonToCells
-- polygonToCellsSF: 7519.098276 microseconds per iteration (500 iterations)
-- polygonToCellsAlameda: 11145.530170 microseconds per iteration (500 iterations)
-- polygonToCellsSouthernExpansion: 351837.750500 microseconds per iteration (10 iterations)
[100%] Built target bench_benchmarkPolygonToCells
[100%] Built target benchmarkPolygonToCellsExperimental
-- polygonToCellsSF_Center: 4643.820966 microseconds per iteration (500 iterations)
-- polygonToCellsSF_Full: 17948.688888 microseconds per iteration (500 iterations)
-- polygonToCellsSF_Overlapping: 18913.791116 microseconds per iteration (500 iterations)
-- polygonToCellsAlameda_Center: 9732.431998 microseconds per iteration (500 iterations)
-- polygonToCellsAlameda_Full: 34826.282658 microseconds per iteration (500 iterations)
-- polygonToCellsAlameda_Overlapping: 40562.522346 microseconds per iteration (500 iterations)
-- polygonToCellsSouthernExpansion_Center: 209794.639100 microseconds per iteration (10 iterations)
-- polygonToCellsSouthernExpansion_Full: 855543.199300 microseconds per iteration (10 iterations)
-- polygonToCellsSouthernExpansion_Overlapping: 1222980.075300 microseconds per iteration (10 iterations)
[100%] Built target bench_benchmarkPolygonToCellsExperimental
[100%] Built target benchmarks
@heshpdx Thanks for improving the performance here!
This completes the work from #790, where we started the removal of "long double" types.
Additionally, there is an easy performance improvement opportunity in changing some FDIVs into FMULs. On modern CPUs, a divide usually takes 3 to 4 times as long to complete as a multiply, so we can convert the high-impact divide operations by defining literals where the inverse is precomputed. Removing divides from loops has a big impact. I measured a 30% speedup in cellToLatLng and cellToBoundary on my machine. Please see what you can achieve on yours. Thank you!
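To make the idea concrete, here is a minimal sketch of the divide-to-multiply pattern. It is not the actual diff from this PR: the constants `EXAMPLE_DIVISOR` and `EXAMPLE_INV_DIVISOR` and all function names are hypothetical. Under strict IEEE 754 semantics a compiler generally cannot make this substitution itself unless the reciprocal is exactly representable (for example, a power of two), because the two forms can differ in the last bit. That is why the inverse is spelled out as its own literal.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constants for illustration only; not taken from the H3 source. */
#define EXAMPLE_DIVISOR 7.0
/* Constant expression, typically folded by the compiler so that no divide
 * runs inside the loops below. */
#define EXAMPLE_INV_DIVISOR (1.0 / EXAMPLE_DIVISOR)

/* Baseline: one floating-point divide per element. */
void scaleByDivide(const double *in, double *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] / EXAMPLE_DIVISOR;
    }
}

/* Variant: one floating-point multiply by the precomputed inverse per element. */
void scaleByMultiply(const double *in, double *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] * EXAMPLE_INV_DIVISOR;
    }
}

/* Largest relative difference between two result arrays
 * (absolute difference where a[i] is zero). */
double maxRelDiff(const double *a, const double *b, size_t n) {
    assert(a != NULL && b != NULL);
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double diff = a[i] - b[i];
        if (diff < 0.0) diff = -diff;
        double mag = a[i] < 0.0 ? -a[i] : a[i];
        if (mag != 0.0) diff /= mag;
        if (diff > worst) worst = diff;
    }
    return worst;
}
```

The two variants are not guaranteed to be bit-identical, so a check such as `maxRelDiff` against a small tolerance is one way to confirm that the rounding difference is acceptable before switching a hot loop over.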