In this situation there's one backend process (with default config) handling all requests of a single user. This should provide a good sense of the best possible performance we can expect from each implementation.
Metric | Stateless | Native (cold cache) | Native (warm cache) |
---|---|---|---|
itwin.nodes_request.response_time: | |||
...min | 0 | 50 | 23 |
...max | 802 | 10809 | 4609 |
...median | 1 | 7709.8 | 2671 |
...p95 | 713.5 | 10201.2 | 4583.6 |
...p99 | 788.5 | 10617.5 | 4583.6 |
http.requests | 911 | 2018 | 2018 |
vusers.session_length: | |||
...min | 1900.6 | 47225.8 | 16710.9 |
...max | 1900.6 | 47225.8 | 16710.9 |
...median | 1901.1 | 47586.7 | 16819.2 |
...p95 | 1901.1 | 47586.7 | 16819.2 |
...p99 | 1901.1 | 47586.7 | 16819.2 |
stats | *numbers and graphs attached as images (not reproduced here)* ||
The results show that the stateless implementation outperforms the native one even when the latter uses a warm cache. This is mostly due to the smaller number of backend requests it needs to create the hierarchy: the stateless implementation enforces a hard limit on the number of nodes it loads for a single hierarchy level, which enables optimizations such as performing grouping on the frontend rather than the backend.
In addition, the stateless implementation requires far fewer resources (see *Peak Private Bytes*, *I/O Reads* and *I/O Writes* in the stats images).
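To put the gap into perspective, a quick calculation over the `vusers.session_length` medians from the table above (values in milliseconds):

```typescript
// Median vusers.session_length from the table above, in milliseconds
const stateless = 1901.1;
const nativeCold = 47586.7;
const nativeWarm = 16819.2;

// The stateless implementation is roughly 25x faster than native with a
// cold cache and still almost 9x faster than native with a warm cache:
const vsCold = nativeCold / stateless; // ≈ 25.0
const vsWarm = nativeWarm / stateless; // ≈ 8.8
console.log(vsCold.toFixed(1), vsWarm.toFixed(1));
```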
We're comparing how the performance of creating the full Models Tree hierarchy is affected by the number of users creating it simultaneously.
We're checking how the stateless library implementation performs with the default concurrent query config (4 query runner threads).
Metric | 1 user | 2 users | 4 users | 8 users | 16 users | 100 users |
---|---|---|---|---|---|---|
itwin.nodes_request.response_time: | ||||||
...min | 0 | 0 | 0 | 0 | 0 | 0 |
...max | 802 | 1532 | 3225 | 6817 | 16515 | 153334 |
...median | 1 | 1 | 1 | 1 | 1 | 1 |
...p95 | 713.5 | 1408.4 | 2618.1 | 5378.9 | 13230.3 | 131971.7 |
...p99 | 788.5 | 1495.5 | 3011.6 | 6187.2 | 14917.2 | 145851.8 |
http.requests | 911 | 1822 | 3644 | 7288 | 14576 | 91100 |
itwin.query_rows.response_time: | ||||||
...min | 5 | 4 | 4 | 5 | 8 | 31 |
...max | 143 | 219 | 362 | 620 | 1403 | 13403 |
...median | 12.1 | 25.8 | 51.9 | 106.7 | 257.3 | 2186.8 |
...p95 | 43.4 | 83.9 | 102.5 | 198.4 | 632.8 | 5378.9 |
...p99 | 104.6 | 156 | 186.8 | 518.1 | 1200.1 | 9801.2 |
itwin.schema_json.response_time: | ||||||
...min | 2 | 2 | 2 | 2 | 4 | 185 |
...max | 75 | 116 | 151 | 428 | 974 | 12509 |
...median | 13.9 | 16.9 | 29.1 | 71.5 | 278.7 | 3262.4 |
...p95 | 24.8 | 104.6 | 130.3 | 320.6 | 620.3 | 9607.1 |
...p99 | 24.8 | 104.6 | 147 | 424.2 | 804.5 | 11274.1 |
vusers.completed | 1 | 2 | 4 | 8 | 16 | 100 |
vusers.session_length: | ||||||
...min | 1900.6 | 3631.3 | 5689.6 | 12396 | 29446.7 | 258792.5 |
...max | 1900.6 | 3640.3 | 6058.8 | 12713.3 | 30327.5 | 265409.3 |
...median | 1901.1 | 3605.5 | 5944.6 | 12459.8 | 30040.3 | 260502 |
...p95 | 1901.1 | 3605.5 | 6064.7 | 12711.5 | 30040.3 | 265764.6 |
...p99 | 1901.1 | 3605.5 | 6064.7 | 12711.5 | 30040.3 | 265764.6 |
The results show that doubling the user count nearly doubles the time it takes to create the hierarchy.
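That near-linear degradation is visible directly in the `vusers.session_length` medians from the table; a quick check:

```typescript
// Median vusers.session_length (ms) for 1, 2, 4, 8 and 16 users,
// taken from the table above
const medians = [1901.1, 3605.5, 5944.6, 12459.8, 30040.3];

// Ratio between consecutive steps - each doubling of the user count
// multiplies the session length by roughly 1.6-2.4:
const ratios = medians.slice(1).map((m, i) => m / medians[i]);
console.log(ratios.map((r) => r.toFixed(2))); // ≈ 1.90, 1.65, 2.10, 2.41
```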
Metric | 1 user | 2 users | 4 users | 8 users | 16 users |
---|---|---|---|---|---|
itwin.nodes_request.response_time: | |||||
...min | 29 | 31 | 32 | 27 | 38 |
...max | 14557 | 28824 | 44401 | 148789 | 283461 |
...median | 12968.3 | 18220 | 30040.3 | 61717.2 | 86710.4 |
...p95 | 14048.5 | 27181.5 | 42205.5 | 84993.4 | 164448.1 |
...p99 | 14332.3 | 28290.8 | 43058.1 | 90249.2 | 189161.2 |
http.requests | 2019 | 4039 | 8083 | 16261 | 32678 |
vusers.completed | 1 | 2 | 4 | 8 | 16 |
vusers.session_length: | |||||
...min | 56008.5 | 86025.5 | 116437.6 | 223493.6 | 292600 |
...max | 56008.5 | 86097.9 | 119656.8 | 236531.4 | 431296.9 |
...median | 55843.8 | 86710.4 | 117048 | 231043.6 | 412660.7 |
...p95 | 55843.8 | 86710.4 | 117048 | 235711.1 | 429502.3 |
...p99 | 55843.8 | 86710.4 | 117048 | 235711.1 | 429502.3 |
Similar to the stateless implementation, the performance degrades as the number of users grows.
Because scalability is being tested on a single machine with a limited number of cores, we want to reduce the number of worker threads used for running queries, so we can see how increasing the number of processes affects performance. Otherwise, the additional processes have to share the same processor cores and increasing the process count won't have the desired effect. The minimum number of query runner threads is 2, so we're using that.
In a real-world application the backend processes could run on separate machines, so we'd get the benefits of both a large number of threads per process and a large number of processes, without them having to share the same resources.
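For illustration only, backend startup could look roughly like this - assuming the iTwin.js `IModelHostConfiguration` exposes a concurrent-query worker thread option (the option name here is an assumption and may differ between versions):

```typescript
// Hypothetical sketch - `concurrentQuery.workerThreads` is assumed here,
// not taken from the actual test setup.
import { IModelHost, IModelHostConfiguration } from "@itwin/core-backend";

async function startBackend(): Promise<void> {
  const config = new IModelHostConfiguration();
  // Default is 4 worker threads; the scalability runs use 2 so that adding
  // backend processes isn't starved by core contention on a single machine.
  (config as any).concurrentQuery = { workerThreads: 2 };
  await IModelHost.startup(config);
}
```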
Metric | 1 user | 2 users | 4 users | 8 users | 16 users | 100 users |
---|---|---|---|---|---|---|
itwin.nodes_request.response_time: | ||||||
...min | 0 | 0 | 0 | 0 | 0 | 0 |
...max | 1080 | 2194 | 3636 | 6820 | 19499 | 183221 |
...median | 1 | 2 | 1 | 1 | 1 | 1 |
...p95 | 963.1 | 1978.7 | 3134.5 | 5487.5 | 15526 | 140132.7 |
...p99 | 1043.3 | 2101.1 | 3464.1 | 6187.2 | 17505.6 | 157999.8 |
http.requests | 911 | 1822 | 3644 | 7288 | 14576 | 91100 |
itwin.query_rows.response_time: | ||||||
...min | 7 | 5 | 5 | 4 | 9 | 40 |
...max | 561 | 418 | 442 | 799 | 2200 | 13000 |
...median | 18 | 37 | 62.2 | 108.9 | 295.9 | 2566.3 |
...p95 | 48.9 | 90.9 | 113.3 | 214.9 | 854.2 | 6976.1 |
...p99 | 247.2 | 247.2 | 242.3 | 450.4 | 1436.8 | 9801.2 |
itwin.schema_json.response_time: | ||||||
...min | 3 | 3 | 3 | 2 | 3 | 12 |
...max | 276 | 227 | 271 | 549 | 1023 | 12011 |
...median | 10.9 | 18 | 26.8 | 66 | 361.5 | 3534.1 |
...p95 | 45.2 | 172.5 | 135.7 | 295.9 | 788.5 | 9607.1 |
...p99 | 45.2 | 172.5 | 179.5 | 528.6 | 925.4 | 10832 |
vusers.completed | 1 | 2 | 4 | 8 | 16 | 100 |
vusers.session_length: | ||||||
...min | 3356.6 | 4547.4 | 6962.8 | 12177.2 | 35856.4 | 293976.8 |
...max | 3356.6 | 4767.3 | 7172.5 | 12509.9 | 37057 | 309541.8 |
...median | 3328.3 | 4583.6 | 7117 | 12213.1 | 36691.5 | 305703.5 |
...p95 | 3328.3 | 4583.6 | 7117 | 12459.8 | 36691.5 | 311879.3 |
...p99 | 3328.3 | 4583.6 | 7117 | 12459.8 | 36691.5 | 311879.3 |
Interestingly, compared to 4 query runner threads, performance is much worse in the 1-2 user cases, but the difference shrinks as more users are added.
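Comparing the `vusers.session_length` medians of this 2-thread run against the earlier 4-thread run makes that trend explicit:

```typescript
// Median vusers.session_length (ms), single backend process,
// for 1, 2, 16 and 100 users - values taken from the two tables
const fourThreads = [1901.1, 3605.5, 30040.3, 260502];
const twoThreads = [3328.3, 4583.6, 36691.5, 305703.5];

// Relative slowdown of the 2-thread config shrinks as users are added -
// under heavy load the query threads are saturated either way:
const slowdowns = fourThreads.map((m, i) => twoThreads[i] / m);
console.log(slowdowns.map((s) => s.toFixed(2))); // ≈ 1.75, 1.27, 1.22, 1.17
```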
Metric | 1 user | 2 users | 4 users | 8 users | 16 users | 100 users |
---|---|---|---|---|---|---|
itwin.nodes_request.response_time: | ||||||
...min | 0 | 0 | 0 | 0 | 0 | 0 |
...max | 718 | 1087 | 2112 | 5739 | 12850 | 127857 |
...median | 1 | 2 | 2 | 2 | 1 | 1 |
...p95 | 620.3 | 963.1 | 1826.6 | 4867 | 10407.3 | 99741.2 |
...p99 | 699.4 | 1022.7 | 2059.5 | 5378.9 | 11971.2 | 114730.2 |
http.requests | 911 | 1822 | 3644 | 7288 | 14576 | 91100 |
itwin.query_rows.response_time: | ||||||
...min | 3 | 4 | 4 | 4 | 5 | 11 |
...max | 115 | 253 | 600 | 1313 | 4574 | 8753 |
...median | 10.9 | 18 | 32.1 | 92.8 | 206.5 | 1790.4 |
...p95 | 37 | 40.9 | 98.5 | 232.8 | 278.7 | 4316.6 |
...p99 | 80.6 | 149.9 | 284.3 | 459.5 | 561.2 | 5168 |
itwin.schema_json.response_time: | ||||||
...min | 3 | 2 | 3 | 2 | 3 | 29 |
...max | 90 | 219 | 375 | 422 | 827 | 6265 |
...median | 6 | 7 | 13.9 | 30.3 | 90.9 | 2143.5 |
...p95 | 16.9 | 183.1 | 149.9 | 262.5 | 376.2 | 4770.6 |
...p99 | 16.9 | 183.1 | 333.7 | 415.8 | 497.8 | 5487.5 |
vusers.completed | 1 | 2 | 4 | 8 | 16 | 100 |
vusers.session_length: | ||||||
...min | 1713.2 | 2686.4 | 4458.2 | 10242.3 | 19823.8 | 190318.8 |
...max | 1713.2 | 2918 | 5286.4 | 11229.4 | 22866.6 | 206870.9 |
...median | 1720.2 | 2671 | 5065.6 | 10617.5 | 21813.5 | 200858.7 |
...p95 | 1720.2 | 2671 | 5168 | 11050.8 | 22703.7 | 204916.5 |
...p99 | 1720.2 | 2671 | 5168 | 11050.8 | 22703.7 | 204916.5 |
Introducing an additional backend process substantially improved performance over a single backend process, and even shows noticeably better results than 1 backend process with 4 query runner threads (the total number of query runner threads being the same). This could be explained by 2 processes being able to do (de)serialization work in parallel, while a single backend process has to do it serially. Overall, we still see substantial performance degradation as the user count increases.
Metric | 1 user | 2 users | 4 users | 8 users | 16 users | 100 users |
---|---|---|---|---|---|---|
itwin.nodes_request.response_time: | ||||||
...min | 0 | 0 | 0 | 0 | 0 | 0 |
...max | 898 | 1112 | 2058 | 5186 | 11008 | 85060 |
...median | 1 | 1 | 2 | 2 | 1 | 1 |
...p95 | 788.5 | 1022.7 | 1755 | 3984.7 | 8692.8 | 64236 |
...p99 | 889.1 | 1085.9 | 1978.7 | 4583.6 | 9801.2 | 75382 |
http.requests | 911 | 1822 | 3644 | 7288 | 14576 | 91100 |
itwin.query_rows.response_time: | ||||||
...min | 5 | 4 | 6 | 5 | 5 | 6 |
...max | 159 | 285 | 402 | 825 | 2361 | 7244 |
...median | 13.9 | 18 | 32.1 | 73 | 169 | 1153.1 |
...p95 | 34.1 | 36.2 | 96.6 | 179.5 | 333.7 | 3752.7 |
...p99 | 96.6 | 165.7 | 301.9 | 327.1 | 713.5 | 4867 |
itwin.schema_json.response_time: | ||||||
...min | 2 | 2 | 3 | 2 | 3 | 84 |
...max | 133 | 263 | 285 | 658 | 1275 | 6911 |
...median | 7.9 | 7.9 | 15 | 27.9 | 77.5 | 2416.8 |
...p95 | 37.7 | 228.2 | 257.3 | 407.5 | 497.8 | 4965.3 |
...p99 | 37.7 | 228.2 | 278.7 | 620.3 | 1130.2 | 5711.5 |
vusers.completed | 1 | 2 | 4 | 8 | 16 | 100 |
vusers.session_length: | ||||||
...min | 2189 | 2737.2 | 4842.5 | 8130.9 | 17726.8 | 151280.1 |
...max | 2189 | 2835.1 | 5105.4 | 9991.2 | 19637.5 | 159771.2 |
...median | 2186.8 | 2725 | 4867 | 9230.4 | 19346.7 | 154871.1 |
...p95 | 2186.8 | 2725 | 4965.3 | 9801.2 | 19737.6 | 157999.8 |
...p99 | 2186.8 | 2725 | 4965.3 | 9801.2 | 19737.6 | 157999.8 |
Doubling the backend process count to 4, we see that performance for 1-4 users doesn't change, but is noticeably better with a larger number of users.
Metric | 1 user | 2 users | 4 users | 8 users | 16 users | 100 users |
---|---|---|---|---|---|---|
itwin.nodes_request.response_time: | ||||||
...min | 0 | 0 | 0 | 0 | 0 | 0 |
...max | 848 | 1195 | 1955 | 4329 | 9363 | 77186 |
...median | 1 | 2 | 2 | 2 | 1 | 1 |
...p95 | 742.6 | 1043.3 | 1720.2 | 3328.3 | 7260.8 | 59297.1 |
...p99 | 820.7 | 1153.1 | 1826.6 | 3752.7 | 8352 | 66857.6 |
http.requests | 911 | 1822 | 3644 | 7288 | 14576 | 91100 |
itwin.query_rows.response_time: | ||||||
...min | 4 | 4 | 5 | 4 | 5 | 9 |
...max | 172 | 211 | 696 | 1074 | 3995 | 5947 |
...median | 13.1 | 19.1 | 23.8 | 50.9 | 135.7 | 1107.9 |
...p95 | 23.8 | 41.7 | 135.7 | 232.8 | 432.7 | 2951.9 |
...p99 | 117.9 | 159.2 | 295.9 | 407.5 | 742.6 | 3678.4 |
itwin.schema_json.response_time: | ||||||
...min | 3 | 3 | 2 | 3 | 7 | 41 |
...max | 94 | 139 | 427 | 693 | 1588 | 4701 |
...median | 7 | 7.9 | 12.1 | 30.3 | 83.9 | 1686.1 |
...p95 | 19.1 | 135.7 | 242.3 | 399.5 | 1002.4 | 3197.8 |
...p99 | 19.1 | 135.7 | 340.4 | 645.6 | 1408.4 | 3752.7 |
vusers.completed | 1 | 2 | 4 | 8 | 16 | 100 |
vusers.session_length: | ||||||
...min | 1929.9 | 2808.3 | 5061 | 8212.3 | 16228.9 | 108969.4 |
...max | 1929.9 | 2871.2 | 5124.1 | 8869.3 | 19919.2 | 136988.6 |
...median | 1939.5 | 2836.2 | 5065.6 | 8692.8 | 18588.1 | 131971.7 |
...p95 | 1939.5 | 2836.2 | 5065.6 | 8868.4 | 19737.6 | 137357.8 |
...p99 | 1939.5 | 2836.2 | 5065.6 | 8868.4 | 19737.6 | 137357.8 |
Doubling the backend process count to 8, we see that performance for 1-16 users doesn't change, but is slightly better with 100 users.
Here we take the 100-user cases from the previous sections' results to better see how they change as the number of backend processes increases.
Metric | 1 backend | 2 backends | 4 backends | 8 backends |
---|---|---|---|---|
itwin.nodes_request.response_time: | ||||
...min | 0 | 0 | 0 | 0 |
...max | 183221 | 127857 | 85060 | 77186 |
...median | 1 | 1 | 1 | 1 |
...p95 | 140132.7 | 99741.2 | 64236 | 59297.1 |
...p99 | 157999.8 | 114730.2 | 75382 | 66857.6 |
http.requests | 91100 | 91100 | 91100 | 91100 |
itwin.query_rows.response_time: | ||||
...min | 40 | 11 | 6 | 9 |
...max | 13000 | 8753 | 7244 | 5947 |
...median | 2566.3 | 1790.4 | 1153.1 | 1107.9 |
...p95 | 6976.1 | 4316.6 | 3752.7 | 2951.9 |
...p99 | 9801.2 | 5168 | 4867 | 3678.4 |
itwin.schema_json.response_time: | ||||
...min | 12 | 29 | 84 | 41 |
...max | 12011 | 6265 | 6911 | 4701 |
...median | 3534.1 | 2143.5 | 2416.8 | 1686.1 |
...p95 | 9607.1 | 4770.6 | 4965.3 | 3197.8 |
...p99 | 10832 | 5487.5 | 5711.5 | 3752.7 |
vusers.completed | 100 | 100 | 100 | 100 |
vusers.session_length: | ||||
...min | 293976.8 | 190318.8 | 151280.1 | 108969.4 |
...max | 309541.8 | 206870.9 | 159771.2 | 136988.6 |
...median | 305703.5 | 200858.7 | 154871.1 | 131971.7 |
...p95 | 311879.3 | 204916.5 | 157999.8 | 137357.8 |
...p99 | 311879.3 | 204916.5 | 157999.8 | 137357.8 |
The table clearly shows that scaling the backend does help - all the times get reduced. And, as mentioned earlier, the effect should be even greater when the processes run on different machines.
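The same medians also show that the gains diminish with each doubling - expected, given that all processes share one machine:

```typescript
// Median vusers.session_length (ms) for 100 users, from the table above,
// for 1, 2, 4 and 8 backend processes
const medianByBackends = [305703.5, 200858.7, 154871.1, 131971.7];

// Speedup from each doubling of the backend process count:
const speedups = medianByBackends
  .slice(1)
  .map((m, i) => medianByBackends[i] / m);
console.log(speedups.map((s) => s.toFixed(2))); // ≈ 1.52, 1.30, 1.17
```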
In this section we're comparing how performance of the native implementation is affected by scaling the backend for 16 users.
Metric | 1 backend | 2 backends | 4 backends |
---|---|---|---|
itwin.nodes_request.response_time: | |||
...min | 38 | 151 | 102 |
...max | 283461 | 321151 | 366406 |
...median | 86710.4 | 70992 | 75382 |
...p95 | 164448.1 | 137357.8 | 140132.7 |
...p99 | 189161.2 | 148798.3 | 154871.1 |
http.requests | 32678 | 32791 | 32692 |
vusers.completed | 16 | 16 | 16 |
vusers.session_length: | |||
...min | 292600 | 324991.4 | 332269.2 |
...max | 431296.9 | 407962.6 | 408147.4 |
...median | 412660.7 | 396479.5 | 404489.2 |
...p95 | 429502.3 | 404489.2 | 404489.2 |
...p99 | 429502.3 | 404489.2 | 404489.2 |
The table clearly shows that increasing the number of backend processes has little to no effect on performance for the end users.
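The flatness is easy to quantify from the session-length medians above:

```typescript
// Median vusers.session_length (ms) for 16 users with the native
// implementation, for 1, 2 and 4 backend processes
const nativeMedians = [412660.7, 396479.5, 404489.2];

// The best configuration is only ~4% faster than the single-backend one:
const bestGain = nativeMedians[0] / Math.min(...nativeMedians);
console.log(bestGain.toFixed(2)); // ≈ 1.04
```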
The tests mimic the Models Tree initial load tests that are run as part of the iTwin Platform visualization performance test suite. The goal is to be under 5 seconds, and we see that this is mostly achieved, except for a couple of cases where the native implementation (in a cold cache situation) doesn't achieve it either.
Overall, the native implementation in warm cache situations is the fastest, but that's not guaranteed - end users see a mix of cold and warm cache situations. The stateless implementation, on the other hand, doesn't involve any caches, so its performance is expected to be more stable. Comparing the stateless implementation against the native one with a cold cache, the stateless one performs better in all but a few cases.
iModel | Stateless | Native (cold cache) | Native (warm cache) |
---|---|---|---|
S - 1 | 159.2 | 232.8 | 71.5 |
S - 2 | 149.9 | 295.9 | 79.1 |
S - 3 | 159.2 | 290.1 | 106.7 |
S - 4 | 159.2 | 295.9 | 87.4 |
S - 5 | 156 | 361.5 | 125.2 |
S - 6 | 169 | 441.5 | 111.1 |
S - 7 | 147 | 584.2 | 102.5 |
M - 1 | 156 | 242.3 | 71.5 |
M - 2 | 172.5 | 383.8 | 111.1 |
M - 3 | 169 | 333.7 | 82.3 |
M - 4 | 273.2 | 497.8 | 79.1 |
M - 5 | 347.3 | 889.1 | 104.6 |
M - 6 | 156 | 262.5 | 73 |
M - 7 | 1224.4 | 2416.8 | 290.1 |
M - 8 | 165.7 | 301.9 | 87.4 |
M - 9 | 159.2 | 262.5 | 83.9 |
M - 10 | 175.9 | 327.1 | 79.1 |
M - 11 | 165.7 | 407.5 | 102.5 |
M - 12 | 156 | 383.8 | 85.6 |
M - 13 | 159.2 | 320.6 | 79.1 |
M - 14 | 214.9 | 528.6 | 90.9 |
M - 15 | 214.9 | 278.7 | 71.5 |
L - 1 | 6838 | 6312.2 | 198.4 |
L - 2 | 1408.4 | 1130.2 | 111.1 |
L - 3 | 4147.4 | 4231.1 | 90.9 |
L - 4 | 1380.5 | 1436.8 | 115.6 |
L - 5 | 699.4 | 459.5 | 80.6 |
L - 6 | 632.8 | 528.6 | 102.5 |
L - 7 | 172.5 | 424.2 | 77.5 |
L - 8 | 314.2 | 788.5 | 113.3 |
L - 9 | 327.1 | 742.6 | 144 |
L - 10 | 262.5 | 713.5 | 106.7 |
XL - 1 | 135.7 | 399.5 | 98.5 |
XL - 2 | 820.7 | 1274.3 | 133 |
XL - 3 | 671.9 | 742.6 | 82.3 |
XL - 4 | 3134.5 | 1495.5 | 144 |
XL - 5 | 561.2 | 2893.5 | 273.2 |
XL - 6 | 36691.5 | 24594.7 | 4065.2 |
XL - 7 | 478.3 | 1587.9 | 295.9 |
The stateless implementation not only shows an order of magnitude better performance when creating the full Models Tree hierarchy, but also outperforms the native implementation with a cold cache in the majority of Models Tree initial load test cases.
Furthermore, scalability tests show that the stateless implementation scales much better, substantially improving request performance as the number of backend processes grows. The native implementation, on the other hand, shows nearly no improvement.
The stateless implementation outperforms the native one because of:
The load test results have multiple metrics, the most important of which, I believe, are `itwin.nodes_request.response_time` and `vusers.session_length`. The stateless implementation makes two types of requests when creating nodes - "query rows" and "schema json" - so it's also interesting to see how long they take (represented by the `itwin.query_rows.response_time` and `itwin.schema_json.response_time` metrics).

Nodes response time (`itwin.nodes_request.response_time`) shows how long a user has to wait to get the child nodes for a parent node. In the case of the stateless implementation this is different from just making a request, because creating nodes may involve making more than one request plus additional post-processing (sorting, grouping, etc.). In the case of the native implementation everything's done on the backend, so the timings for making the request nearly match the timings of creating the nodes.

Virtual user session length (`vusers.session_length`) shows how long it takes a user to load the whole hierarchy.

Ideally, we want these metrics to be affected as little as possible by adding more virtual users.
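The min/max/median/p95/p99 rows in the tables are aggregates over these per-request samples. As an illustration (not necessarily Artillery's exact algorithm - its reported percentiles are approximations), a minimal nearest-rank percentile looks like this:

```typescript
// Nearest-rank percentile - an illustrative way to derive the p95/p99
// figures from raw response-time samples.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank - 1, 0)];
}

// Hypothetical response-time samples (ms):
const times = [5, 12, 12, 13, 14, 20, 43, 50, 104, 143];
console.log(percentile(times, 50), percentile(times, 95)); // 14 143
```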