iTwin / presentation

Monorepo for iTwin.js Presentation Library
https://www.itwinjs.org/presentation/

Proof of concept for stateless presentation-driven tree data provider #6

Closed · grigasp closed this issue 1 year ago

grigasp commented 1 year ago

The load test results have multiple metrics, the most important of which, I believe, are itwin.nodes_request.response_time and vusers.session_length. The stateless implementation makes two types of requests when creating nodes: "query rows" and "schema json", so it's also interesting to see how long those take (represented by the itwin.query_rows.response_time and itwin.schema_json.response_time metrics). All timings below are in milliseconds.

Nodes response time (itwin.nodes_request.response_time) shows how long a user has to wait to get the child nodes for a parent. In case of the stateless implementation this is different from just making a request, because creating nodes may involve making more than one request plus additional post-processing (sorting, grouping, etc.). In case of the native implementation, everything's done on the backend, so the timings for making the request nearly match the timings of creating the nodes.

Virtual user session length (vusers.session_length) shows how long it takes a user to load the full hierarchy.

Ideally, we want these metrics to be affected as little as possible by adding more virtual users.
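
As a side note, the custom itwin.* metrics suggest the tests are driven by a load testing tool with custom metric hooks (the vusers.* and http.* names match Artillery's built-in metrics). Here's a minimal sketch of how such a custom histogram metric could be emitted from an Artillery processor function; `loadChildNodes` and the wiring are hypothetical, not the actual test code:

```ts
// Hypothetical Artillery processor sketch: times the "create child nodes" step
// and reports it as the custom `itwin.nodes_request.response_time` histogram.
import type { EventEmitter } from "node:events";

// Stand-in for the tree data provider under test.
async function loadChildNodes(parentId?: string): Promise<unknown[]> {
  // ...make "query rows" / "schema json" requests, post-process, etc.
  return [];
}

export async function createNodes(
  context: { vars: Record<string, unknown> },
  events: EventEmitter,
): Promise<void> {
  const start = performance.now();
  const nodes = await loadChildNodes(); // root level
  // Artillery aggregates `histogram` events into the min/max/median/p95/p99
  // values seen in the result tables.
  events.emit("histogram", "itwin.nodes_request.response_time", performance.now() - start);
  context.vars.rootNodes = nodes;
}
```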

grigasp commented 1 year ago

Best case scenario

In this scenario a single backend process (with the default configuration) handles all requests of a single user. This should give a good sense of the best possible performance we can expect from each implementation.

| Metric | Stateless | Native (cold cache) | Native (warm cache) |
| --- | ---: | ---: | ---: |
| **itwin.nodes_request.response_time:** | | | |
| ...min | 0 | 50 | 23 |
| ...max | 802 | 10809 | 4609 |
| ...median | 1 | 7709.8 | 2671 |
| ...p95 | 713.5 | 10201.2 | 4583.6 |
| ...p99 | 788.5 | 10617.5 | 4583.6 |
| **http.requests** | 911 | 2018 | 2018 |
| **vusers.session_length:** | | | |
| ...min | 1900.6 | 47225.8 | 16710.9 |
| ...max | 1900.6 | 47225.8 | 16710.9 |
| ...median | 1901.1 | 47586.7 | 16819.2 |
| ...p95 | 1901.1 | 47586.7 | 16819.2 |
| ...p99 | 1901.1 | 47586.7 | 16819.2 |
| **stats (attached images):** | | | |
| ...numbers | 1proc_stateless_metrics_2 | 1proc_native_metrics_2 | 1proc_native_warm_metrics |
| ...graphs | 1proc_stateless_graph_2 | 1proc_native_graph_2 | 1proc_native_warm_graph |

The results show that the stateless implementation outperforms the native one even with a warm cache. This is mostly due to the smaller number of requests it has to make to create the hierarchy: the stateless implementation is built with a hard limit on the number of nodes it loads for a single hierarchy level, which allows optimizations such as performing grouping on the frontend rather than the backend, reducing the number of requests made against the backend.

In addition, the stateless implementation requires far fewer resources (see Peak Private Bytes, I/O Reads and I/O Writes in the stats images).
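
To illustrate the frontend grouping mentioned above, here's a minimal sketch, assuming a hierarchy level capped at some maximum row count so all of its rows can be fetched with a single "query rows" request and grouped in memory; all names and the cap value are hypothetical, not the library's actual code:

```ts
// Hypothetical sketch: because a hierarchy level is capped at MAX_LEVEL_SIZE
// rows, they can all be pulled with one "query rows" request and grouped on
// the frontend, instead of asking the backend to group them.
const MAX_LEVEL_SIZE = 1000; // illustrative cap, not the library's actual value

interface InstanceRow { id: string; classLabel: string; label: string; }
interface GroupingNode { label: string; children: InstanceRow[]; }

function groupByClass(rows: InstanceRow[]): GroupingNode[] {
  if (rows.length > MAX_LEVEL_SIZE)
    throw new Error("Hierarchy level too large - expected the query to be capped");
  const groups = new Map<string, InstanceRow[]>();
  for (const row of rows) {
    const group = groups.get(row.classLabel) ?? [];
    group.push(row);
    groups.set(row.classLabel, group);
  }
  // One grouping node per class, sorted by label - also done on the frontend.
  return [...groups.entries()]
    .map(([label, children]) => ({ label, children }))
    .sort((a, b) => a.label.localeCompare(b.label));
}
```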

grigasp commented 1 year ago

Load tests

We're comparing how the performance of creating the full Models tree hierarchy is affected by the number of users creating it simultaneously.

Stateless implementation

We're checking how the stateless library implementation performs with the default concurrent query configuration (4 query runner threads).
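
Conceptually, each virtual user's session is a full depth-first expansion of the Models tree. A simplified sketch of the per-user flow, with `loadChildNodes` as a hypothetical stand-in for either implementation:

```ts
// Hypothetical per-virtual-user scenario: recursively expand the whole
// Models tree and measure the total session length.
interface TreeNode { id: string; hasChildren: boolean; }

// Stand-in for a nodes request against the implementation under test.
async function loadChildNodes(parentId?: string): Promise<TreeNode[]> {
  return [];
}

async function expandAll(parentId?: string): Promise<void> {
  for (const child of await loadChildNodes(parentId)) {
    if (child.hasChildren)
      await expandAll(child.id);
  }
}

async function runUserSession(): Promise<number> {
  const start = performance.now();
  await expandAll(); // start from the root
  return performance.now() - start; // what vusers.session_length measures
}
```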

| Metric | 1 user | 2 users | 4 users | 8 users | 16 users | 100 users |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| **itwin.nodes_request.response_time:** | | | | | | |
| ...min | 0 | 0 | 0 | 0 | 0 | 0 |
| ...max | 802 | 1532 | 3225 | 6817 | 16515 | 153334 |
| ...median | 1 | 1 | 1 | 1 | 1 | 1 |
| ...p95 | 713.5 | 1408.4 | 2618.1 | 5378.9 | 13230.3 | 131971.7 |
| ...p99 | 788.5 | 1495.5 | 3011.6 | 6187.2 | 14917.2 | 145851.8 |
| **http.requests** | 911 | 1822 | 3644 | 7288 | 14576 | 91100 |
| **itwin.query_rows.response_time:** | | | | | | |
| ...min | 5 | 4 | 4 | 5 | 8 | 31 |
| ...max | 143 | 219 | 362 | 620 | 1403 | 13403 |
| ...median | 12.1 | 25.8 | 51.9 | 106.7 | 257.3 | 2186.8 |
| ...p95 | 43.4 | 83.9 | 102.5 | 198.4 | 632.8 | 5378.9 |
| ...p99 | 104.6 | 156 | 186.8 | 518.1 | 1200.1 | 9801.2 |
| **itwin.schema_json.response_time:** | | | | | | |
| ...min | 2 | 2 | 2 | 2 | 4 | 185 |
| ...max | 75 | 116 | 151 | 428 | 974 | 12509 |
| ...median | 13.9 | 16.9 | 29.1 | 71.5 | 278.7 | 3262.4 |
| ...p95 | 24.8 | 104.6 | 130.3 | 320.6 | 620.3 | 9607.1 |
| ...p99 | 24.8 | 104.6 | 147 | 424.2 | 804.5 | 11274.1 |
| **vusers.completed** | 1 | 2 | 4 | 8 | 16 | 100 |
| **vusers.session_length:** | | | | | | |
| ...min | 1900.6 | 3631.3 | 5689.6 | 12396 | 29446.7 | 258792.5 |
| ...max | 1900.6 | 3640.3 | 6058.8 | 12713.3 | 30327.5 | 265409.3 |
| ...median | 1901.1 | 3605.5 | 5944.6 | 12459.8 | 30040.3 | 260502 |
| ...p95 | 1901.1 | 3605.5 | 6064.7 | 12711.5 | 30040.3 | 265764.6 |
| ...p99 | 1901.1 | 3605.5 | 6064.7 | 12711.5 | 30040.3 | 265764.6 |

The results show that doubling the user count nearly doubles the time it takes for each user to create the hierarchy: for example, the median session length grows from ~3.6 s with 2 users to ~5.9 s with 4 and ~12.5 s with 8.

Native implementation

| Metric | 1 user | 2 users | 4 users | 8 users | 16 users |
| --- | ---: | ---: | ---: | ---: | ---: |
| **itwin.nodes_request.response_time:** | | | | | |
| ...min | 29 | 31 | 32 | 27 | 38 |
| ...max | 14557 | 28824 | 44401 | 148789 | 283461 |
| ...median | 12968.3 | 18220 | 30040.3 | 61717.2 | 86710.4 |
| ...p95 | 14048.5 | 27181.5 | 42205.5 | 84993.4 | 164448.1 |
| ...p99 | 14332.3 | 28290.8 | 43058.1 | 90249.2 | 189161.2 |
| **http.requests** | 2019 | 4039 | 8083 | 16261 | 32678 |
| **vusers.completed** | 1 | 2 | 4 | 8 | 16 |
| **vusers.session_length:** | | | | | |
| ...min | 56008.5 | 86025.5 | 116437.6 | 223493.6 | 292600 |
| ...max | 56008.5 | 86097.9 | 119656.8 | 236531.4 | 431296.9 |
| ...median | 55843.8 | 86710.4 | 117048 | 231043.6 | 412660.7 |
| ...p95 | 55843.8 | 86710.4 | 117048 | 235711.1 | 429502.3 |
| ...p99 | 55843.8 | 86710.4 | 117048 | 235711.1 | 429502.3 |

As with the stateless implementation, performance degrades as the number of users grows.

grigasp commented 1 year ago

Scalability tests

Stateless implementation

Because scalability is being tested on a single machine with a limited number of cores for multithreading, we want to reduce the number of worker threads used for running queries, so we can see how increasing the number of processes affects performance. Otherwise, the additional processes would have to share the same processor cores and increasing the process count wouldn't have the desired effect. The minimum number of query runner threads is 2, so we're using that.

In a real-world application the backend processes could run on separate machines, so we'd get the benefits of both a large number of threads per process and a large number of processes, without them having to share the same resources.
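
For context, here's a minimal sketch of how multiple identical backend processes could be spawned on a single machine using Node's cluster module, which by default round-robins incoming connections across workers; the QUERY_WORKER_THREADS knob is illustrative, not an actual configuration option:

```ts
// Hypothetical sketch: fork N identical backend processes on one machine.
// Node's cluster module load-balances incoming connections across workers.
import cluster from "node:cluster";
import { createServer } from "node:http";

const PROCESS_COUNT = Number(process.env.BACKEND_PROCESS_COUNT ?? "2");

if (cluster.isPrimary) {
  for (let i = 0; i < PROCESS_COUNT; ++i) {
    // QUERY_WORKER_THREADS is an illustrative knob for the "2 query runner
    // threads per process" setup used in these tests, not a real config name.
    cluster.fork({ QUERY_WORKER_THREADS: "2" });
  }
} else {
  createServer((_req, res) => {
    res.end(`handled by pid ${process.pid}`);
  }).listen(3000);
}
```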

1 backend process

| Metric | 1 user | 2 users | 4 users | 8 users | 16 users | 100 users |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| **itwin.nodes_request.response_time:** | | | | | | |
| ...min | 0 | 0 | 0 | 0 | 0 | 0 |
| ...max | 1080 | 2194 | 3636 | 6820 | 19499 | 183221 |
| ...median | 1 | 2 | 1 | 1 | 1 | 1 |
| ...p95 | 963.1 | 1978.7 | 3134.5 | 5487.5 | 15526 | 140132.7 |
| ...p99 | 1043.3 | 2101.1 | 3464.1 | 6187.2 | 17505.6 | 157999.8 |
| **http.requests** | 911 | 1822 | 3644 | 7288 | 14576 | 91100 |
| **itwin.query_rows.response_time:** | | | | | | |
| ...min | 7 | 5 | 5 | 4 | 9 | 40 |
| ...max | 561 | 418 | 442 | 799 | 2200 | 13000 |
| ...median | 18 | 37 | 62.2 | 108.9 | 295.9 | 2566.3 |
| ...p95 | 48.9 | 90.9 | 113.3 | 214.9 | 854.2 | 6976.1 |
| ...p99 | 247.2 | 247.2 | 242.3 | 450.4 | 1436.8 | 9801.2 |
| **itwin.schema_json.response_time:** | | | | | | |
| ...min | 3 | 3 | 3 | 2 | 3 | 12 |
| ...max | 276 | 227 | 271 | 549 | 1023 | 12011 |
| ...median | 10.9 | 18 | 26.8 | 66 | 361.5 | 3534.1 |
| ...p95 | 45.2 | 172.5 | 135.7 | 295.9 | 788.5 | 9607.1 |
| ...p99 | 45.2 | 172.5 | 179.5 | 528.6 | 925.4 | 10832 |
| **vusers.completed** | 1 | 2 | 4 | 8 | 16 | 100 |
| **vusers.session_length:** | | | | | | |
| ...min | 3356.6 | 4547.4 | 6962.8 | 12177.2 | 35856.4 | 293976.8 |
| ...max | 3356.6 | 4767.3 | 7172.5 | 12509.9 | 37057 | 309541.8 |
| ...median | 3328.3 | 4583.6 | 7117 | 12213.1 | 36691.5 | 305703.5 |
| ...p95 | 3328.3 | 4583.6 | 7117 | 12459.8 | 36691.5 | 311879.3 |
| ...p99 | 3328.3 | 4583.6 | 7117 | 12459.8 | 36691.5 | 311879.3 |

Interestingly, compared to 4 query runner threads, performance is much worse in the 1-2 user cases, but as more users are added, the difference shrinks.

2 backend processes

| Metric | 1 user | 2 users | 4 users | 8 users | 16 users | 100 users |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| **itwin.nodes_request.response_time:** | | | | | | |
| ...min | 0 | 0 | 0 | 0 | 0 | 0 |
| ...max | 718 | 1087 | 2112 | 5739 | 12850 | 127857 |
| ...median | 1 | 2 | 2 | 2 | 1 | 1 |
| ...p95 | 620.3 | 963.1 | 1826.6 | 4867 | 10407.3 | 99741.2 |
| ...p99 | 699.4 | 1022.7 | 2059.5 | 5378.9 | 11971.2 | 114730.2 |
| **http.requests** | 911 | 1822 | 3644 | 7288 | 14576 | 91100 |
| **itwin.query_rows.response_time:** | | | | | | |
| ...min | 3 | 4 | 4 | 4 | 5 | 11 |
| ...max | 115 | 253 | 600 | 1313 | 4574 | 8753 |
| ...median | 10.9 | 18 | 32.1 | 92.8 | 206.5 | 1790.4 |
| ...p95 | 37 | 40.9 | 98.5 | 232.8 | 278.7 | 4316.6 |
| ...p99 | 80.6 | 149.9 | 284.3 | 459.5 | 561.2 | 5168 |
| **itwin.schema_json.response_time:** | | | | | | |
| ...min | 3 | 2 | 3 | 2 | 3 | 29 |
| ...max | 90 | 219 | 375 | 422 | 827 | 6265 |
| ...median | 6 | 7 | 13.9 | 30.3 | 90.9 | 2143.5 |
| ...p95 | 16.9 | 183.1 | 149.9 | 262.5 | 376.2 | 4770.6 |
| ...p99 | 16.9 | 183.1 | 333.7 | 415.8 | 497.8 | 5487.5 |
| **vusers.completed** | 1 | 2 | 4 | 8 | 16 | 100 |
| **vusers.session_length:** | | | | | | |
| ...min | 1713.2 | 2686.4 | 4458.2 | 10242.3 | 19823.8 | 190318.8 |
| ...max | 1713.2 | 2918 | 5286.4 | 11229.4 | 22866.6 | 206870.9 |
| ...median | 1720.2 | 2671 | 5065.6 | 10617.5 | 21813.5 | 200858.7 |
| ...p95 | 1720.2 | 2671 | 5168 | 11050.8 | 22703.7 | 204916.5 |
| ...p99 | 1720.2 | 2671 | 5168 | 11050.8 | 22703.7 | 204916.5 |

Introducing an additional backend process substantially improves performance over 1 backend process, and even shows noticeably better results than 1 backend process with 4 query runner threads (the total number of query runner threads is the same). This could be explained by 2 processes being able to do (de)serialization work in parallel, while a single backend process has to do it serially. Overall, we still see substantial performance degradation as the user count increases.

4 backend processes

| Metric | 1 user | 2 users | 4 users | 8 users | 16 users | 100 users |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| **itwin.nodes_request.response_time:** | | | | | | |
| ...min | 0 | 0 | 0 | 0 | 0 | 0 |
| ...max | 898 | 1112 | 2058 | 5186 | 11008 | 85060 |
| ...median | 1 | 1 | 2 | 2 | 1 | 1 |
| ...p95 | 788.5 | 1022.7 | 1755 | 3984.7 | 8692.8 | 64236 |
| ...p99 | 889.1 | 1085.9 | 1978.7 | 4583.6 | 9801.2 | 75382 |
| **http.requests** | 911 | 1822 | 3644 | 7288 | 14576 | 91100 |
| **itwin.query_rows.response_time:** | | | | | | |
| ...min | 5 | 4 | 6 | 5 | 5 | 6 |
| ...max | 159 | 285 | 402 | 825 | 2361 | 7244 |
| ...median | 13.9 | 18 | 32.1 | 73 | 169 | 1153.1 |
| ...p95 | 34.1 | 36.2 | 96.6 | 179.5 | 333.7 | 3752.7 |
| ...p99 | 96.6 | 165.7 | 301.9 | 327.1 | 713.5 | 4867 |
| **itwin.schema_json.response_time:** | | | | | | |
| ...min | 2 | 2 | 3 | 2 | 3 | 84 |
| ...max | 133 | 263 | 285 | 658 | 1275 | 6911 |
| ...median | 7.9 | 7.9 | 15 | 27.9 | 77.5 | 2416.8 |
| ...p95 | 37.7 | 228.2 | 257.3 | 407.5 | 497.8 | 4965.3 |
| ...p99 | 37.7 | 228.2 | 278.7 | 620.3 | 1130.2 | 5711.5 |
| **vusers.completed** | 1 | 2 | 4 | 8 | 16 | 100 |
| **vusers.session_length:** | | | | | | |
| ...min | 2189 | 2737.2 | 4842.5 | 8130.9 | 17726.8 | 151280.1 |
| ...max | 2189 | 2835.1 | 5105.4 | 9991.2 | 19637.5 | 159771.2 |
| ...median | 2186.8 | 2725 | 4867 | 9230.4 | 19346.7 | 154871.1 |
| ...p95 | 2186.8 | 2725 | 4965.3 | 9801.2 | 19737.6 | 157999.8 |
| ...p99 | 2186.8 | 2725 | 4965.3 | 9801.2 | 19737.6 | 157999.8 |

Doubling the backend process count to 4, we see that performance for 1-4 users doesn't change, but it's noticeably better with larger numbers of users.

8 backend processes

| Metric | 1 user | 2 users | 4 users | 8 users | 16 users | 100 users |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| **itwin.nodes_request.response_time:** | | | | | | |
| ...min | 0 | 0 | 0 | 0 | 0 | 0 |
| ...max | 848 | 1195 | 1955 | 4329 | 9363 | 77186 |
| ...median | 1 | 2 | 2 | 2 | 1 | 1 |
| ...p95 | 742.6 | 1043.3 | 1720.2 | 3328.3 | 7260.8 | 59297.1 |
| ...p99 | 820.7 | 1153.1 | 1826.6 | 3752.7 | 8352 | 66857.6 |
| **http.requests** | 911 | 1822 | 3644 | 7288 | 14576 | 91100 |
| **itwin.query_rows.response_time:** | | | | | | |
| ...min | 4 | 4 | 5 | 4 | 5 | 9 |
| ...max | 172 | 211 | 696 | 1074 | 3995 | 5947 |
| ...median | 13.1 | 19.1 | 23.8 | 50.9 | 135.7 | 1107.9 |
| ...p95 | 23.8 | 41.7 | 135.7 | 232.8 | 432.7 | 2951.9 |
| ...p99 | 117.9 | 159.2 | 295.9 | 407.5 | 742.6 | 3678.4 |
| **itwin.schema_json.response_time:** | | | | | | |
| ...min | 3 | 3 | 2 | 3 | 7 | 41 |
| ...max | 94 | 139 | 427 | 693 | 1588 | 4701 |
| ...median | 7 | 7.9 | 12.1 | 30.3 | 83.9 | 1686.1 |
| ...p95 | 19.1 | 135.7 | 242.3 | 399.5 | 1002.4 | 3197.8 |
| ...p99 | 19.1 | 135.7 | 340.4 | 645.6 | 1408.4 | 3752.7 |
| **vusers.completed** | 1 | 2 | 4 | 8 | 16 | 100 |
| **vusers.session_length:** | | | | | | |
| ...min | 1929.9 | 2808.3 | 5061 | 8212.3 | 16228.9 | 108969.4 |
| ...max | 1929.9 | 2871.2 | 5124.1 | 8869.3 | 19919.2 | 136988.6 |
| ...median | 1939.5 | 2836.2 | 5065.6 | 8692.8 | 18588.1 | 131971.7 |
| ...p95 | 1939.5 | 2836.2 | 5065.6 | 8868.4 | 19737.6 | 137357.8 |
| ...p99 | 1939.5 | 2836.2 | 5065.6 | 8868.4 | 19737.6 | 137357.8 |

Doubling the backend process count to 8, we see that performance for 1-16 users doesn't change, but it's slightly better with 100 users.

Summary

Here we take the 100-user cases from the previous sections to better see how the results change as the number of backend processes increases.

| Metric | 1 backend | 2 backends | 4 backends | 8 backends |
| --- | ---: | ---: | ---: | ---: |
| **itwin.nodes_request.response_time:** | | | | |
| ...min | 0 | 0 | 0 | 0 |
| ...max | 183221 | 127857 | 85060 | 77186 |
| ...median | 1 | 1 | 1 | 1 |
| ...p95 | 140132.7 | 99741.2 | 64236 | 59297.1 |
| ...p99 | 157999.8 | 114730.2 | 75382 | 66857.6 |
| **http.requests** | 91100 | 91100 | 91100 | 91100 |
| **itwin.query_rows.response_time:** | | | | |
| ...min | 40 | 11 | 6 | 9 |
| ...max | 13000 | 8753 | 7244 | 5947 |
| ...median | 2566.3 | 1790.4 | 1153.1 | 1107.9 |
| ...p95 | 6976.1 | 4316.6 | 3752.7 | 2951.9 |
| ...p99 | 9801.2 | 5168 | 4867 | 3678.4 |
| **itwin.schema_json.response_time:** | | | | |
| ...min | 12 | 29 | 84 | 41 |
| ...max | 12011 | 6265 | 6911 | 4701 |
| ...median | 3534.1 | 2143.5 | 2416.8 | 1686.1 |
| ...p95 | 9607.1 | 4770.6 | 4965.3 | 3197.8 |
| ...p99 | 10832 | 5487.5 | 5711.5 | 3752.7 |
| **vusers.completed** | 100 | 100 | 100 | 100 |
| **vusers.session_length:** | | | | |
| ...min | 293976.8 | 190318.8 | 151280.1 | 108969.4 |
| ...max | 309541.8 | 206870.9 | 159771.2 | 136988.6 |
| ...median | 305703.5 | 200858.7 | 154871.1 | 131971.7 |
| ...p95 | 311879.3 | 204916.5 | 157999.8 | 137357.8 |
| ...p99 | 311879.3 | 204916.5 | 157999.8 | 137357.8 |

The table clearly shows that scaling the backend does help: all the timings are reduced. And, as mentioned earlier, the effect should be even greater when the processes run on different machines.

Native implementation

In this section we're comparing how the performance of the native implementation is affected by scaling the backend, using 16 users.

| Metric | 1 backend | 2 backends | 4 backends |
| --- | ---: | ---: | ---: |
| **itwin.nodes_request.response_time:** | | | |
| ...min | 38 | 151 | 102 |
| ...max | 283461 | 321151 | 366406 |
| ...median | 86710.4 | 70992 | 75382 |
| ...p95 | 164448.1 | 137357.8 | 140132.7 |
| ...p99 | 189161.2 | 148798.3 | 154871.1 |
| **http.requests** | 32678 | 32791 | 32692 |
| **vusers.completed** | 16 | 16 | 16 |
| **vusers.session_length:** | | | |
| ...min | 292600 | 324991.4 | 332269.2 |
| ...max | 431296.9 | 407962.6 | 408147.4 |
| ...median | 412660.7 | 396479.5 | 404489.2 |
| ...p95 | 429502.3 | 404489.2 | 404489.2 |
| ...p99 | 429502.3 | 404489.2 | 404489.2 |

The table clearly shows that increasing the number of backend processes has little to no effect on performance for the end users.

grigasp commented 1 year ago

Initial load tests

The tests mimic the Models Tree initial load tests that run as part of the iTwin Platform visualization performance test suite. The goal is to stay under 5 seconds, and we see that this is mostly achieved, except for a couple of cases where the native implementation doesn't achieve it either (in the cold cache situation).

Overall, the native implementation in warm cache situations is the fastest, but that's not guaranteed: end users see a mix of cold and warm cache situations. The stateless implementation, on the other hand, doesn't involve any caches, so its performance is expected to be more stable. Comparing the stateless implementation against the native one with a cold cache, the stateless one performs better in all but a few cases.

| iModel | Stateless | Native (cold cache) | Native (warm cache) |
| --- | ---: | ---: | ---: |
| S - 1 | 159.2 | 232.8 | 71.5 |
| S - 2 | 149.9 | 295.9 | 79.1 |
| S - 3 | 159.2 | 290.1 | 106.7 |
| S - 4 | 159.2 | 295.9 | 87.4 |
| S - 5 | 156 | 361.5 | 125.2 |
| S - 6 | 169 | 441.5 | 111.1 |
| S - 7 | 147 | 584.2 | 102.5 |
| M - 1 | 156 | 242.3 | 71.5 |
| M - 2 | 172.5 | 383.8 | 111.1 |
| M - 3 | 169 | 333.7 | 82.3 |
| M - 4 | 273.2 | 497.8 | 79.1 |
| M - 5 | 347.3 | 889.1 | 104.6 |
| M - 6 | 156 | 262.5 | 73 |
| M - 7 | 1224.4 | 2416.8 | 290.1 |
| M - 8 | 165.7 | 301.9 | 87.4 |
| M - 9 | 159.2 | 262.5 | 83.9 |
| M - 10 | 175.9 | 327.1 | 79.1 |
| M - 11 | 165.7 | 407.5 | 102.5 |
| M - 12 | 156 | 383.8 | 85.6 |
| M - 13 | 159.2 | 320.6 | 79.1 |
| M - 14 | 214.9 | 528.6 | 90.9 |
| M - 15 | 214.9 | 278.7 | 71.5 |
| L - 1 | 6838 | 6312.2 | 198.4 |
| L - 2 | 1408.4 | 1130.2 | 111.1 |
| L - 3 | 4147.4 | 4231.1 | 90.9 |
| L - 4 | 1380.5 | 1436.8 | 115.6 |
| L - 5 | 699.4 | 459.5 | 80.6 |
| L - 6 | 632.8 | 528.6 | 102.5 |
| L - 7 | 172.5 | 424.2 | 77.5 |
| L - 8 | 314.2 | 788.5 | 113.3 |
| L - 9 | 327.1 | 742.6 | 144 |
| L - 10 | 262.5 | 713.5 | 106.7 |
| XL - 1 | 135.7 | 399.5 | 98.5 |
| XL - 2 | 820.7 | 1274.3 | 133 |
| XL - 3 | 671.9 | 742.6 | 82.3 |
| XL - 4 | 3134.5 | 1495.5 | 144 |
| XL - 5 | 561.2 | 2893.5 | 273.2 |
| XL - 6 | 36691.5 | 24594.7 | 4065.2 |
| XL - 7 | 478.3 | 1587.9 | 295.9 |

grigasp commented 1 year ago

Conclusion

The stateless implementation not only shows an order of magnitude better performance than the native implementation when creating the full Models Tree hierarchy, but also outperforms it in the majority of Models Tree initial load test cases when compared against the native implementation in the cold cache situation.

Furthermore, the scalability tests show that the stateless implementation scales much better, with request performance improving substantially as the number of backend processes grows. The native implementation, on the other hand, shows nearly no improvement.

The stateless implementation outperforms the native one because of:

- the hard limit on the number of nodes loaded per hierarchy level, which allows post-processing like grouping to happen on the frontend and reduces the number of backend requests;
- lower backend resource usage (peak private bytes, I/O reads and writes);
- better scalability as the number of backend processes grows.

grigasp commented 1 year ago

Follow up items