iTwin / presentation

Monorepo for iTwin.js Presentation Library
https://www.itwinjs.org/presentation/

Proof of concept for stateless presentation-driven tree data provider #6

Closed · grigasp closed this issue 1 year ago

grigasp commented 1 year ago

The load test results have multiple metrics, the most important of which, I believe, are itwin.nodes_request.response_time and vusers.session_length. The stateless implementation makes two types of requests when creating nodes: "query rows" and "schema json", so it's also interesting to see how long those take (represented by the itwin.query_rows.response_time and itwin.schema_json.response_time metrics). All timings below are in milliseconds.

Nodes response time (itwin.nodes_request.response_time) shows how long a user has to wait to get the child nodes for a parent. In case of the stateless implementation this is different from just making a request, because creating nodes may involve making more than one request plus additional post-processing (sorting, grouping, etc.). In case of the native implementation, everything's done on the backend, so the timings for making the request nearly match the timings of creating the nodes.

Virtual user session length (vusers.session_length) shows how long it takes a user to load the full hierarchy.

Ideally, we want these metrics to be affected as little as possible by adding more virtual users.
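
As a side note, the custom itwin.* metrics suggest the tests are driven by a load testing tool with custom metric hooks (the vusers.* and http.* names match Artillery's built-in metrics). Here's a minimal sketch of how such a custom histogram metric could be emitted from an Artillery processor function; `loadChildNodes` and the wiring are hypothetical, not the actual test code:

```ts
// Hypothetical Artillery processor sketch: times the "create child nodes" step
// and reports it as the custom `itwin.nodes_request.response_time` histogram.
import type { EventEmitter } from "node:events";

// Stand-in for the tree data provider under test.
async function loadChildNodes(parentId?: string): Promise<unknown[]> {
  // ...make "query rows" / "schema json" requests, post-process, etc.
  return [];
}

export async function createNodes(
  context: { vars: Record<string, unknown> },
  events: EventEmitter,
): Promise<void> {
  const start = performance.now();
  const nodes = await loadChildNodes(); // root level
  // Artillery aggregates `histogram` events into the min/max/median/p95/p99
  // values seen in the result tables.
  events.emit("histogram", "itwin.nodes_request.response_time", performance.now() - start);
  context.vars.rootNodes = nodes;
}
```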

grigasp commented 1 year ago

Best case scenario

In this scenario a single backend process (with the default configuration) handles all requests of a single user. This should give a good sense of the best possible performance we can expect from each implementation.

| Metric | Stateless | Native (cold cache) | Native (warm cache) |
| --- | ---: | ---: | ---: |
| **itwin.nodes_request.response_time:** | | | |
| ...min | 0 | 50 | 23 |
| ...max | 802 | 10809 | 4609 |
| ...median | 1 | 7709.8 | 2671 |
| ...p95 | 713.5 | 10201.2 | 4583.6 |
| ...p99 | 788.5 | 10617.5 | 4583.6 |
| **http.requests** | 911 | 2018 | 2018 |
| **vusers.session_length:** | | | |
| ...min | 1900.6 | 47225.8 | 16710.9 |
| ...max | 1900.6 | 47225.8 | 16710.9 |
| ...median | 1901.1 | 47586.7 | 16819.2 |
| ...p95 | 1901.1 | 47586.7 | 16819.2 |
| ...p99 | 1901.1 | 47586.7 | 16819.2 |
| **stats (attached images):** | | | |
| ...numbers | 1proc_stateless_metrics_2 | 1proc_native_metrics_2 | 1proc_native_warm_metrics |
| ...graphs | 1proc_stateless_graph_2 | 1proc_native_graph_2 | 1proc_native_warm_graph |

The results show that the stateless implementation outperforms the native one even with a warm cache. This is mostly due to the smaller number of requests it has to make to create the hierarchy: the stateless implementation is built with a hard limit on the number of nodes it loads for a single hierarchy level, which allows optimizations such as performing grouping on the frontend rather than the backend, reducing the number of requests made against the backend.

In addition, the stateless implementation requires far fewer resources (see Peak Private Bytes, I/O Reads and I/O Writes in the stats images).
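
To illustrate the frontend grouping mentioned above, here's a minimal sketch, assuming a hierarchy level capped at some maximum row count so all of its rows can be fetched with a single "query rows" request and grouped in memory; all names and the cap value are hypothetical, not the library's actual code:

```ts
// Hypothetical sketch: because a hierarchy level is capped at MAX_LEVEL_SIZE
// rows, they can all be pulled with one "query rows" request and grouped on
// the frontend, instead of asking the backend to group them.
const MAX_LEVEL_SIZE = 1000; // illustrative cap, not the library's actual value

interface InstanceRow { id: string; classLabel: string; label: string; }
interface GroupingNode { label: string; children: InstanceRow[]; }

function groupByClass(rows: InstanceRow[]): GroupingNode[] {
  if (rows.length > MAX_LEVEL_SIZE)
    throw new Error("Hierarchy level too large - expected the query to be capped");
  const groups = new Map<string, InstanceRow[]>();
  for (const row of rows) {
    const group = groups.get(row.classLabel) ?? [];
    group.push(row);
    groups.set(row.classLabel, group);
  }
  // One grouping node per class, sorted by label - also done on the frontend.
  return [...groups.entries()]
    .map(([label, children]) => ({ label, children }))
    .sort((a, b) => a.label.localeCompare(b.label));
}
```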

grigasp commented 1 year ago

Load tests

We're comparing how the performance of creating the full Models tree hierarchy is affected by the number of users creating it simultaneously.

Stateless implementation

We're checking how the stateless library implementation performs with the default concurrent query configuration (4 query runner threads).
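
Conceptually, each virtual user's session is a full depth-first expansion of the Models tree. A simplified sketch of the per-user flow, with `loadChildNodes` as a hypothetical stand-in for either implementation:

```ts
// Hypothetical per-virtual-user scenario: recursively expand the whole
// Models tree and measure the total session length.
interface TreeNode { id: string; hasChildren: boolean; }

// Stand-in for a nodes request against the implementation under test.
async function loadChildNodes(parentId?: string): Promise<TreeNode[]> {
  return [];
}

async function expandAll(parentId?: string): Promise<void> {
  for (const child of await loadChildNodes(parentId)) {
    if (child.hasChildren)
      await expandAll(child.id);
  }
}

async function runUserSession(): Promise<number> {
  const start = performance.now();
  await expandAll(); // start from the root
  return performance.now() - start; // what vusers.session_length measures
}
```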

| Metric | 1 user | 2 users | 4 users | 8 users | 16 users | 100 users |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| **itwin.nodes_request.response_time:** | | | | | | |
| ...min | 0 | 0 | 0 | 0 | 0 | 0 |
| ...max | 802 | 1532 | 3225 | 6817 | 16515 | 153334 |
| ...median | 1 | 1 | 1 | 1 | 1 | 1 |
| ...p95 | 713.5 | 1408.4 | 2618.1 | 5378.9 | 13230.3 | 131971.7 |
| ...p99 | 788.5 | 1495.5 | 3011.6 | 6187.2 | 14917.2 | 145851.8 |
| **http.requests** | 911 | 1822 | 3644 | 7288 | 14576 | 91100 |
| **itwin.query_rows.response_time:** | | | | | | |
| ...min | 5 | 4 | 4 | 5 | 8 | 31 |
| ...max | 143 | 219 | 362 | 620 | 1403 | 13403 |
| ...median | 12.1 | 25.8 | 51.9 | 106.7 | 257.3 | 2186.8 |
| ...p95 | 43.4 | 83.9 | 102.5 | 198.4 | 632.8 | 5378.9 |
| ...p99 | 104.6 | 156 | 186.8 | 518.1 | 1200.1 | 9801.2 |
| **itwin.schema_json.response_time:** | | | | | | |
| ...min | 2 | 2 | 2 | 2 | 4 | 185 |
| ...max | 75 | 116 | 151 | 428 | 974 | 12509 |
| ...median | 13.9 | 16.9 | 29.1 | 71.5 | 278.7 | 3262.4 |
| ...p95 | 24.8 | 104.6 | 130.3 | 320.6 | 620.3 | 9607.1 |
| ...p99 | 24.8 | 104.6 | 147 | 424.2 | 804.5 | 11274.1 |
| **vusers.completed** | 1 | 2 | 4 | 8 | 16 | 100 |
| **vusers.session_length:** | | | | | | |
| ...min | 1900.6 | 3631.3 | 5689.6 | 12396 | 29446.7 | 258792.5 |
| ...max | 1900.6 | 3640.3 | 6058.8 | 12713.3 | 30327.5 | 265409.3 |
| ...median | 1901.1 | 3605.5 | 5944.6 | 12459.8 | 30040.3 | 260502 |
| ...p95 | 1901.1 | 3605.5 | 6064.7 | 12711.5 | 30040.3 | 265764.6 |
| ...p99 | 1901.1 | 3605.5 | 6064.7 | 12711.5 | 30040.3 | 265764.6 |

The results show that doubling the user count nearly doubles the time it takes for each user to create the hierarchy: for example, the median session length grows from ~3.6 s with 2 users to ~5.9 s with 4 and ~12.5 s with 8.

Native implementation

| Metric | 1 user | 2 users | 4 users | 8 users | 16 users |
| --- | ---: | ---: | ---: | ---: | ---: |
| **itwin.nodes_request.response_time:** | | | | | |
| ...min | 29 | 31 | 32 | 27 | 38 |
| ...max | 14557 | 28824 | 44401 | 148789 | 283461 |
| ...median | 12968.3 | 18220 | 30040.3 | 61717.2 | 86710.4 |
| ...p95 | 14048.5 | 27181.5 | 42205.5 | 84993.4 | 164448.1 |
| ...p99 | 14332.3 | 28290.8 | 43058.1 | 90249.2 | 189161.2 |
| **http.requests** | 2019 | 4039 | 8083 | 16261 | 32678 |
| **vusers.completed** | 1 | 2 | 4 | 8 | 16 |
| **vusers.session_length:** | | | | | |
| ...min | 56008.5 | 86025.5 | 116437.6 | 223493.6 | 292600 |
| ...max | 56008.5 | 86097.9 | 119656.8 | 236531.4 | 431296.9 |
| ...median | 55843.8 | 86710.4 | 117048 | 231043.6 | 412660.7 |
| ...p95 | 55843.8 | 86710.4 | 117048 | 235711.1 | 429502.3 |
| ...p99 | 55843.8 | 86710.4 | 117048 | 235711.1 | 429502.3 |

As with the stateless implementation, performance degrades as the number of users grows.

grigasp commented 1 year ago

Scalability tests

Stateless implementation

Because scalability is being tested on a single machine with a limited number of cores for multithreading, we want to reduce the number of worker threads used for running queries, so we can see how increasing the number of processes affects performance. Otherwise, the additional processes would have to share the same processor cores and increasing the process count wouldn't have the desired effect. The minimum number of query runner threads is 2, so we're using that.

In a real-world application the backend processes could run on separate machines, so we'd get the benefits of both a large number of threads per process and a large number of processes, without them having to share the same resources.
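
For context, here's a minimal sketch of how multiple identical backend processes could be spawned on a single machine using Node's cluster module, which by default round-robins incoming connections across workers; the QUERY_WORKER_THREADS knob is illustrative, not an actual configuration option:

```ts
// Hypothetical sketch: fork N identical backend processes on one machine.
// Node's cluster module load-balances incoming connections across workers.
import cluster from "node:cluster";
import { createServer } from "node:http";

const PROCESS_COUNT = Number(process.env.BACKEND_PROCESS_COUNT ?? "2");

if (cluster.isPrimary) {
  for (let i = 0; i < PROCESS_COUNT; ++i) {
    // QUERY_WORKER_THREADS is an illustrative knob for the "2 query runner
    // threads per process" setup used in these tests, not a real config name.
    cluster.fork({ QUERY_WORKER_THREADS: "2" });
  }
} else {
  createServer((_req, res) => {
    res.end(`handled by pid ${process.pid}`);
  }).listen(3000);
}
```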

1 backend process

| Metric | 1 user | 2 users | 4 users | 8 users | 16 users | 100 users |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| **itwin.nodes_request.response_time:** | | | | | | |
| ...min | 0 | 0 | 0 | 0 | 0 | 0 |
| ...max | 1080 | 2194 | 3636 | 6820 | 19499 | 183221 |
| ...median | 1 | 2 | 1 | 1 | 1 | 1 |
| ...p95 | 963.1 | 1978.7 | 3134.5 | 5487.5 | 15526 | 140132.7 |
| ...p99 | 1043.3 | 2101.1 | 3464.1 | 6187.2 | 17505.6 | 157999.8 |
| **http.requests** | 911 | 1822 | 3644 | 7288 | 14576 | 91100 |
| **itwin.query_rows.response_time:** | | | | | | |
| ...min | 7 | 5 | 5 | 4 | 9 | 40 |
| ...max | 561 | 418 | 442 | 799 | 2200 | 13000 |
| ...median | 18 | 37 | 62.2 | 108.9 | 295.9 | 2566.3 |
| ...p95 | 48.9 | 90.9 | 113.3 | 214.9 | 854.2 | 6976.1 |
| ...p99 | 247.2 | 247.2 | 242.3 | 450.4 | 1436.8 | 9801.2 |
| **itwin.schema_json.response_time:** | | | | | | |
| ...min | 3 | 3 | 3 | 2 | 3 | 12 |
| ...max | 276 | 227 | 271 | 549 | 1023 | 12011 |
| ...median | 10.9 | 18 | 26.8 | 66 | 361.5 | 3534.1 |
| ...p95 | 45.2 | 172.5 | 135.7 | 295.9 | 788.5 | 9607.1 |
| ...p99 | 45.2 | 172.5 | 179.5 | 528.6 | 925.4 | 10832 |
| **vusers.completed** | 1 | 2 | 4 | 8 | 16 | 100 |
| **vusers.session_length:** | | | | | | |
| ...min | 3356.6 | 4547.4 | 6962.8 | 12177.2 | 35856.4 | 293976.8 |
| ...max | 3356.6 | 4767.3 | 7172.5 | 12509.9 | 37057 | 309541.8 |
| ...median | 3328.3 | 4583.6 | 7117 | 12213.1 | 36691.5 | 305703.5 |
| ...p95 | 3328.3 | 4583.6 | 7117 | 12459.8 | 36691.5 | 311879.3 |
| ...p99 | 3328.3 | 4583.6 | 7117 | 12459.8 | 36691.5 | 311879.3 |

Interestingly, compared to 4 query runner threads, performance is much worse in the 1-2 user cases, but as more users are added, the difference shrinks.

2 backend processes

| Metric | 1 user | 2 users | 4 users | 8 users | 16 users | 100 users |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| **itwin.nodes_request.response_time:** | | | | | | |
| ...min | 0 | 0 | 0 | 0 | 0 | 0 |
| ...max | 718 | 1087 | 2112 | 5739 | 12850 | 127857 |
| ...median | 1 | 2 | 2 | 2 | 1 | 1 |
| ...p95 | 620.3 | 963.1 | 1826.6 | 4867 | 10407.3 | 99741.2 |
| ...p99 | 699.4 | 1022.7 | 2059.5 | 5378.9 | 11971.2 | 114730.2 |
| **http.requests** | 911 | 1822 | 3644 | 7288 | 14576 | 91100 |
| **itwin.query_rows.response_time:** | | | | | | |
| ...min | 3 | 4 | 4 | 4 | 5 | 11 |
| ...max | 115 | 253 | 600 | 1313 | 4574 | 8753 |
| ...median | 10.9 | 18 | 32.1 | 92.8 | 206.5 | 1790.4 |
| ...p95 | 37 | 40.9 | 98.5 | 232.8 | 278.7 | 4316.6 |
| ...p99 | 80.6 | 149.9 | 284.3 | 459.5 | 561.2 | 5168 |
| **itwin.schema_json.response_time:** | | | | | | |
| ...min | 3 | 2 | 3 | 2 | 3 | 29 |
| ...max | 90 | 219 | 375 | 422 | 827 | 6265 |
| ...median | 6 | 7 | 13.9 | 30.3 | 90.9 | 2143.5 |
| ...p95 | 16.9 | 183.1 | 149.9 | 262.5 | 376.2 | 4770.6 |
| ...p99 | 16.9 | 183.1 | 333.7 | 415.8 | 497.8 | 5487.5 |
| **vusers.completed** | 1 | 2 | 4 | 8 | 16 | 100 |
| **vusers.session_length:** | | | | | | |
| ...min | 1713.2 | 2686.4 | 4458.2 | 10242.3 | 19823.8 | 190318.8 |
| ...max | 1713.2 | 2918 | 5286.4 | 11229.4 | 22866.6 | 206870.9 |
| ...median | 1720.2 | 2671 | 5065.6 | 10617.5 | 21813.5 | 200858.7 |
| ...p95 | 1720.2 | 2671 | 5168 | 11050.8 | 22703.7 | 204916.5 |
| ...p99 | 1720.2 | 2671 | 5168 | 11050.8 | 22703.7 | 204916.5 |

Introducing an additional backend process substantially improves performance over 1 backend process, and even shows noticeably better results than 1 backend process with 4 query runner threads (the total number of query runner threads is the same). This could be explained by 2 processes being able to do (de)serialization work in parallel, while a single backend process has to do it serially. Overall, we still see substantial performance degradation as the user count increases.

4 backend processes

| Metric | 1 user | 2 users | 4 users | 8 users | 16 users | 100 users |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| **itwin.nodes_request.response_time:** | | | | | | |
| ...min | 0 | 0 | 0 | 0 | 0 | 0 |
| ...max | 898 | 1112 | 2058 | 5186 | 11008 | 85060 |
| ...median | 1 | 1 | 2 | 2 | 1 | 1 |
| ...p95 | 788.5 | 1022.7 | 1755 | 3984.7 | 8692.8 | 64236 |
| ...p99 | 889.1 | 1085.9 | 1978.7 | 4583.6 | 9801.2 | 75382 |
| **http.requests** | 911 | 1822 | 3644 | 7288 | 14576 | 91100 |
| **itwin.query_rows.response_time:** | | | | | | |
| ...min | 5 | 4 | 6 | 5 | 5 | 6 |
| ...max | 159 | 285 | 402 | 825 | 2361 | 7244 |
| ...median | 13.9 | 18 | 32.1 | 73 | 169 | 1153.1 |
| ...p95 | 34.1 | 36.2 | 96.6 | 179.5 | 333.7 | 3752.7 |
| ...p99 | 96.6 | 165.7 | 301.9 | 327.1 | 713.5 | 4867 |
| **itwin.schema_json.response_time:** | | | | | | |
| ...min | 2 | 2 | 3 | 2 | 3 | 84 |
| ...max | 133 | 263 | 285 | 658 | 1275 | 6911 |
| ...median | 7.9 | 7.9 | 15 | 27.9 | 77.5 | 2416.8 |
| ...p95 | 37.7 | 228.2 | 257.3 | 407.5 | 497.8 | 4965.3 |
| ...p99 | 37.7 | 228.2 | 278.7 | 620.3 | 1130.2 | 5711.5 |
| **vusers.completed** | 1 | 2 | 4 | 8 | 16 | 100 |
| **vusers.session_length:** | | | | | | |
| ...min | 2189 | 2737.2 | 4842.5 | 8130.9 | 17726.8 | 151280.1 |
| ...max | 2189 | 2835.1 | 5105.4 | 9991.2 | 19637.5 | 159771.2 |
| ...median | 2186.8 | 2725 | 4867 | 9230.4 | 19346.7 | 154871.1 |
| ...p95 | 2186.8 | 2725 | 4965.3 | 9801.2 | 19737.6 | 157999.8 |
| ...p99 | 2186.8 | 2725 | 4965.3 | 9801.2 | 19737.6 | 157999.8 |

Doubling the backend process count to 4, we see that performance for 1-4 users doesn't change, but it's noticeably better with larger numbers of users.

8 backend processes

| Metric | 1 user | 2 users | 4 users | 8 users | 16 users | 100 users |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| **itwin.nodes_request.response_time:** | | | | | | |
| ...min | 0 | 0 | 0 | 0 | 0 | 0 |
| ...max | 848 | 1195 | 1955 | 4329 | 9363 | 77186 |
| ...median | 1 | 2 | 2 | 2 | 1 | 1 |
| ...p95 | 742.6 | 1043.3 | 1720.2 | 3328.3 | 7260.8 | 59297.1 |
| ...p99 | 820.7 | 1153.1 | 1826.6 | 3752.7 | 8352 | 66857.6 |
| **http.requests** | 911 | 1822 | 3644 | 7288 | 14576 | 91100 |
| **itwin.query_rows.response_time:** | | | | | | |
| ...min | 4 | 4 | 5 | 4 | 5 | 9 |
| ...max | 172 | 211 | 696 | 1074 | 3995 | 5947 |
| ...median | 13.1 | 19.1 | 23.8 | 50.9 | 135.7 | 1107.9 |
| ...p95 | 23.8 | 41.7 | 135.7 | 232.8 | 432.7 | 2951.9 |
| ...p99 | 117.9 | 159.2 | 295.9 | 407.5 | 742.6 | 3678.4 |
| **itwin.schema_json.response_time:** | | | | | | |
| ...min | 3 | 3 | 2 | 3 | 7 | 41 |
| ...max | 94 | 139 | 427 | 693 | 1588 | 4701 |
| ...median | 7 | 7.9 | 12.1 | 30.3 | 83.9 | 1686.1 |
| ...p95 | 19.1 | 135.7 | 242.3 | 399.5 | 1002.4 | 3197.8 |
| ...p99 | 19.1 | 135.7 | 340.4 | 645.6 | 1408.4 | 3752.7 |
| **vusers.completed** | 1 | 2 | 4 | 8 | 16 | 100 |
| **vusers.session_length:** | | | | | | |
| ...min | 1929.9 | 2808.3 | 5061 | 8212.3 | 16228.9 | 108969.4 |
| ...max | 1929.9 | 2871.2 | 5124.1 | 8869.3 | 19919.2 | 136988.6 |
| ...median | 1939.5 | 2836.2 | 5065.6 | 8692.8 | 18588.1 | 131971.7 |
| ...p95 | 1939.5 | 2836.2 | 5065.6 | 8868.4 | 19737.6 | 137357.8 |
| ...p99 | 1939.5 | 2836.2 | 5065.6 | 8868.4 | 19737.6 | 137357.8 |

Doubling the backend process count to 8, we see that performance for 1-16 users doesn't change, but it's slightly better with 100 users.

Summary

Here we take the 100-user cases from the previous sections to better see how the results change as the number of backend processes increases.

| Metric | 1 backend | 2 backends | 4 backends | 8 backends |
| --- | ---: | ---: | ---: | ---: |
| **itwin.nodes_request.response_time:** | | | | |
| ...min | 0 | 0 | 0 | 0 |
| ...max | 183221 | 127857 | 85060 | 77186 |
| ...median | 1 | 1 | 1 | 1 |
| ...p95 | 140132.7 | 99741.2 | 64236 | 59297.1 |
| ...p99 | 157999.8 | 114730.2 | 75382 | 66857.6 |
| **http.requests** | 91100 | 91100 | 91100 | 91100 |
| **itwin.query_rows.response_time:** | | | | |
| ...min | 40 | 11 | 6 | 9 |
| ...max | 13000 | 8753 | 7244 | 5947 |
| ...median | 2566.3 | 1790.4 | 1153.1 | 1107.9 |
| ...p95 | 6976.1 | 4316.6 | 3752.7 | 2951.9 |
| ...p99 | 9801.2 | 5168 | 4867 | 3678.4 |
| **itwin.schema_json.response_time:** | | | | |
| ...min | 12 | 29 | 84 | 41 |
| ...max | 12011 | 6265 | 6911 | 4701 |
| ...median | 3534.1 | 2143.5 | 2416.8 | 1686.1 |
| ...p95 | 9607.1 | 4770.6 | 4965.3 | 3197.8 |
| ...p99 | 10832 | 5487.5 | 5711.5 | 3752.7 |
| **vusers.completed** | 100 | 100 | 100 | 100 |
| **vusers.session_length:** | | | | |
| ...min | 293976.8 | 190318.8 | 151280.1 | 108969.4 |
| ...max | 309541.8 | 206870.9 | 159771.2 | 136988.6 |
| ...median | 305703.5 | 200858.7 | 154871.1 | 131971.7 |
| ...p95 | 311879.3 | 204916.5 | 157999.8 | 137357.8 |
| ...p99 | 311879.3 | 204916.5 | 157999.8 | 137357.8 |

The table clearly shows that scaling the backend does help: all the timings are reduced. And, as mentioned earlier, the effect should be even greater when the processes run on different machines.

Native implementation

In this section we're comparing how the performance of the native implementation is affected by scaling the backend, using 16 users.

| Metric | 1 backend | 2 backends | 4 backends |
| --- | ---: | ---: | ---: |
| **itwin.nodes_request.response_time:** | | | |
| ...min | 38 | 151 | 102 |
| ...max | 283461 | 321151 | 366406 |
| ...median | 86710.4 | 70992 | 75382 |
| ...p95 | 164448.1 | 137357.8 | 140132.7 |
| ...p99 | 189161.2 | 148798.3 | 154871.1 |
| **http.requests** | 32678 | 32791 | 32692 |
| **vusers.completed** | 16 | 16 | 16 |
| **vusers.session_length:** | | | |
| ...min | 292600 | 324991.4 | 332269.2 |
| ...max | 431296.9 | 407962.6 | 408147.4 |
| ...median | 412660.7 | 396479.5 | 404489.2 |
| ...p95 | 429502.3 | 404489.2 | 404489.2 |
| ...p99 | 429502.3 | 404489.2 | 404489.2 |

The table clearly shows that increasing the number of backend processes has little to no effect on performance for the end users.

grigasp commented 1 year ago

Initial load tests

The tests mimic the Models Tree initial load tests that run as part of the iTwin Platform visualization performance test suite. The goal is to stay under 5 seconds, and we see that this is mostly achieved, except for a couple of cases where the native implementation doesn't achieve it either (in the cold cache situation).

Overall, the native implementation in warm cache situations is the fastest, but that's not guaranteed: end users see a mix of cold and warm cache situations. The stateless implementation, on the other hand, doesn't involve any caches, so its performance is expected to be more stable. Comparing the stateless implementation against the native one with a cold cache, the stateless one performs better in all but a few cases.

| iModel | Stateless | Native (cold cache) | Native (warm cache) |
| --- | ---: | ---: | ---: |
| S - 1 | 159.2 | 232.8 | 71.5 |
| S - 2 | 149.9 | 295.9 | 79.1 |
| S - 3 | 159.2 | 290.1 | 106.7 |
| S - 4 | 159.2 | 295.9 | 87.4 |
| S - 5 | 156 | 361.5 | 125.2 |
| S - 6 | 169 | 441.5 | 111.1 |
| S - 7 | 147 | 584.2 | 102.5 |
| M - 1 | 156 | 242.3 | 71.5 |
| M - 2 | 172.5 | 383.8 | 111.1 |
| M - 3 | 169 | 333.7 | 82.3 |
| M - 4 | 273.2 | 497.8 | 79.1 |
| M - 5 | 347.3 | 889.1 | 104.6 |
| M - 6 | 156 | 262.5 | 73 |
| M - 7 | 1224.4 | 2416.8 | 290.1 |
| M - 8 | 165.7 | 301.9 | 87.4 |
| M - 9 | 159.2 | 262.5 | 83.9 |
| M - 10 | 175.9 | 327.1 | 79.1 |
| M - 11 | 165.7 | 407.5 | 102.5 |
| M - 12 | 156 | 383.8 | 85.6 |
| M - 13 | 159.2 | 320.6 | 79.1 |
| M - 14 | 214.9 | 528.6 | 90.9 |
| M - 15 | 214.9 | 278.7 | 71.5 |
| L - 1 | 6838 | 6312.2 | 198.4 |
| L - 2 | 1408.4 | 1130.2 | 111.1 |
| L - 3 | 4147.4 | 4231.1 | 90.9 |
| L - 4 | 1380.5 | 1436.8 | 115.6 |
| L - 5 | 699.4 | 459.5 | 80.6 |
| L - 6 | 632.8 | 528.6 | 102.5 |
| L - 7 | 172.5 | 424.2 | 77.5 |
| L - 8 | 314.2 | 788.5 | 113.3 |
| L - 9 | 327.1 | 742.6 | 144 |
| L - 10 | 262.5 | 713.5 | 106.7 |
| XL - 1 | 135.7 | 399.5 | 98.5 |
| XL - 2 | 820.7 | 1274.3 | 133 |
| XL - 3 | 671.9 | 742.6 | 82.3 |
| XL - 4 | 3134.5 | 1495.5 | 144 |
| XL - 5 | 561.2 | 2893.5 | 273.2 |
| XL - 6 | 36691.5 | 24594.7 | 4065.2 |
| XL - 7 | 478.3 | 1587.9 | 295.9 |

grigasp commented 1 year ago

Conclusion

The stateless implementation not only shows an order of magnitude better performance than the native implementation when creating the full Models Tree hierarchy, but also outperforms it in the majority of Models Tree initial load test cases when compared against the native implementation in the cold cache situation.

Furthermore, the scalability tests show that the stateless implementation scales much better, with request performance improving substantially as the number of backend processes grows. The native implementation, on the other hand, shows nearly no improvement.

The stateless implementation outperforms the native one because of:

- the hard limit on the number of nodes loaded per hierarchy level, which allows post-processing like grouping to happen on the frontend and reduces the number of backend requests;
- lower backend resource usage (peak private bytes, I/O reads and writes);
- better scalability as the number of backend processes grows.

grigasp commented 1 year ago

Follow up items