maruiz93 opened 1 month ago
Whatever query (or queries) we end up using to retrieve resources from the database based on labels or annotations needs to be performant when there are 100,000 or 1,000,000 resources in the database. I've thought of two approaches to make this performant:
We are already indexing on labels; we implemented that in #517. These are the results with a dump from a production DB (50k PipelineRuns, 400k TaskRuns, filtering the TaskRuns by label):
kubearchive=# SELECT COUNT(*) FROM resource WHERE kind='PipelineRun' AND api_version='tekton.dev/v1beta1' AND namespace='tekton-ci';
count
-------
49704
(1 row)
kubearchive=# SELECT COUNT(*) FROM resource WHERE kind='TaskRun' AND api_version='tekton.dev/v1beta1' AND namespace='tekton-ci';
count
--------
405240
(1 row)
kubearchive=# EXPLAIN ANALYZE SELECT name FROM resource WHERE kind='TaskRun' AND api_version='tekton.dev/v1beta1' AND namespace='tekton-ci' AND data->'metadata'->'labels' @> '{"tekton.dev/pipelineRun": "build-definitions-pull-request-l8hl5"}';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on resource (cost=23.34..1579.02 rows=9 width=49) (actual time=0.070..0.349 rows=7 loops=1)
Recheck Cond: (((data -> 'metadata'::text) -> 'labels'::text) @> '{"tekton.dev/pipelineRun": "build-definitions-pull-request-l8hl5"}'::jsonb)
Filter: (((kind)::text = 'TaskRun'::text) AND ((api_version)::text = 'tekton.dev/v1beta1'::text) AND ((namespace)::text = 'tekton-ci'::text))
Heap Blocks: exact=5
-> Bitmap Index Scan on idx_json_labels (cost=0.00..23.34 rows=395 width=0) (actual time=0.013..0.013 rows=7 loops=1)
Index Cond: (((data -> 'metadata'::text) -> 'labels'::text) @> '{"tekton.dev/pipelineRun": "build-definitions-pull-request-l8hl5"}'::jsonb)
Planning Time: 0.248 ms
Execution Time: 0.370 ms
(8 rows)
In https://github.com/kubearchive/kubearchive/pull/518 @ggallen added the indexes for annotations and labels. @skoved, do you think those generic indexes aren't enough?
EDIT: my point was already made by @rh-hemartin; we commented at the same time but he was a little faster :)
See also this thread, where I have an example of a label query against the 7.8 million records we got from Tekton Results. Performance shouldn't be an issue.
Repeating it here in public:
I have added these three indexes for JSON. I believe these cover our current use cases.
CREATE INDEX IF NOT EXISTS idx_json_annotations ON public.resource USING gin ((data->'metadata'->'annotations') jsonb_path_ops);
CREATE INDEX IF NOT EXISTS idx_json_labels ON public.resource USING gin ((data->'metadata'->'labels') jsonb_path_ops);
CREATE INDEX IF NOT EXISTS idx_json_owners ON public.resource USING gin ((data->'metadata'->'ownerReferences') jsonb_path_ops);
Loading our 7.8 million records is fast:
real 40m23.834s
user 0m43.411s
sys 0m47.307s
Retrieving data is straightforward and fast. Here's an example querying for all resources that have the label tekton.dev/pipeline: build-definitions-pull-request-l8hl5:
kubearchive=> SELECT uuid FROM public.resource WHERE data->'metadata'->'labels' @> '{"tekton.dev/pipeline": "build-definitions-pull-request-l8hl5"}';
uuid
--------------------------------------
44da663d-6e10-424e-9786-4777ed0389bd
00b606ce-4cd6-428b-a9e3-e32d1abddce0
f40c45c7-b70d-4855-a2be-0868aa5c0c6b
2d893024-4ed4-4318-a12d-13cb7f6562bc
32a1bfb9-2cb9-49c1-b850-a18aef6f51c4
15990af8-ee88-4697-991b-6d611f20f036
98a82765-23c3-4b67-a827-d251a6ef052d
d00b76f1-fea7-4079-a316-5431ccbcb578
(8 rows)
Time: 4.680 ms
kubearchive=>
Annotations and owners can be done similarly.
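For example, containment queries against the annotation and owner indexes would look like the following (a sketch only; the annotation key and owner object here are made up for illustration):

```sql
-- Find resources carrying a specific annotation (uses idx_json_annotations).
SELECT uuid FROM public.resource
WHERE data->'metadata'->'annotations' @> '{"example.com/team": "build"}';

-- Find resources owned by a given object (uses idx_json_owners).
-- ownerReferences is a JSON array, so the @> argument is an array too:
-- it matches rows where some element contains this kind/name pair.
SELECT uuid FROM public.resource
WHERE data->'metadata'->'ownerReferences' @> '[{"kind": "PipelineRun", "name": "build-definitions-pull-request-l8hl5"}]';
```

Both queries use the `@>` containment operator, which is exactly what the `jsonb_path_ops` GIN indexes support.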
I don't think performance is an issue here.
Feel free to ignore my last comment. I didn't realize that we had set up indexing for data.metadata.labels and determined that it was performant enough for now.
As a user I want to be able to filter the retrieved resources by `annotations` and/or `labels`.

Acceptance Criteria:
- Add `annotationSelector` and `labelSelector` query parameters to the API.
- `annotationSelector` and `labelSelector` will accept a comma-separated list of strings in the form `key=value`, e.g. `/api/v1/pods?labelSelector=key2=value2,key1=value1`. This is the same way the Kubernetes API handles label selectors.
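As a sketch of how such a selector could map onto the existing label index, the example request above could translate into a single JSONB containment query (illustrative only; `key1`/`key2` are placeholder label names, and the `kind`/`api_version` filters assume the pods endpoint):

```sql
-- /api/v1/pods?labelSelector=key2=value2,key1=value1
-- The comma-separated key=value pairs merge into one containment document,
-- so all pairs must match (AND semantics) and idx_json_labels can be used.
SELECT uuid FROM public.resource
WHERE kind = 'Pod'
  AND api_version = 'v1'
  AND data->'metadata'->'labels' @> '{"key1": "value1", "key2": "value2"}';
```

Building one containment document per request keeps the query shape identical to the benchmarked examples above, so the GIN index behavior should carry over.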