Allow the filters `annotations` and `labels` to retrieve resources

maruiz93 commented 1 month ago

As a user I want to be able to filter the retrieved resources by annotations and/or labels.

Acceptance Criteria

The different endpoints (except the queries by name) should accept the queryParams annotationSelector and labelSelector
Both annotationSelector and labelSelector will accept a list of strings in the form of key=value, e.g: /api/v1/pods?labelSelector=key2=Dvalue2,key1=value1. This is the same way that Kubernetes API handle the labelSelectors.
The retrieved resources should be the only ones that contain among their annotations/labels the ones provided in the filter

skoved commented 1 month ago

whatever query(ies) that we end up using to retrieve the resource from the database based on labels or annotations needs to be performant when there are 100,000 or 1,000,000 resources in the database. There are 2 approaches that I've thought about for making this performant:

Indexing labels - this would require the cluster admin to specify what labels should be indexed
Adding a label table and label_resource junction table - this would make the database schema more complex but would potentially allow for optimization that does not require user input when setting up the database

rh-hemartin commented 1 month ago

We already are indexing on labels, we implemented them at #517 . These are the results with a dump from a production DB (50k pipeline runs, 400k taskruns, filtering by label on the taskruns):

kubearchive=# SELECT COUNT(*) FROM resource WHERE kind='PipelineRun' AND api_version='tekton.dev/v1beta1' AND namespace='tekton-ci';
 count 
-------
 49704
(1 row)

kubearchive=# SELECT COUNT(*) FROM resource WHERE kind='TaskRun' AND api_version='tekton.dev/v1beta1' AND namespace='tekton-ci';
 count  
--------
 405240
(1 row)

kubearchive=# EXPLAIN ANALYZE SELECT name FROM resource WHERE kind='TaskRun' AND api_version='tekton.dev/v1beta1' AND namespace='tekton-ci' AND data->'metadata'->'labels' @> '{"tekton.dev/pipelineRun": "build-definitions-pull-request-l8hl5"}';
                                                                     QUERY PLAN                                                                      
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on resource  (cost=23.34..1579.02 rows=9 width=49) (actual time=0.070..0.349 rows=7 loops=1)
   Recheck Cond: (((data -> 'metadata'::text) -> 'labels'::text) @> '{"tekton.dev/pipelineRun": "build-definitions-pull-request-l8hl5"}'::jsonb)
   Filter: (((kind)::text = 'TaskRun'::text) AND ((api_version)::text = 'tekton.dev/v1beta1'::text) AND ((namespace)::text = 'tekton-ci'::text))
   Heap Blocks: exact=5
   ->  Bitmap Index Scan on idx_json_labels  (cost=0.00..23.34 rows=395 width=0) (actual time=0.013..0.013 rows=7 loops=1)
         Index Cond: (((data -> 'metadata'::text) -> 'labels'::text) @> '{"tekton.dev/pipelineRun": "build-definitions-pull-request-l8hl5"}'::jsonb)
 Planning Time: 0.248 ms
 Execution Time: 0.370 ms
(8 rows)

maruiz93 commented 1 month ago

In https://github.com/kubearchive/kubearchive/pull/518 @ggallen added the indexes for annotations and labels. @skoved do you think that those generic labels shouldn't be enough?

EDIT: my point was already made by @rh-hemartin , we commented at the same time but he was a little faster :)

ggallen commented 1 month ago

See also this thread, where I have an example of a label query against the 7.8 million records we got from Tekton Results. Performance shouldn't be an issue.

Repeating it here in public:

I have added these three indexes for JSON.  I believe these cover our current use cases.

CREATE INDEX IF NOT EXISTS idx_json_annotations ON public.resource USING gin ((data->'metadata'->'annotations') jsonb_path_ops);
CREATE INDEX IF NOT EXISTS idx_json_labels ON public.resource USING gin ((data->'metadata'->'labels') jsonb_path_ops);
CREATE INDEX IF NOT EXISTS idx_json_owners ON public.resource USING gin ((data->'metadata'->'ownerReferences') jsonb_path_ops);

Loading our 7.8 million records is fast:

real    40m23.834s
user    0m43.411s
sys    0m47.307s

Retrieving data is straightforward and fast. Here's an example querying for all resource that have the label tekton-dev/pipeline: build-definitions-pull-request-l8hl5 :

kubearchive=> SELECT uuid FROM public.resource WHERE data->'metadata'->'labels' @> '{"tekton.dev/pipeline": "build-definitions-pull-request-l8hl5"}';
                 uuid                 
--------------------------------------
 44da663d-6e10-424e-9786-4777ed0389bd
 00b606ce-4cd6-428b-a9e3-e32d1abddce0
 f40c45c7-b70d-4855-a2be-0868aa5c0c6b
 2d893024-4ed4-4318-a12d-13cb7f6562bc
 32a1bfb9-2cb9-49c1-b850-a18aef6f51c4
 15990af8-ee88-4697-991b-6d611f20f036
 98a82765-23c3-4b67-a827-d251a6ef052d
 d00b76f1-fea7-4079-a316-5431ccbcb578
(8 rows)

Time: 4.680 ms
kubearchive=> 

Annotations and owners can be done similarly.

I don't think performance is an issue here.

skoved commented 1 month ago

feel free to ignore my last comment. I didn't realize that we had set up indexing for data.metadata.labels and determined that it was performant enough for now

kubearchive / kubearchive

Allow the filters `annotations` and `labels` to retrieve resources #567

Acceptance Criteria