Performance Improvement Proposal for queries related with the Task Table

peresureda commented 6 months ago

**Is your feature request related to a problem? Please describe.** When you have the Trivy scanner activated and want to perform one of the following operations:

List the artifacts of a project.
Execute a ScanAll.

The following query is executed in the database:

SELECT * FROM task WHERE extra_attrs::jsonb @> cast( $1 as jsonb )

This query becomes very heavy for the database CPU, and in some cases, Harbor timeouts when the number of artifacts is large.

This is because the extra_attrs field is of type JSON and cannot have any kind of index on it.

Solution Proposal

Since the JSON and JSONB data types accept the same input and produce the same output, my recommendation is to change the data type to JSONB to allow indexing:

ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb
CREATE INDEX indx_task_jsonb ON task USING GIN(extra_attrs jsonb_path_ops);

Improvements With this improvement, we have achieved the following results:

Reduced artifact listing time from 45 seconds to 2 seconds.
Reduced scan time by 80%.
Reduced CPU consumption during a ScanAll from 60% to 25%.

Vad1mo commented 6 months ago

Thank you for the improvement, this makes sense.

one question does ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb convert existing data?

peresureda commented 6 months ago

Hi @Vad1mo ,

Important points:

JSONB available from: PostgreSQL 9.4 (2014) added support for JSONB data type.
PostgreSQL offers two types for storing JSON data: json and jsonb. The json and jsonb data types accept almost identical sets of values as input. The json data type stores an exact copy of the input text, which processing functions must reparse on each execution; while jsonb data is stored in a decomposed binary format that makes it slightly slower to input due to added conversion overhead, but significantly faster to process, since no reparsing is needed. jsonb also supports indexing, which can be a significant advantage.

At summary: You can migrate a JSON column to JSONB simply by altering the table. However, it's not possible to revert back since JSONB does not store the JSON in raw format. If you want to revert back, you should first make a copy of the table. create table task_aux as select * from task;

flbla commented 6 months ago

Hi, we probably have the same issue. since we updated Harbor from 2.10.0 to 2.10.2, we have a lot of slowness in harbor when we access artefacts page with the UI same with the artifacts API we saw huge CPU usage on our postgres database since the update It looks like its due to this change : https://github.com/goharbor/harbor/pull/20169/files#diff-529262240b8fd8c4239ff9411c3c13ae66343e674ff7ce1fa4978f692079089eR118

explain command on one of the select * from task :

difference in CPU usage between today (yellow line) and previous week (gray dotted line) :

peresureda commented 6 months ago

Thank you for the update. Now I think everything is clear. Anyway, my recommendation is to change the data type ( to jsonb ) instead of performing the conversion to make the query. Performance tests have reduced CPU consumption by almost 50% with the change of the data type to how it was implemented before.

flbla commented 6 months ago

hi @peresureda, we tried it on our staging env, indeed with the conversion to jsonb and the add of the index it's much more faster, thank you

@wy65701436 @Vad1mo do you think this change could be included in a 2.10 version ? thank you

AlenversFr commented 6 months ago

hi @peresureda, we tried it on our staging env, indeed with the conversion to jsonb and the add of the index it's much more faster, thank you

@wy65701436 @Vad1mo do you think this change could be included in a 2.10 version ? thank you

Query analysis :

Before applying the fix :


                                           Table "public.task"
     Column      |            Type             | Collation | Nullable |             Default              
-----------------+-----------------------------+-----------+----------+----------------------------------
 id              | integer                     |           | not null | nextval('task_id_seq'::regclass)
 execution_id    | integer                     |           | not null | 
 job_id          | character varying(64)       |           |          | 
 status          | character varying(16)       |           | not null | 
 status_code     | integer                     |           | not null | 
 status_revision | integer                     |           |          | 
 status_message  | text                        |           |          | 
 run_count       | integer                     |           |          | 
 extra_attrs     | jsonb                       |           |          | 
 creation_time   | timestamp without time zone |           |          | now()
 start_time      | timestamp without time zone |           |          | 
 update_time     | timestamp without time zone |           |          | 
 end_time        | timestamp without time zone |           |          | 
 vendor_type     | character varying(64)       |           | not null | 
Indexes:
    "task_pkey" PRIMARY KEY, btree (id)
    "idx_task_extra_attrs_report_uuids" gin ((extra_attrs -> 'report_uuids'::text))```

when you run a query like just below the result of the EXPLAIN command indicates a seq_scan. It meens there's no use of the index  "idx_task_extra_attrs_report_uuids"  defined with task table.

```explain SELECT * FROM task WHERE extra_attrs::jsonb @> cast('{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}' as jsonb );```

If you just apply :
```ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb```
You might think the index will be used but it's not the case it's still a seq_scan.

After applying the new index :
```CREATE INDEX indx_task_jsonb ON task USING GIN(extra_attrs jsonb_path_ops);```

Here is the result of the explain command : 
```explain SELECT * FROM task WHERE extra_attrs::jsonb @> cast('{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}' as jsonb );
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on task  (cost=8.01..10.02 rows=1 width=253)
   Recheck Cond: (extra_attrs @> '{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}'::jsonb)
   ->  Bitmap Index Scan on indx_task_jsonb  (cost=0.00..8.01 rows=1 width=0)
         Index Cond: (extra_attrs @> '{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}'::jsonb)
(4 rows)```

In our staging environment here is the performance impact of the modification : 
Number of rows in the task table = 13k rows
Average of the painful query before = 175 ms/query
Average of the painful query after (ALTER +INDEX) = 0.25 ms/query

When the task table is nicelly loaded (more than 300k rows like our production), the query avg is around 1s.

Modifying the type of the column means we have to check all the internal services  of Harbor that might use this table.
It's not easy to evaluate the impact.

Here is a query that helps to validate the correct interactions between task and execution tables : 
```WITH task_error as (                                          
    select b.execution_id,count(b.id) as cnt from task as b where b.status='Error' group by b.vendor_type, b.execution_id
), task_success as (
    select b.execution_id,count(b.id) as cnt from task as b where b.status='Success' group by b.vendor_type, b.execution_id
), task_stopped as (
    select b.execution_id,count(b.id) as cnt from task as b where b.status='Stopped' group by b.vendor_type, b.execution_id
)
 select a.vendor_type, to_char(a.start_time,'YYYY-MM-DD'), a.status,
 count(distinct a.id) as nb_execution,
 SUM(task_error.cnt) as task_error_cnt,
 SUM(task_success.cnt) as task_success_cnt,
 SUM(task_stopped.cnt) as task_stopped_cnt
 from execution as a
 left Join task_error on task_error.execution_id=a.id
 left Join task_success on task_success.execution_id=a.id
 left Join task_stopped on task_stopped.execution_id=a.id
 group by a.vendor_type, to_char(a.start_time,'YYYY-MM-DD'), a.status
 order by to_char(a.start_time,'YYYY-MM-DD') desc limit 20;```

 In order to validate the performance fix  :
 This query should indicates no regression for every executions and tasks dependencies.

AS for now in our staging environment the fix seems to have no negative impact but it needs to be confirmed by the maintainers.

```   vendor_type     |  to_char   | status  | nb_execution | task_error_cnt | task_success_cnt | task_stopped_cnt 
--------------------+------------+---------+--------------+----------------+------------------+------------------
 EXECUTION_SWEEP    | 2024-05-31 | Success |            3 |                |                3 |                 
 GARBAGE_COLLECTION | 2024-05-31 | Success |            1 |                |                1 |                 
 IMAGE_SCAN         | 2024-05-31 | Running |            7 |                |                  |                 
 REPLICATION        | 2024-05-31 | Success |           10 |                |               16 |                 
 SCAN_ALL           | 2024-05-31 | Running |            1 |                |             2082 |                 
 SCHEDULER          | 2024-05-31 | Running |            1 |                |                  |  ```

 An easy way to test the performance issue is to create a replication that is scanning all the harbor projects.
 Example of configuration with a single check on a label :

![image](https://github.com/goharbor/harbor/assets/57966036/ee5bec28-e14a-4202-bdd4-bd3fd0cb4a33)

The fix seems legit and operationnal, we just need confirmation from the maintainers and wait for a new release.

Vad1mo commented 5 months ago

@stonezdj @wy65701436, thank you for the swift fix of the issues.

It is not clear to me why, in addition to changing the query, we can not change the Data Type for generally a better performance?

ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb
CREATE INDEX indx_task_jsonb ON task USING GIN(extra_attrs jsonb_path_ops);

peresureda commented 5 months ago

We are reverting to the previous situation where we are casting the object on each execution in order to use an index. Tests show that CPU consumption decreased exponentially by avoiding this cast and converting the data type. Am I missing any place where you need this field to be in JSON and cannot be migrated to JSONB?

@stonezdj , @wy65701436 ¿?

Vad1mo commented 5 months ago

I would also suggest to transition to jsob data type, it is clearly recommended option for dealing with json in pg, specifically for performance reasons. There are also other opertunities to switch to jsonb.

stonezdj commented 5 months ago

The PR merged in 2.11 just fixed the index not used issue, for converting json to jsonb type and more index, we need to do more testing to verify it.

wy65701436 commented 5 months ago

Since we didn't encounter any performance issues prior to v2.10.1, the index on report_uuids has proven beneficial. The recent commit did not leverage this existing index, leading to performance problems.

The fix is to rewrite the SQL to ensure the DB engine utilizes the report_uuids index. This approach avoids updating the DB schema, which would require additional upgrade work and testing

This fix will be in v2.9.5, v2.10.3 and v2.11.0.

Vad1mo commented 5 months ago

@stonezdj @wy65701436, thank you for more info, as a quick fix this is a good way forward.

However, I am quite confident, that we can optimize this further.

@flbla @thcdrt @peresureda, you have a rather large data sets, can you provide some data points.

Switch to DATA TYPE jsonb (we have already datapoints here)
Applied fix #20545
Applied fix #20545 + change of DATA TYPE to jsonb.

This will also open up opportunities for further improvements, as have other json data types present, that we query.

peresureda commented 5 months ago

Hi @Vad1mo

This morning I have done extra tests with the different scenarios commented and I can provide more data:

Applied fix [https://github.com/goharbor/harbor/pull/20545]: In this case, we return to a previous scenario where in our current system, listing the artifacts of a project with around 50 artifacts including information about the CVEs takes us between 7 and 10 seconds.

Execution Plan:

explain SELECT * FROM task WHERE extra_attrs::jsonb -> 'report_uuids' @> '{"artifact_id":10216,"artifact_tag":"1.585.0","report_uuids":["94aed6ca-8ac4-42c8-9325-80e70bf8040f"],"robot_id":14160}'

"Bitmap Heap Scan on task  (**cost=68.01..75.71** rows=2 width=220)"
"  Recheck Cond: (((extra_attrs)::jsonb -> 'report_uuids'::text) @> '{""robot_id"": 14160, ""artifact_id"": 10216, ""artifact_tag"": ""1.585.0"", ""report_uuids"": [""94aed6ca-8ac4-42c8-9325-80e70bf8040f""]}'::jsonb)"
"  ->  Bitmap Index Scan on idx_task_extra_attrs_report_uuids  (cost=0.00..68.01 rows=2 width=0)"
"        Index Cond: (((extra_attrs)::jsonb -> 'report_uuids'::text) @> '{""robot_id"": 14160, ""artifact_id"": 10216, ""artifact_tag"": ""1.585.0"", ""report_uuids"": [""94aed6ca-8ac4-42c8-9325-80e70bf8040f""]}'::jsonb)"

Applied fix https://github.com/goharbor/harbor/pull/20545 + change of DATA TYPE to jsonb: This case doesn't make sense because even if we change the data type, if the query is the provided to the fix, the performance is the same. This is because the way the query is set up doesn't use the JSONB index.

explain SELECT * FROM task WHERE extra_attrs::jsonb -> 'report_uuids' @> '{"artifact_id":10216,"artifact_tag":"1.585.0","report_uuids":["94aed6ca-8ac4-42c8-9325-80e70bf8040f"],"robot_id":14160}'

"Bitmap Heap Scan on task  (**cost=72.01..79.73** rows=2 width=250)"
"  Recheck Cond: ((extra_attrs -> 'report_uuids'::text) @> '{""robot_id"": 14160, ""artifact_id"": 10216, ""artifact_tag"": ""1.585.0"", ""report_uuids"": [""94aed6ca-8ac4-42c8-9325-80e70bf8040f""]}'::jsonb)"
"  ->  Bitmap Index Scan on idx_task_extra_attrs_report_uuids  (cost=0.00..72.01 rows=2 width=0)"
"        Index Cond: ((extra_attrs -> 'report_uuids'::text) @> '{""robot_id"": 14160, ""artifact_id"": 10216, ""artifact_tag"": ""1.585.0"", ""report_uuids"": [""94aed6ca-8ac4-42c8-9325-80e70bf8040f""]}'::jsonb)"

Proposal: Looking at all the results, the real performance improvement is with the previous query and the proposed improvement, as we utilize the JSONB enhancements provided by PostgreSQL. In this case, the previous ListAll is reduced to 2-3 seconds.

explain SELECT * FROM task WHERE extra_attrs::jsonb @> cast( '{"artifact_id":10216,"artifact_tag":"1.585.0","report_uuids":["94aed6ca-8ac4-42c8-9325-80e70bf8040f"],"robot_id":14160}' as jsonb )

"Bitmap Heap Scan on task  (cost=40.01..47.72 rows=2 width=250)"
"  Recheck Cond: (extra_attrs @> '{""robot_id"": 14160, ""artifact_id"": 10216, ""artifact_tag"": ""1.585.0"", ""report_uuids"": [""94aed6ca-8ac4-42c8-9325-80e70bf8040f""]}'::jsonb)"
"  ->  Bitmap Index Scan on indx_task_jsonb  (cost=0.00..40.01 rows=2 width=0)"
"        Index Cond: (extra_attrs @> '{""robot_id"": 14160, ""artifact_id"": 10216, ""artifact_tag"": ""1.585.0"", ""report_uuids"": [""94aed6ca-8ac4-42c8-9325-80e70bf8040f""]}'::jsonb)"

flbla commented 5 months ago

Hi @Vad1mo,

I built a version 2.10 with the rollback on the SELECT FROM task query.
I deployed it on my staging environment: ~7K artifacts (for comparaison, our prod env has arround 125K artifacts) I reset my pg_stat_statements table before I started my tests

The size of the task table when I started my tests was 11313 lines.

Query	Tot_Calls	Tot_Time	Avg_Time
SELECT * FROM task WHERE extra_attrs::jsonb -> $2 @> $1	1056	87428.30858300008	82.79195888541675

After I launched some "GC", "Scan All" and "Replications" actions : Size of table task : 10408 lines

Query	Tot_Calls	Tot_Time	Avg_Time
SELECT * FROM task WHERE extra_attrs::jsonb -> $2 @> $1	10012	1864485.1329499988	186.22504324310816

Then test with same builded image + change of DATA TYPE to jsonb + add indexes + reset my pg_stat_statements table :

ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb;
CREATE INDEX indx_task_jsonb ON task USING GIN(extra_attrs jsonb_path_ops);
CREATE INDEX idx_task_extra_attrs_report_uuids ON public.task USING gin (((extra_attrs -> 'report_uuids'::text)));

Query	Tot_Calls	Tot_Time	Avg_Time
SELECT * FROM task WHERE extra_attrs::jsonb -> $2 @> $1	406	185.97064199999988	0.45805576847290613

After I launched some "GC", "Scan All" and "Replications" actions : Size of table task : 10414 lines

Query	Tot_Calls	Tot_Time	Avg_Time
SELECT * FROM task WHERE extra_attrs::jsonb -> $2 @> $1	8526	2642.1774179999943	0.30989648346234977

As you can see, the average time : BEFORE the switch to jsonb + creation of indexes increase with the number of actions done on harbor AFTER the switch to jsonb + creation of indexes the average time is less than 0.5ms and do not change with the number of actions done on harbor

flbla commented 5 months ago

I then re-switch to the 2.10.2 version (the version with slowness) and kept the jsonb and indexes , reset pg state statement : table task size : 10415 lines

Query	Tot_Calls	Tot_Time	Avg_Time
SELECT * FROM task WHERE extra_attrs::jsonb @> cast( $1 as j	61	15.206537	0.2492874918032787

after some actions on harbor (scan all, gc, replications, ...) : table task size : 12282 lines	Query	Tot_Calls	Tot_Time	Avg_Time
SELECT * FROM task WHERE extra_attrs::jsonb @> cast( $1 as j	3520	1545.4175580000006	0.4390390789772729

explain SELECT * FROM task WHERE extra_attrs::jsonb @> cast('{"report_uuids": ["90246945-5f4d-4b04-b57a-c89a0283b487"]}' as jsonb );
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on task  (cost=8.01..10.02 rows=1 width=254)
   Recheck Cond: (extra_attrs @> '{"report_uuids": ["90246945-5f4d-4b04-b57a-c89a0283b487"]}'::jsonb)
   ->  Bitmap Index Scan on indx_task_jsonb  (cost=0.00..8.01 rows=1 width=0)
         Index Cond: (extra_attrs @> '{"report_uuids": ["90246945-5f4d-4b04-b57a-c89a0283b487"]}'::jsonb)
(4 rows)

flbla commented 5 months ago

So from my point of view, it could be good to do the following:

ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb;
CREATE INDEX indx_task_jsonb ON task USING GIN(extra_attrs jsonb_path_ops);
CREATE INDEX idx_task_extra_attrs_report_uuids ON public.task USING gin (((extra_attrs -> 'report_uuids'::text)));

wy65701436 commented 5 months ago

We have reverted to using the index defined in the execution table. Please upgrade to the latest version (v2.11) to validate the performance changes.

If the performance still does not meet expectations, we can continue considering the use of the JSONB type. However, based on our testing results, we have observed a significant performance improvement after reverting.

peresureda commented 4 months ago

Hi @wy65701436 ,

After reverting the manual changes (changing the type from JSONB to JSON and removing the index) and applying the patch, we have observed that in terms of the performance of the ScanAll process, it is very similar to what we had with the improvement of JSONB + index. Where there is a small change is in resource usage, as we mentioned, the execution cost of having to transform to JSONB at runtime is higher than already having the data in that format.

CPU USage during ScanAll before Upgrade with jsonb + index: Screenshot 2024-07-11 at 09 53 22

CPU USage during ScanAll with 2.11 without manual improvements: Screenshot 2024-07-11 at 09 53 36

From here, the decision is yours. Regards!

wy65701436 commented 3 months ago

@peresureda thanks for your data, it's helpful for us to do the further investigation.

The performance problem was resolved after upgrade to v2.11, but you still see high CPU usage, am I understand correct? If so, I will do more investigation.

cc @stonezdj

stonezdj commented 3 months ago

Created 1 million records in the task. and run following test.

Operation/query	Before(json)	After(jsonb+gin index)
ScanAll	25s	26s
SELECT * FROM task WHERE extra_attrs::jsonb -> 'report_uuids'	0.7ms	0.6ms

(Before) Aug 14 05:20:20 ubuntu core[2185963]: [ORM]2024/08/14 09:20:20  -[Queries/default] - [  OK /    db.Query /     0.7ms] - [SELECT * FROM task WHERE extra_attrs::jsonb -> 'report_uuids' @> $1] - `"0e9f119d-30ba-46b6-9b79-4f7df6b06cea"
(After)Aug 14 05:22:30 ubuntu core[2185963]: [ORM]2024/08/14 09:22:30  -[Queries/default] - [  OK /    db.Query /     0.6ms] - [SELECT * FROM task WHERE extra_attrs::jsonb -> 'report_uuids' @> $1] - `"0e9f119d-30ba-46b6-9b79-4f7df6b06cea"`

The CPU/Memory compare Before before_cpu_memory

After: after_cpu_memory

Test ScanAll with 10000 artifact

The performance improvement doesn't seem very significant.

flbla commented 3 months ago

@stonezdj did you created both indexes ?

CREATE INDEX indx_task_jsonb ON task USING GIN(extra_attrs jsonb_path_ops);
CREATE INDEX idx_task_extra_attrs_report_uuids ON public.task USING gin (((extra_attrs -> 'report_uuids'::text)));

stonezdj commented 3 months ago

Yes, After change it should be:

registry=# \d task;
                                           Table "public.task"
     Column      |            Type             | Collation | Nullable |             Default              
-----------------+-----------------------------+-----------+----------+----------------------------------
 id              | integer                     |           | not null | nextval('task_id_seq'::regclass)
 execution_id    | integer                     |           | not null | 
 job_id          | character varying(64)       |           |          | 
 status          | character varying(16)       |           | not null | 
 status_code     | integer                     |           | not null | 
 status_revision | integer                     |           |          | 
 status_message  | text                        |           |          | 
 run_count       | integer                     |           |          | 
 extra_attrs     | jsonb                       |           |          | 
 creation_time   | timestamp without time zone |           |          | CURRENT_TIMESTAMP
 start_time      | timestamp without time zone |           |          | 
 update_time     | timestamp without time zone |           |          | 
 end_time        | timestamp without time zone |           |          | 
 vendor_type     | character varying(64)       |           | not null | 
Indexes:
    "task_pkey" PRIMARY KEY, btree (id)
    "idx_task_extra_attrs_report_uuids" gin ((extra_attrs -> 'report_uuids'::text))
    "idx_task_job_id" btree (job_id)
    "indx_task_jsonb" gin (extra_attrs jsonb_path_ops)
    "task_execution_id_idx" btree (execution_id)
Foreign-key constraints:
    "task_execution_id_fkey" FOREIGN KEY (execution_id) REFERENCES execution(id)

flbla commented 3 months ago

I don't remember about the CPU and memory but in my previous tests I had really better time with Jsonb + gin indexes : https://github.com/goharbor/harbor/issues/20505#issuecomment-2167736595 did you try to do a explain select * FROM task WHERE extra_attrs::jsonb -> 'report_uuids' @> 0e9f119d-30ba-46b6-9b79-4f7df6b06cea to be sure it uses correct indexes ?

stonezdj commented 3 months ago

SQL:select * FROM task WHERE extra_attrs::jsonb -> 'report_uuids' @> 0e9f119d-30ba-46b6-9b79-4f7df6b06cea before and after hitting index both, and their performance has no difference. "idx_task_extra_attrs_report_uuids" gin ((extra_attrs::jsonb -> 'report_uuids'::text))

The performance of this SQL doesn't impact the ScanAll API, only impact the list.

If any user wants to apply this change, it is OK to apply it in database directly, no code change required: https://github.com/goharbor/harbor/issues/20505#issuecomment-2147288490

github-actions[bot] commented 1 month ago

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

goharbor / harbor

Performance Improvement Proposal for queries related with the Task Table #20505