Thank you for the improvement, this makes sense.
One question: does `ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb` convert the existing data?
Hi @Vad1mo ,
Important points:
PostgreSQL offers two types for storing JSON data: json and jsonb. The json and jsonb data types accept almost identical sets of values as input. The json data type stores an exact copy of the input text, which processing functions must reparse on each execution; while jsonb data is stored in a decomposed binary format that makes it slightly slower to input due to added conversion overhead, but significantly faster to process, since no reparsing is needed. jsonb also supports indexing, which can be a significant advantage.
In summary: you can migrate a JSON column to JSONB simply by altering the table. However, it's not possible to revert back, since JSONB does not store the JSON in its raw format. If you want to be able to revert, you should first make a copy of the table:
```
create table task_aux as select * from task;
```
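Putting those pieces together, here is a minimal sketch of the migrate-and-revert flow (the `task_aux` name comes from the copy above; the restore step is illustrative and assumes ids stay stable in between; as noted, the ALTER converts existing rows in place):

```
-- Keep a copy of the raw json first: jsonb does not preserve the original
-- text, so this copy is the only way back to the exact input.
CREATE TABLE task_aux AS SELECT * FROM task;

-- Migrate in place; PostgreSQL converts all existing rows automatically.
ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb;

-- To revert (illustrative): switch the type back, then restore the original
-- text from the saved copy, matching rows by primary key.
ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE json USING extra_attrs::json;
UPDATE task t SET extra_attrs = a.extra_attrs FROM task_aux a WHERE t.id = a.id;
```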
Hi, we probably have the same issue. Since we updated Harbor from 2.10.0 to 2.10.2, we see a lot of slowness in Harbor when we access the artifacts page in the UI, and the same with the artifacts API. We have seen huge CPU usage on our PostgreSQL database since the update. It looks like it's due to this change: https://github.com/goharbor/harbor/pull/20169/files#diff-529262240b8fd8c4239ff9411c3c13ae66343e674ff7ce1fa4978f692079089eR118
EXPLAIN output for one of the `SELECT * FROM task` queries:
Difference in CPU usage between today (yellow line) and the previous week (gray dotted line):
Thank you for the update; now I think everything is clear. Anyway, my recommendation is to change the data type (to jsonb) instead of performing the conversion inside the query. In our performance tests, changing the data type reduced CPU consumption by almost 50% compared to how it was implemented before.
Hi @peresureda, we tried it on our staging env; indeed, with the conversion to jsonb and the added index it's much faster, thank you.
@wy65701436 @Vad1mo do you think this change could be included in a 2.10 version? Thank you.
Query analysis:
Before applying the fix:
Table "public.task"
Column | Type | Collation | Nullable | Default
-----------------+-----------------------------+-----------+----------+----------------------------------
id | integer | | not null | nextval('task_id_seq'::regclass)
execution_id | integer | | not null |
job_id | character varying(64) | | |
status | character varying(16) | | not null |
status_code | integer | | not null |
status_revision | integer | | |
status_message | text | | |
run_count | integer | | |
extra_attrs | jsonb | | |
creation_time | timestamp without time zone | | | now()
start_time | timestamp without time zone | | |
update_time | timestamp without time zone | | |
end_time | timestamp without time zone | | |
vendor_type | character varying(64) | | not null |
Indexes:
"task_pkey" PRIMARY KEY, btree (id)
"idx_task_extra_attrs_report_uuids" gin ((extra_attrs -> 'report_uuids'::text))```
When you run a query like the one below, the result of the EXPLAIN command indicates a seq scan. It means the index "idx_task_extra_attrs_report_uuids" defined on the task table is not used.
```explain SELECT * FROM task WHERE extra_attrs::jsonb @> cast('{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}' as jsonb );```
If you just apply:
```ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb```
you might think the index will now be used, but that's not the case: it's still a seq scan.
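As an aside (standard PostgreSQL tooling rather than anything from this thread), the cumulative statistics views make this easy to confirm: `pg_stat_user_indexes` keeps a per-index scan counter, which stays at zero while the planner keeps choosing sequential scans:

```
-- idx_scan counts how many times each index on task has been used;
-- a value that never moves for idx_task_extra_attrs_report_uuids
-- confirms the seq-scan behaviour described above.
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'task';
```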
After applying the new index:
```CREATE INDEX indx_task_jsonb ON task USING GIN(extra_attrs jsonb_path_ops);```
Here is the result of the EXPLAIN command:
```explain SELECT * FROM task WHERE extra_attrs::jsonb @> cast('{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}' as jsonb );
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on task (cost=8.01..10.02 rows=1 width=253)
Recheck Cond: (extra_attrs @> '{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}'::jsonb)
-> Bitmap Index Scan on indx_task_jsonb (cost=0.00..8.01 rows=1 width=0)
Index Cond: (extra_attrs @> '{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}'::jsonb)
(4 rows)```
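The reason is worth spelling out: a GIN expression index on `(extra_attrs -> 'report_uuids')` can only serve predicates written against that exact expression, while `GIN (extra_attrs jsonb_path_ops)` serves `@>` containment on the whole column. A sketch of which query shape matches which index (assuming the jsonb column and both indexes shown above):

```
-- Served by idx_task_extra_attrs_report_uuids (index on the extracted key):
EXPLAIN SELECT * FROM task
WHERE extra_attrs -> 'report_uuids' @> '["6526b828-f6ef-4864-a27e-9b0dc38f1502"]';

-- Served by indx_task_jsonb (jsonb_path_ops over the whole column):
EXPLAIN SELECT * FROM task
WHERE extra_attrs @> '{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}';
```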
In our staging environment here is the performance impact of the modification :
Number of rows in the task table = 13k rows
Average of the painful query before = 175 ms/query
Average of the painful query after (ALTER +INDEX) = 0.25 ms/query
When the task table is well filled (more than 300k rows, like our production), the query average is around 1 s.
Modifying the type of the column means we have to check all the internal services of Harbor that might use this table.
It's not easy to evaluate the impact.
Here is a query that helps validate the correct interactions between the task and execution tables:
```WITH task_error AS (
  SELECT b.execution_id, count(b.id) AS cnt FROM task AS b WHERE b.status = 'Error' GROUP BY b.vendor_type, b.execution_id
), task_success AS (
  SELECT b.execution_id, count(b.id) AS cnt FROM task AS b WHERE b.status = 'Success' GROUP BY b.vendor_type, b.execution_id
), task_stopped AS (
  SELECT b.execution_id, count(b.id) AS cnt FROM task AS b WHERE b.status = 'Stopped' GROUP BY b.vendor_type, b.execution_id
)
SELECT a.vendor_type, to_char(a.start_time, 'YYYY-MM-DD'), a.status,
       count(DISTINCT a.id) AS nb_execution,
       SUM(task_error.cnt) AS task_error_cnt,
       SUM(task_success.cnt) AS task_success_cnt,
       SUM(task_stopped.cnt) AS task_stopped_cnt
FROM execution AS a
LEFT JOIN task_error ON task_error.execution_id = a.id
LEFT JOIN task_success ON task_success.execution_id = a.id
LEFT JOIN task_stopped ON task_stopped.execution_id = a.id
GROUP BY a.vendor_type, to_char(a.start_time, 'YYYY-MM-DD'), a.status
ORDER BY to_char(a.start_time, 'YYYY-MM-DD') DESC LIMIT 20;```
In order to validate the performance fix, this query should indicate no regression for any execution/task dependency.
As of now, in our staging environment, the fix seems to have no negative impact, but it needs to be confirmed by the maintainers.
``` vendor_type | to_char | status | nb_execution | task_error_cnt | task_success_cnt | task_stopped_cnt
--------------------+------------+---------+--------------+----------------+------------------+------------------
EXECUTION_SWEEP | 2024-05-31 | Success | 3 | | 3 |
GARBAGE_COLLECTION | 2024-05-31 | Success | 1 | | 1 |
IMAGE_SCAN | 2024-05-31 | Running | 7 | | |
REPLICATION | 2024-05-31 | Success | 10 | | 16 |
SCAN_ALL | 2024-05-31 | Running | 1 | | 2082 |
SCHEDULER | 2024-05-31 | Running | 1 | | | ```
An easy way to reproduce the performance issue is to create a replication rule that scans all the Harbor projects.
Example of a configuration with a single check on a label:
![image](https://github.com/goharbor/harbor/assets/57966036/ee5bec28-e14a-4202-bdd4-bd3fd0cb4a33)
The fix seems legitimate and operational; we just need confirmation from the maintainers and to wait for a new release.
@stonezdj @wy65701436, thank you for the swift fix of the issues.
It is not clear to me why, in addition to changing the query, we cannot also change the data type for generally better performance:
```
ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb;
CREATE INDEX indx_task_jsonb ON task USING GIN(extra_attrs jsonb_path_ops);
```
We are reverting to the previous situation, where we cast the object on each execution in order to use an index. Tests show that CPU consumption dropped sharply once we avoided this cast and converted the data type. Am I missing any place where this field needs to stay JSON and cannot be migrated to JSONB?
@stonezdj, @wy65701436, any thoughts?
I would also suggest transitioning to the jsonb data type; it is clearly the recommended option for dealing with JSON in PostgreSQL, specifically for performance reasons. There are also other opportunities to switch to jsonb.
The PR merged in 2.11 just fixed the unused-index issue. For converting json to jsonb and adding more indexes, we need to do more testing to verify it.
Since we didn't encounter any performance issues prior to v2.10.1, the index on report_uuids has proven beneficial. The recent commit did not leverage this existing index, leading to performance problems.
The fix is to rewrite the SQL to ensure the DB engine utilizes the report_uuids index (see the sketch below). This approach avoids updating the DB schema, which would require additional upgrade work and testing.
This fix will be in v2.9.5, v2.10.3 and v2.11.0.
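Concretely (reconstructed from the queries quoted elsewhere in this thread, not from the PR diff itself), the rewrite changes the predicate shape so it matches the expression the existing index was built on:

```
-- Regressed shape: containment on the whole column; the expression index
-- on (extra_attrs -> 'report_uuids') cannot serve this predicate.
SELECT * FROM task WHERE extra_attrs::jsonb @> cast($1 AS jsonb);

-- Fixed shape: the predicate targets the indexed expression, so
-- idx_task_extra_attrs_report_uuids is usable again.
SELECT * FROM task WHERE extra_attrs::jsonb -> 'report_uuids' @> $1;
```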
@stonezdj @wy65701436, thank you for the extra info; as a quick fix this is a good way forward.
However, I am quite confident that we can optimize this further.
@flbla @thcdrt @peresureda, you have rather large data sets; can you provide some data points?
This will also open up opportunities for further improvements, as we have other json data that we query.
Hi @Vad1mo
This morning I ran extra tests with the different scenarios discussed, and I can provide more data:
Execution Plan:
```
explain SELECT * FROM task WHERE extra_attrs::jsonb -> 'report_uuids' @> '{"artifact_id":10216,"artifact_tag":"1.585.0","report_uuids":["94aed6ca-8ac4-42c8-9325-80e70bf8040f"],"robot_id":14160}'

Bitmap Heap Scan on task  (cost=68.01..75.71 rows=2 width=220)
  Recheck Cond: (((extra_attrs)::jsonb -> 'report_uuids'::text) @> '{"robot_id": 14160, "artifact_id": 10216, "artifact_tag": "1.585.0", "report_uuids": ["94aed6ca-8ac4-42c8-9325-80e70bf8040f"]}'::jsonb)
  ->  Bitmap Index Scan on idx_task_extra_attrs_report_uuids  (cost=0.00..68.01 rows=2 width=0)
        Index Cond: (((extra_attrs)::jsonb -> 'report_uuids'::text) @> '{"robot_id": 14160, "artifact_id": 10216, "artifact_tag": "1.585.0", "report_uuids": ["94aed6ca-8ac4-42c8-9325-80e70bf8040f"]}'::jsonb)
```

```
explain SELECT * FROM task WHERE extra_attrs::jsonb -> 'report_uuids' @> '{"artifact_id":10216,"artifact_tag":"1.585.0","report_uuids":["94aed6ca-8ac4-42c8-9325-80e70bf8040f"],"robot_id":14160}'

Bitmap Heap Scan on task  (cost=72.01..79.73 rows=2 width=250)
  Recheck Cond: ((extra_attrs -> 'report_uuids'::text) @> '{"robot_id": 14160, "artifact_id": 10216, "artifact_tag": "1.585.0", "report_uuids": ["94aed6ca-8ac4-42c8-9325-80e70bf8040f"]}'::jsonb)
  ->  Bitmap Index Scan on idx_task_extra_attrs_report_uuids  (cost=0.00..72.01 rows=2 width=0)
        Index Cond: ((extra_attrs -> 'report_uuids'::text) @> '{"robot_id": 14160, "artifact_id": 10216, "artifact_tag": "1.585.0", "report_uuids": ["94aed6ca-8ac4-42c8-9325-80e70bf8040f"]}'::jsonb)
```

```
explain SELECT * FROM task WHERE extra_attrs::jsonb @> cast( '{"artifact_id":10216,"artifact_tag":"1.585.0","report_uuids":["94aed6ca-8ac4-42c8-9325-80e70bf8040f"],"robot_id":14160}' as jsonb )

Bitmap Heap Scan on task  (cost=40.01..47.72 rows=2 width=250)
  Recheck Cond: (extra_attrs @> '{"robot_id": 14160, "artifact_id": 10216, "artifact_tag": "1.585.0", "report_uuids": ["94aed6ca-8ac4-42c8-9325-80e70bf8040f"]}'::jsonb)
  ->  Bitmap Index Scan on indx_task_jsonb  (cost=0.00..40.01 rows=2 width=0)
        Index Cond: (extra_attrs @> '{"robot_id": 14160, "artifact_id": 10216, "artifact_tag": "1.585.0", "report_uuids": ["94aed6ca-8ac4-42c8-9325-80e70bf8040f"]}'::jsonb)
```
Hi @Vad1mo,
I built a version 2.10 with the rollback of the SELECT FROM task query.
I deployed it on my staging environment: ~7K artifacts (for comparison, our prod env has around 125K artifacts).
I reset my pg_stat_statements view before I started my tests.
The size of the task table when I started my tests was 11313 rows.
Query | Tot_Calls | Tot_Time | Avg_Time |
---|---|---|---|
SELECT * FROM task WHERE extra_attrs::jsonb -> $2 @> $1 | 1056 | 87428.30858300008 | 82.79195888541675 |
After I launched some "GC", "Scan All" and "Replications" actions, the size of the task table was 10408 rows:
Query | Tot_Calls | Tot_Time | Avg_Time |
---|---|---|---|
SELECT * FROM task WHERE extra_attrs::jsonb -> $2 @> $1 | 10012 | 1864485.1329499988 | 186.22504324310816 |
Then I tested with the same built image, plus the change of the data type to jsonb, the added indexes, and a reset of pg_stat_statements:
```
ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb;
CREATE INDEX indx_task_jsonb ON task USING GIN(extra_attrs jsonb_path_ops);
CREATE INDEX idx_task_extra_attrs_report_uuids ON public.task USING gin (((extra_attrs -> 'report_uuids'::text)));
```
Query | Tot_Calls | Tot_Time | Avg_Time |
---|---|---|---|
SELECT * FROM task WHERE extra_attrs::jsonb -> $2 @> $1 | 406 | 185.97064199999988 | 0.45805576847290613 |
After I launched some "GC", "Scan All" and "Replications" actions, the size of the task table was 10414 rows:
Query | Tot_Calls | Tot_Time | Avg_Time |
---|---|---|---|
SELECT * FROM task WHERE extra_attrs::jsonb -> $2 @> $1 | 8526 | 2642.1774179999943 | 0.30989648346234977 |
As you can see, before the switch to jsonb and the creation of the indexes, the average time increases with the number of actions performed on Harbor; after the switch, the average time is under 0.5 ms and does not change with the number of actions performed on Harbor.
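For reference, the Tot_Calls/Tot_Time/Avg_Time columns above come from pg_stat_statements. A sketch of the reset and report queries behind this kind of measurement (column names assume PostgreSQL 13+, where the timing columns are total_exec_time/mean_exec_time; older versions call them total_time/mean_time):

```
-- Clear accumulated statistics before a test run.
SELECT pg_stat_statements_reset();

-- Per-query totals and averages, slowest first.
SELECT query,
       calls           AS tot_calls,
       total_exec_time AS tot_time,
       mean_exec_time  AS avg_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```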
I then switched back to the 2.10.2 version (the version with the slowness), kept the jsonb type and the indexes, and reset pg_stat_statements. Task table size: 10415 rows.
Query | Tot_Calls | Tot_Time | Avg_Time |
---|---|---|---|
SELECT * FROM task WHERE extra_attrs::jsonb @> cast( $1 as j | 61 | 15.206537 | 0.2492874918032787 |
After some actions on Harbor (Scan All, GC, replications, ...), task table size: 12282 rows
Query | Tot_Calls | Tot_Time | Avg_Time |
---|---|---|---|
SELECT * FROM task WHERE extra_attrs::jsonb @> cast( $1 as j | 3520 | 1545.4175580000006 | 0.4390390789772729 |
```
explain SELECT * FROM task WHERE extra_attrs::jsonb @> cast('{"report_uuids": ["90246945-5f4d-4b04-b57a-c89a0283b487"]}' as jsonb );
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on task  (cost=8.01..10.02 rows=1 width=254)
   Recheck Cond: (extra_attrs @> '{"report_uuids": ["90246945-5f4d-4b04-b57a-c89a0283b487"]}'::jsonb)
   ->  Bitmap Index Scan on indx_task_jsonb  (cost=0.00..8.01 rows=1 width=0)
         Index Cond: (extra_attrs @> '{"report_uuids": ["90246945-5f4d-4b04-b57a-c89a0283b487"]}'::jsonb)
(4 rows)
```
So from my point of view, it would be good to do the following:
```
ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb;
CREATE INDEX indx_task_jsonb ON task USING GIN(extra_attrs jsonb_path_ops);
CREATE INDEX idx_task_extra_attrs_report_uuids ON public.task USING gin (((extra_attrs -> 'report_uuids'::text)));
```
We have reverted to using the index defined in the execution table. Please upgrade to the latest version (v2.11) to validate the performance changes.
If the performance still does not meet expectations, we can continue considering the use of the JSONB type. However, based on our testing results, we have observed a significant performance improvement after reverting.
Hi @wy65701436 ,
After reverting the manual changes (changing the type from JSONB back to JSON and removing the index) and applying the patch, we have observed that the performance of the ScanAll process is very similar to what we had with the JSONB + index improvement. Where there is a small difference is in resource usage: as we mentioned, the execution cost of having to transform to JSONB at runtime is higher than already having the data in that format.
CPU usage during ScanAll before the upgrade, with jsonb + index:
CPU usage during ScanAll on 2.11, without the manual improvements:
From here, the decision is yours. Regards!
@peresureda thanks for your data; it's helpful for our further investigation.
The performance problem was resolved after upgrading to v2.11, but you still see high CPU usage; do I understand correctly? If so, I will investigate further.
cc @stonezdj
I created 1 million records in the task table and ran the following test.
Operation/query | Before(json) | After(jsonb+gin index) |
---|---|---|
ScanAll | 25s | 26s |
SELECT * FROM task WHERE extra_attrs::jsonb -> 'report_uuids' | 0.7ms | 0.6ms |
```
(Before) Aug 14 05:20:20 ubuntu core[2185963]: [ORM]2024/08/14 09:20:20 -[Queries/default] - [ OK / db.Query / 0.7ms] - [SELECT * FROM task WHERE extra_attrs::jsonb -> 'report_uuids' @> $1] - "0e9f119d-30ba-46b6-9b79-4f7df6b06cea"
(After)  Aug 14 05:22:30 ubuntu core[2185963]: [ORM]2024/08/14 09:22:30 -[Queries/default] - [ OK / db.Query / 0.6ms] - [SELECT * FROM task WHERE extra_attrs::jsonb -> 'report_uuids' @> $1] - "0e9f119d-30ba-46b6-9b79-4f7df6b06cea"
```
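For anyone reproducing this setup, here is a hypothetical data generator (not the actual script used; it assumes an execution row with id 1 exists to satisfy the foreign key, and PostgreSQL 13+ for gen_random_uuid()):

```
-- Fill task with one million rows whose extra_attrs carries a random
-- report_uuids array (cast ::json here; use ::jsonb after the migration).
INSERT INTO task (execution_id, status, status_code, vendor_type, extra_attrs)
SELECT 1, 'Success', 3, 'IMAGE_SCAN',
       ('{"report_uuids": ["' || gen_random_uuid() || '"]}')::json
FROM generate_series(1, 1000000);
```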
The CPU/memory comparison. Before:
After:
Tested ScanAll with 10000 artifacts.
The performance improvement doesn't seem very significant.
@stonezdj did you create both indexes?
```
CREATE INDEX indx_task_jsonb ON task USING GIN(extra_attrs jsonb_path_ops);
CREATE INDEX idx_task_extra_attrs_report_uuids ON public.task USING gin (((extra_attrs -> 'report_uuids'::text)));
```
Yes. After the change it should be:
```
registry=# \d task;
                                      Table "public.task"
     Column      |            Type             | Collation | Nullable |             Default
-----------------+-----------------------------+-----------+----------+----------------------------------
 id              | integer                     |           | not null | nextval('task_id_seq'::regclass)
 execution_id    | integer                     |           | not null |
 job_id          | character varying(64)       |           |          |
 status          | character varying(16)       |           | not null |
 status_code     | integer                     |           | not null |
 status_revision | integer                     |           |          |
 status_message  | text                        |           |          |
 run_count       | integer                     |           |          |
 extra_attrs     | jsonb                       |           |          |
 creation_time   | timestamp without time zone |           |          | CURRENT_TIMESTAMP
 start_time      | timestamp without time zone |           |          |
 update_time     | timestamp without time zone |           |          |
 end_time        | timestamp without time zone |           |          |
 vendor_type     | character varying(64)       |           | not null |
Indexes:
    "task_pkey" PRIMARY KEY, btree (id)
    "idx_task_extra_attrs_report_uuids" gin ((extra_attrs -> 'report_uuids'::text))
    "idx_task_job_id" btree (job_id)
    "indx_task_jsonb" gin (extra_attrs jsonb_path_ops)
    "task_execution_id_idx" btree (execution_id)
Foreign-key constraints:
    "task_execution_id_fkey" FOREIGN KEY (execution_id) REFERENCES execution(id)
```
I don't remember about the CPU and memory, but in my previous tests I had much better times with jsonb + GIN indexes: https://github.com/goharbor/harbor/issues/20505#issuecomment-2167736595
Did you try running `explain SELECT * FROM task WHERE extra_attrs::jsonb -> 'report_uuids' @> '"0e9f119d-30ba-46b6-9b79-4f7df6b06cea"'` to be sure it uses the correct indexes?
SQL: `SELECT * FROM task WHERE extra_attrs::jsonb -> 'report_uuids' @> '"0e9f119d-30ba-46b6-9b79-4f7df6b06cea"'` hits the index both before and after, and their performance shows no difference.
"idx_task_extra_attrs_report_uuids" gin ((extra_attrs::jsonb -> 'report_uuids'::text))
The performance of this SQL doesn't impact the ScanAll API; it only impacts the list API.
If any user wants to apply this change, it is OK to apply it in the database directly; no code change is required: https://github.com/goharbor/harbor/issues/20505#issuecomment-2147288490
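For operators doing that by hand, a small wrapper worth considering (the same statements discussed above, just run in one transaction so a failure leaves the schema untouched; DDL is transactional in PostgreSQL):

```
BEGIN;
ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb;
CREATE INDEX indx_task_jsonb ON task USING GIN (extra_attrs jsonb_path_ops);
COMMIT;
```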
**Is your feature request related to a problem? Please describe.** When you have the Trivy scanner activated and want to perform one of the following operations:
The following query is executed in the database:
```
SELECT * FROM task WHERE extra_attrs::jsonb @> cast( $1 as jsonb )
```
This query becomes very heavy for the database CPU, and in some cases, Harbor timeouts when the number of artifacts is large.
This is because the extra_attrs column is of type json, which cannot be GIN-indexed directly.
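A quick illustration of that constraint (general PostgreSQL behaviour, not from the original report): attempting a GIN index directly on a json column fails, which is what forces either a runtime cast or the type change proposed below:

```
-- Fails while extra_attrs is json: GIN operator classes for JSON data
-- are defined for jsonb only.
CREATE INDEX indx_task_jsonb ON task USING GIN (extra_attrs jsonb_path_ops);
-- ERROR:  operator class "jsonb_path_ops" does not accept data type json
```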
Solution Proposal
Since the json and jsonb data types accept the same input and produce the same output, my recommendation is to change the data type to jsonb to allow indexing:
```
ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb;
CREATE INDEX indx_task_jsonb ON task USING GIN(extra_attrs jsonb_path_ops);
```
**Improvements**
With this improvement, we have achieved the following results: