quay / claircore

foundation modules for scanning container packages and reporting vulnerabilities
https://quay.github.io/claircore/
Apache License 2.0

libvuln POSTGRESQL: Requesting bulk query and a few performance tips for PostgresDB #995

Open vishnuchalla opened 1 year ago

vishnuchalla commented 1 year ago

Description

We changed one of the most time-consuming insert queries into a bulk insert query and noticed a positive difference in DB performance.

POC PR: https://github.com/quay/claircore/pull/996
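
To make the change concrete, below is a minimal sketch of the idea behind the PR, not its actual code (the PR itself changes claircore's Go code paths); the sketch uses Python with psycopg2 purely for illustration, and the package table and its (name, version) columns are hypothetical:

    # Illustrative only: conn is assumed to be an open psycopg2 connection,
    # and the "package" table with (name, version) columns is hypothetical.
    from psycopg2.extras import execute_values

    def insert_rows_one_by_one(conn, rows):
        # The slow pattern: one statement (and one round trip) per row.
        with conn.cursor() as cur:
            for name, version in rows:
                cur.execute(
                    "INSERT INTO package (name, version) VALUES (%s, %s)",
                    (name, version),
                )
        conn.commit()

    def insert_rows_bulk(conn, rows):
        # The bulk pattern: execute_values expands the single %s placeholder into
        # (v1, v2), (v1, v2), ... so the whole batch becomes one multi-row INSERT.
        with conn.cursor() as cur:
            execute_values(
                cur,
                "INSERT INTO package (name, version) VALUES %s",
                rows,
            )
        conn.commit()

Sending one multi-row statement instead of one statement per row is what drives the lower query counts and IOPS discussed below.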

DB Metrics And Observations

Before this change

Dashboard link here in case you want to look at other metrics as well from a sample snapshot.

[Screenshots: Disk IOPS, Clair DB CPU, Clair App CPU]

After this change

Dashboard link here in case you want to look at other metrics as well from a sample snapshot.

[Screenshots: Clair DB Disk IOPS, Clair DB CPU trend, Clair App CPU (after the change)]

If we look at the spikes after this bulk insert change, they are much narrower, IOPS decreased by the thousands, and the Clair application side performs better as well (i.e., reduced I/O wait time for requests). This will ultimately help us reduce cost on both the Clair DB and the application side.

Profiling Report And Observations

Before this change. Complete report here

[Screenshots: General Activity, Average Query Execution Time, Queries Histogram]

After this change. Complete report here

[Screenshots: General Activity, Average Query Execution Time, Queries Histogram (after the change)]

If we look at the average query execution times, there is a 100-fold difference after moving to a bulk insert strategy (i.e., 250 ms down to 2.5 ms). The overall time spent running queries also dropped, along with the number of queries running on the DB, ultimately resulting in fewer IOPS. This keeps the DB less busy, allowing more incoming requests to be served, and helps reduce the spikes seen on both the Clair app and the DB.

Summary And Conclusion

OR

    def approval_query_executor(self, table_id, table_name, columns, approval_type):

        """
        executes approval query for cntl_approval_info in rds

        :param table_id: table id according to cntl_table_info in rds
        :param table_name: table name according to cntl_table_info in rds
        :param columns: columns to be approved
        :param approval_type: type of approval like add, update and delete
        :return: None
        """

        executer = QueryExecuter(self.default_args['rds_host'], self.default_args['rds_port'],
                                 self.default_args['rds_dbName'], self.default_args['rds_password'],
                                 self.default_args['rds_username'])
        approval_info = self.default_args['approval_info']
        cntl_columns_approval_query = """INSERT INTO %s (table_id, src_table_name, src_column_name, column_datatype,
         approval_type, approval_flag, last_updated_date, approved_by, active_flag, status, data_length, data_precision,
          data_scale) VALUES """
        params = (approval_info,)
        if len(columns) > 0:
            # Append one VALUES tuple per column so the whole batch is sent as a single bulk INSERT.
            for column_number in range(0, len(columns)):
                cntl_columns_approval_query += """(%s, '%s', '%s', '%s', '%s', %s, '%s', '%s', %s, '%s', %s, %s, %s),"""
                params += (table_id, table_name, columns[column_number][1], columns[column_number][2], approval_type,
                           0, str(datetime.now()), 'Dejavu-Gatekeeper', 1, 'pending approval for approval in rds',
                           columns[column_number][3], columns[column_number][4], columns[column_number][5])
            # Replace the trailing comma with a terminator and execute the whole statement in one round trip.
            cntl_columns_approval_query = cntl_columns_approval_query[:-1] + ";"
            executer.execute_query(cntl_columns_approval_query, params, approval_info)
        else:
            logging.info("No columns to send for approval to {0} in {1}".format(approval_type, table_name))

    def update_sequence_number(self, new_columns, table_name, target=None):

        """
        updates sequence number of a table in cntl_column_info in rds

        :param new_columns: list of new columns
        :param table_name: table name
        :param target: flag to determine update in source or target sequence number
        :return: None
        """

        executer = QueryExecuter(self.default_args['rds_host'], self.default_args['rds_port'],
                                 self.default_args['rds_dbName'], self.default_args['rds_password'],
                                 self.default_args['rds_username'])
        column_info = self.default_args['column_info']
        columns = list()
        new_columns = tuple(new_columns)
        seq_fashion = 'src_seq_no' if target is None else 'trgt_seq_no'
        if len(new_columns) != 0:
            update_sequence_number_query = """UPDATE %s SET %s = CASE src_column_name """
            params = (column_info, seq_fashion)
            # Build one CASE arm per column so every sequence number is assigned in a single statement.
            for index in range(0, len(new_columns)):
                update_sequence_number_query += """WHEN '%s' THEN %s """
                params += (new_columns[index][1], index + 1)
                columns.append(new_columns[index][1])
            update_sequence_number_query += """END WHERE src_column_name IN """ + str(
                columns).replace('[', '(').replace(']', ')') + """ AND src_table_name='%s' AND src_active_flag=%s
                 AND active_flag=%s;"""
            params += (table_name, 1, 1)
            executer.execute_query(update_sequence_number_query, params, column_info)
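
For readability, here is roughly the SQL the update_sequence_number builder ends up producing (assuming column_info resolves to cntl_column_info, as the docstring suggests); the table and column values below are made up and only meant to show the single-statement shape, where one UPDATE ... CASE assigns every sequence number in a single round trip:

    # Hypothetical rendering of update_sequence_number for two new columns
    # ('order_id', 'order_ts') of a hypothetical 'orders' table:
    bulk_update_sql = """UPDATE cntl_column_info SET src_seq_no = CASE src_column_name
        WHEN 'order_id' THEN 1
        WHEN 'order_ts' THEN 2
    END WHERE src_column_name IN ('order_id', 'order_ts') AND src_table_name='orders' AND src_active_flag=1
         AND active_flag=1;"""

The pattern is the same as the bulk INSERT above: collapse many small statements into one so the database does a single parse/plan/commit cycle instead of one per row.
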
hdonnay commented 1 year ago

Why is the max and average duration so much worse? I don't understand why the graphs don't seem to correlate at all to the numbers in the table. It looks like the execution time is worse in the giant-query version.

vishnuchalla commented 1 year ago

> Why is the max and average duration so much worse? I don't understand why the graphs don't seem to correlate at all to the numbers in the table. It looks like the execution time is worse in the giant-query version.

  • Yes, there are some queries that ran for a long time (the bulk ones), and those appear in the top-twenty list. But the overall average is much lower with the giant-query execution: looking at the average times for all inserts before and after this change, they dropped from 250 ms to 2.5 ms (see the average query duration screenshot). The queries listed in the top 20 as most time-consuming are just outliers. Also, if those queries were really running that long overall, we would have seen prolonged IOPS spikes in the DB metrics, which is not happening in our case; instead, IOPS dropped from about 3000 to 2000 on average.