spring-cloud / spring-cloud-dataflow

A microservices-based Streaming and Batch data processing in Cloud Foundry and Kubernetes
https://dataflow.spring.io
Apache License 2.0
1.11k stars 580 forks source link

Controller /jobs/thinexecutions is very slow with a large amount of job executions and no navigation possible oracle db #5524

Closed klopfdreh closed 10 months ago

klopfdreh commented 1 year ago

Description: Currently we are using SCDF 2.11.0 and about to upgrade to 2.11.1. The JobExecutionThinController is very slow and we are receiving a timeout on a normal GET operations with a Oracle DB. It might be an issue with the upgrade to 2.11.x, because as of the changes there are several views introduced for example to get AGGREGATE_JOB_EXECUTION and AGGREGATE_JOB_INSTANCE. It might be an issue related to this.

After this issue the "Job Executions" Grid stays empty.

Release versions: 2.11.x

Custom apps: N/A

Steps to reproduce: Produce a large amout of Job-Executions and just click on Job executions in the UI.

Screenshots: image

Additional context:

"Task Executions" is working as expected.

klopfdreh commented 1 year ago

This query seems to be slow as it accesses several views:

SELECT *
FROM (SELECT E.JOB_EXECUTION_ID                        as JOB_EXECUTION_ID,
             E.START_TIME                              as START_TIME,
             E.END_TIME                                as END_TIME,
             E.STATUS                                  as STATUS,
             E.EXIT_CODE                               as EXIT_CODE,
             E.EXIT_MESSAGE                            as EXIT_MESSAGE,
             E.CREATE_TIME                             as CREATE_TIME,
             E.LAST_UPDATED                            as LAST_UPDATED,
             E.VERSION                                 as VERSION,
             I.JOB_INSTANCE_ID                         as JOB_INSTANCE_ID,
             I.JOB_NAME                                as JOB_NAME,
             T.TASK_EXECUTION_ID                       as TASK_EXECUTION_ID,
             E.SCHEMA_TARGET                           as SCHEMA_TARGET,
             (SELECT COUNT(*)
              FROM AGGREGATE_STEP_EXECUTION S
              WHERE S.JOB_EXECUTION_ID = E.JOB_EXECUTION_ID
                AND S.SCHEMA_TARGET = E.SCHEMA_TARGET) as STEP_COUNT
      FROM AGGREGATE_JOB_INSTANCE I
               JOIN AGGREGATE_JOB_EXECUTION E
                    ON I.JOB_INSTANCE_ID = E.JOB_INSTANCE_ID AND I.SCHEMA_TARGET = E.SCHEMA_TARGET
               JOIN AGGREGATE_TASK_BATCH B
                    ON E.JOB_EXECUTION_ID = B.JOB_EXECUTION_ID AND E.SCHEMA_TARGET = B.SCHEMA_TARGET
               JOIN AGGREGATE_TASK_EXECUTION T
                    ON B.TASK_EXECUTION_ID = T.TASK_EXECUTION_ID AND B.SCHEMA_TARGET = T.SCHEMA_TARGET
      ORDER BY E.JOB_EXECUTION_ID DESC)
WHERE ROWNUM <= 20
klopfdreh commented 1 year ago

For 35361 job execution entries when I remove ORDER BY E.JOB_EXECUTION_ID DESC the query execution time is increased from 2s501ms to 171ms maybe there is an index missing?

klopfdreh commented 1 year ago

Also you can't navigate through the job execution pages anymore.

org.springframework.jdbc.BadSqlGrammarException: PreparedStatementCallback; bad SQL grammar [SELECT E.JOB_EXECUTION_ID FROM (SELECT E.JOB_EXECUTION_ID, ROWNUM as TMP_ROW_NUM FROM (SELECT E.JOB_EXECUTION_ID FROM AGGREGATE_JOB_INSTANCE I JOIN AGGREGATE_JOB_EXECUTION E ON I.JOB_INSTANCE_ID=E.JOB_INSTANCE_ID AND I.SCHEMA_TARGET=E.SCHEMA_TARGET JOIN AGGREGATE_TASK_BATCH B ON E.JOB_EXECUTION_ID = B.JOB_EXECUTION_ID AND E.SCHEMA_TARGET = B.SCHEMA_TARGET JOIN AGGREGATE_TASK_EXECUTION T ON B.TASK_EXECUTION_ID = T.TASK_EXECUTION_ID AND B.SCHEMA_TARGET = T.SCHEMA_TARGET ORDER BY E.JOB_EXECUTION_ID DESC)) WHERE TMP_ROW_NUM = 20]; nested exception is java.sql.SQLSyntaxErrorException: ORA-00904: \"E\".\"JOB_EXECUTION_ID\": ungültige ID

klopfdreh commented 1 year ago

Side note: We have around 500k job executions in our database as of now. With this amount the job executions are not displayed anylonger.

klopfdreh commented 1 year ago

I did some tests and tried out the following:


SELECT *
FROM (SELECT E.JOB_EXECUTION_ID                        as JOB_EXECUTION_ID,
             E.START_TIME                              as START_TIME,
             E.END_TIME                                as END_TIME,
             E.STATUS                                  as STATUS,
             E.EXIT_CODE                               as EXIT_CODE,
             E.EXIT_MESSAGE                            as EXIT_MESSAGE,
             E.CREATE_TIME                             as CREATE_TIME,
             E.LAST_UPDATED                            as LAST_UPDATED,
             E.VERSION                                 as VERSION,
             I.JOB_INSTANCE_ID                         as JOB_INSTANCE_ID,
             I.JOB_NAME                                as JOB_NAME,
             T.TASK_EXECUTION_ID                       as TASK_EXECUTION_ID,
             E.SCHEMA_TARGET                           as SCHEMA_TARGET,
             (SELECT COUNT(*)
              FROM AGGREGATE_STEP_EXECUTION S
              WHERE S.JOB_EXECUTION_ID = E.JOB_EXECUTION_ID
                AND S.SCHEMA_TARGET = E.SCHEMA_TARGET) as STEP_COUNT,
             ROW_NUMBER() OVER (PARTITION BY E.JOB_EXECUTION_ID ORDER BY E.JOB_EXECUTION_ID desc) as rn
      FROM AGGREGATE_JOB_INSTANCE I
               JOIN AGGREGATE_JOB_EXECUTION E
                    ON I.JOB_INSTANCE_ID = E.JOB_INSTANCE_ID AND I.SCHEMA_TARGET = E.SCHEMA_TARGET
               JOIN AGGREGATE_TASK_BATCH B
                    ON E.JOB_EXECUTION_ID = B.JOB_EXECUTION_ID AND E.SCHEMA_TARGET = B.SCHEMA_TARGET
               JOIN AGGREGATE_TASK_EXECUTION T
                    ON B.TASK_EXECUTION_ID = T.TASK_EXECUTION_ID AND B.SCHEMA_TARGET = T.SCHEMA_TARGET
      ORDER BY E.JOB_EXECUTION_ID DESC)
where ROWNUM <= 20

by adding ROW_NUMBER() OVER (PARTITION BY E.JOB_EXECUTION_ID ORDER BY E.JOB_EXECUTION_ID desc) as rn the performance was increased drastically

Before: image

After: image

@cppwfs / @corneil - I hope this helps with the SQL analysis. 😀

klopfdreh commented 1 year ago

Hint this function is available for:

klopfdreh commented 1 year ago

For the issue https://github.com/spring-cloud/spring-cloud-dataflow/issues/5524#issuecomment-1774833293 I enabled the logging and it seems that the JdbcAggregateJobQueryDao is generating a paging query which is not valid:

{"timestamp":"2023-10-25T06:35:47.998+0200","level":"DEBUG","thread":"https-jsse-nio-8443-exec-3","logger":"org.springframework.cloud.dataflow.server.repository.JdbcAggregateJobQueryDao","message":"countJobExecutions:SELECT COUNT(1) from AGGREGATE_JOB_EXECUTION","context":"default"}
{"timestamp":"2023-10-25T06:35:47.998+0200","level":"DEBUG","thread":"https-jsse-nio-8443-exec-3","logger":"org.springframework.jdbc.core.JdbcTemplate","message":"Executing SQL query [SELECT COUNT(1) from AGGREGATE_JOB_EXECUTION]","context":"default"}
{"timestamp":"2023-10-25T06:35:48.000+0200","level":"DEBUG","thread":"https-jsse-nio-8443-exec-3","logger":"org.springframework.cloud.dataflow.server.repository.JdbcAggregateJobQueryDao","message":"queryJumpToItem:SELECT E.JOB_EXECUTION_ID FROM (SELECT E.JOB_EXECUTION_ID, ROWNUM as TMP_ROW_NUM FROM (SELECT E.JOB_EXECUTION_ID FROM AGGREGATE_JOB_INSTANCE I JOIN AGGREGATE_JOB_EXECUTION E ON I.JOB_INSTANCE_ID=E.JOB_INSTANCE_ID AND I.SCHEMA_TARGET=E.SCHEMA_TARGET JOIN AGGREGATE_TASK_BATCH B ON E.JOB_EXECUTION_ID = B.JOB_EXECUTION_ID AND E.SCHEMA_TARGET = B.SCHEMA_TARGET JOIN AGGREGATE_TASK_EXECUTION T ON B.TASK_EXECUTION_ID = T.TASK_EXECUTION_ID AND B.SCHEMA_TARGET = T.SCHEMA_TARGET ORDER BY E.JOB_EXECUTION_ID DESC)) WHERE TMP_ROW_NUM = 20:20:20:[]","context":"default"}
{"timestamp":"2023-10-25T06:35:48.000+0200","level":"DEBUG","thread":"https-jsse-nio-8443-exec-3","logger":"org.springframework.jdbc.core.JdbcTemplate","message":"Executing prepared SQL query","context":"default"}
{"timestamp":"2023-10-25T06:35:48.001+0200","level":"DEBUG","thread":"https-jsse-nio-8443-exec-3","logger":"org.springframework.jdbc.core.JdbcTemplate","message":"Executing prepared SQL statement [SELECT E.JOB_EXECUTION_ID FROM (SELECT E.JOB_EXECUTION_ID, ROWNUM as TMP_ROW_NUM FROM (SELECT E.JOB_EXECUTION_ID FROM AGGREGATE_JOB_INSTANCE I JOIN AGGREGATE_JOB_EXECUTION E ON I.JOB_INSTANCE_ID=E.JOB_INSTANCE_ID AND I.SCHEMA_TARGET=E.SCHEMA_TARGET JOIN AGGREGATE_TASK_BATCH B ON E.JOB_EXECUTION_ID = B.JOB_EXECUTION_ID AND E.SCHEMA_TARGET = B.SCHEMA_TARGET JOIN AGGREGATE_TASK_EXECUTION T ON B.TASK_EXECUTION_ID = T.TASK_EXECUTION_ID AND B.SCHEMA_TARGET = T.SCHEMA_TARGET ORDER BY E.JOB_EXECUTION_ID DESC)) WHERE TMP_ROW_NUM = 20]","context":"default"}

As you can see the query refers two times to E.JOB_EXECUTION_ID which is not present in the outer select. I think this is caused because of the sortKeys but I am not 100% sure.

For the performance I am trying out something like:

    private PagingQueryProvider getPagingQueryProvider(String fields, String fromClause, String whereClause, Map<String, Order> sortKeys) throws Exception {
        SqlPagingQueryProviderFactoryBean factory = new SqlPagingQueryProviderFactoryBean();
        factory.setDataSource(dataSource);
        fromClause = "AGGREGATE_JOB_INSTANCE I JOIN AGGREGATE_JOB_EXECUTION E ON I.JOB_INSTANCE_ID=E.JOB_INSTANCE_ID AND I.SCHEMA_TARGET=E.SCHEMA_TARGET" + (fromClause == null ? "" : " " + fromClause);
        factory.setFromClause(fromClause);
        if (fields == null) {
            fields = FIELDS;
        }
        if (fields.contains("E.JOB_EXECUTION_ID")) {
            Order order = sortKeys.get("E.JOB_EXECUTION_ID");
            String orderString = Optional.ofNullable(order).map(orderKey -> orderKey == Order.DESCENDING ? "DESC" : "ASC").orElse("DESC");
            fields += ", ROW_NUMBER() OVER (PARTITION BY E.JOB_EXECUTION_ID ORDER BY E.JOB_EXECUTION_ID " + orderString + ") as RN";
        }
        factory.setSelectClause(fields);
        if (sortKeys.isEmpty()) {
            sortKeys = Collections.singletonMap("E.JOB_EXECUTION_ID", Order.DESCENDING);
        }
        factory.setSortKeys(sortKeys);
        factory.setWhereClause(whereClause);

        return factory.getObject();
    }
klopfdreh commented 1 year ago

Good news - I fixed it.

Bad news - I guess it is an issue in Spring Batch 4.

Here are all fixes I applied:

  1. I applied the following lines in the JdbcAggregateJobQueryDao as described in the previous comment.

    if (fields.contains("E.JOB_EXECUTION_ID")) {
    Order order = sortKeys.get("E.JOB_EXECUTION_ID");
    String orderString = Optional.ofNullable(order).map(orderKey -> orderKey == Order.DESCENDING ? "DESC" : "ASC").orElse("DESC");
    fields += ", ROW_NUMBER() OVER (PARTITION BY E.JOB_EXECUTION_ID ORDER BY E.JOB_EXECUTION_ID " + orderString + ") as RN";
    }
  2. In Spring Batch 4.x there is an issue in the OraclePagingQueryProvider https://github.com/spring-projects/spring-batch/blob/4.3.x/spring-batch-infrastructure/src/main/java/org/springframework/batch/item/database/support/OraclePagingQueryProvider.java#L50 getSortKeySelect() is applied for generateRowNumSqlQueryWithNesting as innerSelectClause and outerSelectClause which is wrong. outerSelectClause should use the getSortKeysWithoutAliases() and the keys should be joined in a helper method similar to getSortKeySelect()

  3. In SqlPagingQueryUtils https://github.com/spring-projects/spring-batch/blob/4.3.x/spring-batch-infrastructure/src/main/java/org/springframework/batch/item/database/support/SqlPagingQueryUtils.java#L293 the method buildSortConditions is using getSortKeys() as well which is not working because the outer query does not have a reference to the qualifier of the join of the inner query. Here also provider.getSortKeysWithoutAliases().entrySet() should be used.

onobc commented 1 year ago

Thanks @klopfdreh for all of the sleuthing and solutions. We will be taking a look soon and incorporating your findings. Much appreciated.

onobc commented 1 year ago

I split the other issue off into its own here https://github.com/spring-cloud/spring-cloud-dataflow/issues/5531

onobc commented 11 months ago

Hi @klopfdreh , getting back to this now.

On https://github.com/spring-cloud/spring-cloud-dataflow/issues/5524#issuecomment-1778630171, items 2 and 3 are for the related issue #5531, correct? You said good news is you fixed it. I can see how item 1 fixes the perf issue, but what did you do for 2 and 3?

Thanks

klopfdreh commented 11 months ago

Hey @onobc,

On https://github.com/spring-cloud/spring-cloud-dataflow/issues/5524#issuecomment-1778630171, items 2 and 3 are for the related issue https://github.com/spring-cloud/spring-cloud-dataflow/issues/5531, correct?

yes this is correct - because of 2 and 3 the navigation was not possible

You said good news is you fixed it. I can see how item 1 fixes the perf issue,

It is related to Oracle and you can use the Oracle Optimizer to apply performance hints so that the queries based on that newly created query can be performed faster

, but what did you do for 2 and 3?

I described it without actual changes in the SCDF code as the changes are Spring Batch related. Because of this I pointed to the classes and wrote what I changed. Example for 2.:

  1. In Spring Batch 4.x there is an issue in the OraclePagingQueryProvider https://github.com/spring-projects/spring-batch/blob/4.3.x/spring-batch-infrastructure/src/main/java/org/springframework/batch/item/database/support/OraclePagingQueryProvider.java#L50 getSortKeySelect() is applied for generateRowNumSqlQueryWithNesting as innerSelectClause and outerSelectClause which is wrong. outerSelectClause should use the getSortKeysWithoutAliases() and the keys should be joined in a helper method similar to getSortKeySelect()

Here are how the changes are applied:


    @Override
    public String generateJumpToItemQuery(int itemIndex, int pageSize) {
        int page = itemIndex / pageSize;
        int offset = (page * pageSize);
        offset = offset == 0 ? 1 : offset;
        String sortKeySelect = this.getSortKeySelect();
        String sortKeyOuterSelect = this.getSortKeyOuterSelect();
        return SqlPagingQueryUtils.generateRowNumSqlQueryWithNesting(this, sortKeySelect, sortKeyOuterSelect, false, "TMP_ROW_NUM = "
                + offset);
    }

       /* ...... */
    private String getSortKeyOuterSelect() {
        StringBuilder sql = new StringBuilder();
        String prefix = "";

        for (Map.Entry<String, Order> sortKey : this.getSortKeysWithoutAliases().entrySet()) {
            sql.append(prefix);
            prefix = ", ";
            sql.append(sortKey.getKey());
        }

        return sql.toString();
    }
       /* ....... */

The issue here is that the outer query can't perform the sort keys with alias as the alias is only available in the inner select.

For 3. The changes are also written with words but no actual code changes but can be read the same way.

onobc commented 10 months ago

Closing via https://github.com/spring-cloud/spring-cloud-dataflow/commit/9f9fdc27e6db7768d1cc773a95c4012f0dbb1320

klopfdreh commented 9 months ago

@onobc working like expected- great job 👍

onobc commented 9 months ago

@onobc working like expected- great job 👍

Are you kidding me!?, you did all the great sleuthing - I just implemented your suggestions 😸

seonuk commented 4 months ago

It seems like the issue with Task and Task Execution is the same. Hasn't it been fixed yet? I'm currently using MariaDB

cppwfs commented 4 months ago

This update is now available in 2.11.3.

seonuk commented 4 months ago

@cppwfs

It seems like there might be a similar issue even after upgrading to version 2.11.3. Any idea what could be the possible cause?

image

image