kagkarlsson / db-scheduler

Persistent cluster-friendly scheduler for Java
Apache License 2.0
1.25k stars 191 forks source link

feat: Support lock-and-fetch for MSSQL. Explicit LIMIT for all. #371

Closed kagkarlsson closed 1 year ago

kagkarlsson commented 1 year ago

Fixes

Further work / fix later

Reminders


cc @kagkarlsson

kagkarlsson commented 1 year ago

Remaining to test against MSSQL. Verified to work with postgres using new generic method that is transaction-based.

kagkarlsson commented 1 year ago

This is a bit tricky to test since Testcontainers does not seem to work on M1 Macs and testing via Github Actions sporadically fails. I have done some local testing using AzureSQLEdge, and the lock-and-fetch strategy seem to be working. If anyone using MSSQL could try this branch out that would be helpful.

With a little luck this might fix the deadlock issues experienced using the default polling strategy. I have gotten the deadlock locally (AzureSQLEdge) using default polling, but not using lock-and-fetch.

Is it possible for you do help out verifying this change for MSSQL @rafaelhofmann ?

This is the query currently in use:

    @Override
    public String createGenericSelectForUpdateQuery(String tableName, int limit, String requiredAndCondition) {
        return "select TOP "+limit+" * from " + tableName +
            " WITH (readpast,rowlock) where picked = ? and execution_time <= ? " + requiredAndCondition +
            " order by execution_time asc ";
    }
kagkarlsson commented 1 year ago

I added with (readpast) to the regular query as well for mssql (locally), and now I do not get the dead-lock issues @rafaelhofmann

rafaelhofmann commented 1 year ago

Hi @kagkarlsson

I will test this new branch and check if we still have problems with the indexes.

kagkarlsson commented 1 year ago

I have enabled both polling-strategies for MSSQL in this branch. Both seem to run successfully on GA via MssqlCompatibilityTest

Additionally, I added hints to skip locked to default-query as well since deadlocks were causing sporadic test-failures (for MSSQL)

rafaelhofmann commented 1 year ago

Hi @kagkarlsson,

Do you have a published snapshot version that we can use to test or should we build our own version?

Thanks

kagkarlsson commented 1 year ago

Currently you would need to build your own. Let me know if that is a problem.

rafaelhofmann commented 1 year ago

I have built the branch and upgraded our ACPT system (including all the indexes that were deleted in the past) to the new release, but we are running into the same problem as before: A deadlock about once every hour, when we have a load of 300 tasks/min. I have been experimenting with a query that combines the SELECT and UPDATE in one, but I am not quite happy yet with the query hint for MSSQL:

String selectForUpdateQuery =
            " UPDATE " + ctx.tableName +
                " SET " + ctx.tableName + ".picked = ?, " +
                "     " + ctx.tableName + ".picked_by = ?, " +
                "     " + ctx.tableName + ".last_heartbeat = ?, " +
                "     " + ctx.tableName + ".version = " + ctx.tableName + ".version + 1 " +
                " OUTPUT [inserted].* " +
                " FROM ( " +
                "   SELECT TOP(?) ist2.task_name, ist2.task_instance " +
                "   FROM " + ctx.tableName + " ist2 WITH (ROWLOCK, READPAST) " +
                "   WHERE ist2.picked = ? AND ist2.execution_time <= ? " + unresolvedFilter.andCondition() +
                "   ORDER BY ist2.execution_time ASC " +
                " ) AS st2 " +
                " WHERE st2.task_name = " + ctx.tableName + ".task_name " +
                "   AND st2.task_instance = " + ctx.tableName + ".task_instance";

In theory this should do the same as the lock-and-fetch query from PostgreSQL.

kagkarlsson commented 1 year ago

Ok, thanks. Did you try both lock-and-fetch and fetch-and-lock-on-execute? I think they should both work for MSSQL now. (lock-and-fetch for MSSQL is select-for-update and update spanned by a transaction)

kagkarlsson commented 1 year ago

Your combined query looks interesting

rafaelhofmann commented 1 year ago

We had deadlocks with the lock-and-fetch strategy, but so far no errors with the default fetch strategy after four hours of running. Will keep this running for a few more days to verify.

kagkarlsson commented 1 year ago

How has the fetch strategy run?

rafaelhofmann commented 1 year ago

The fetch strategy runs without any deadlock or error for the last 30 days with all the recommended indexes.

kagkarlsson commented 1 year ago

Ok, that is good news. I was hoping for good news on the other strategy as well, but I will take 1 out of 2 :).

This means you are happy with the changes in the PR and would like to have it merged?

kgunnerud commented 12 months ago

Great fix! Any ETA on release with this fix?

kagkarlsson commented 11 months ago

Good question, nothing blocking it from being released, just time I guess :). Hopefully I can release it within 1-2 weeks

kagkarlsson commented 11 months ago

🎉 This issue has been resolved in v13.0.0 (Release Notes)

ChristianCiach commented 9 months ago

Hi! I am currently evaluating db-scheduler for our use-cases and doing some code reviews in the process.

I see that the MariaDBJdbcCustomization, now uses FOR UPDATE SKIP LOCKED for SELECT-queries even though MariaDB only supports SKIP LOCKED since MariaDB 10.6, according to the documentation: https://mariadb.com/kb/en/select/#skip-locked

I don't particularly care about older versions of MariaDB, but I just want to let you know that the versions 10.4 and 10.5 are technically still supported, according to https://mariadb.com/kb/en/mariadb-server-release-dates/. I see you already differentiate the versions of MySQL, so maybe you want to do the same for MariaDB.

Anyway, thank you for your work! I like the simplicity of db-scheduler and I hope we can use it to eventually replace Quartz, which seems to be completely dead by now.

kagkarlsson commented 4 months ago

I had some issue when testing the support in MySQL (or MariaDB) and limited time, so I opted for disabling the support until someone had the time to verify it works as expected. Feel free to fork and try it out.

  @Override
  public boolean supportsGenericLockAndFetch() {
    // FIXLATER: fix syntax and enable it for versions of MariaDB that supports it
    return false;
  }