Fail fast or warn user if queries have mismatched column numbers

chadlwilson commented 2 years ago

Context / Goal

If the two queries expressed in a dataset have different numbers of columns, they cannot possibly produce matches, when doing a hash-based comparison.

We should ideally fail fast, or at the very least warn the user clearly in the results somehow.

Currently we do not do any query parsing at startup, as this is entirely delegated to the runtime drivers for the relevant databases. We also do not want to introduce a startup connectivity dependency on an given datasource.

Expected Outcome

Evaluate and implement some approach to addressing this

is there a simple way to do basic parsing of SQL queries at startup to at least count the number of columns?
- if people get fancy with database specific things like pivots and collations (not sure why) this would likely break down

easiest solution (but least performant) is to just run the whole rec, and when calculating/storing metadata check at this point. Warnings could just be inferred and attached to the returned run from the API. e.g something like the below (or at top level)

"summary": {
    "bothMatched": 0,
    "bothMismatched": 3,
    "sourceOnly": 0,
    "sourceTotal": 3,
    "targetOnly": 0,
    "targetTotal": 3,
    "total": 3,
    "warnings": [ "source query has different number of columns to target query. This is guaranteed to produce 100% mismatches." ]
},

next easiest would be to abort the rec after checking the first row of the target dataset. This would have implications for #41 however, and makes things a bit more complex, as the target parsing needs to understand something about the source parsing.

Out of Scope

Additional context / implementation notes

Related to #74 if we take the "warning" approach there

jiawen-tw commented 2 years ago

To decide on which option would we like to go with:

Option 1: Basic parsing of SQL query

There are sql parser libraries like https://github.com/JSQLParser/JSqlParser (though use of PIVOT would still fail)
Pros
- Fastest failure, no need to establish any connection to source and target databases if fail
Cons
- Support for various vendor-specific syntax is dependent one what is supported by parser library

Option 2: Run whole reconciliation run

After running the whole reconciliation run, compare number of columns in metadata, append mismatched columns as warning
Pros
- Easy to implement
Cons
- The entire run has to complete first before it would fail, which can waste a long time for large datasets

Option 3: After completing the run on source database, do the comparison with first row of target database

Finish the all the steps on the source database (including saving of hashed data)
Pros
- Earlier failure than option 2
Cons
- Later failure than option 1
- Hashing and saving unused data for source database

Option 4: Get the first element from both sources for validation, then all elements for reconciliation

Get the first element from both sources to compare their metadata, and continue with the reconciliation if the number of columns match,
Pros
- Fastest failure
- No saving of hashed data to database yet
Cons
- It might involve having a blocking statement in order to retrieve the first element from both sources to compare (is there a better way?)

Option 5: Build a separate api for validation of the query

User can run the validation api to ensure that their queries are correct before inserting into the application config
Pros:
- Fast feedback on query’s correctness
Cons:
- Need to build a separate api
- Up to the user to check and ensure their query is valid before running the reconciliation run

tommi-lew commented 2 years ago

Sharing my thoughts here:

Option 4 and 5 appears to be more elegant solutions as compared Option 2 and 3. Inclined to explore options 1, 4 and 5 further.
Do we want to strike out Option 1 or it is still worth evaluating? It appears to be a good solution if vendor-specific syntax does not cause much complexities.
Option 5 seems like a good candidate too. We could add a flag into the DB to indicate that the validation has executed, and during the recon run, verify the state if the validation. If the validation was not executed, do not continue, and if the validation did execute, continue to recon. And with this, issue #41 should no longer be a concerned .
Option 4 still looks worth evaluating. But I suspect I'm missing something critical considering my lack of understanding with reactive programming. Also, the complexities with issue #41 as mentioned.

tommi-lew commented 2 years ago

Suggestion

Building on top of Option 4, add a condition to limit to 1 row suggested by @ThaoDang

tommi-lew commented 2 years ago

How databases supported by r2dbc specify number of records to return:

Google Cloud Spanner - LIMIT. Refer to Google Standard SQL.
Oracle - No.
1. Pre-version 12c: ROWNUM together with WHERE
2. From version 12c onwards: Row Limiting Clause
H2 - LIMIT
Maria DB - LIMIT
MS SQL - No. Uses SELECT TOP
PostgreSQL - LIMIT

References

https://www.w3schools.com/sql/sql_top.asp

thoughtworks / recce

Fail fast or warn user if queries have mismatched column numbers #56