sodadata / soda-core

:zap: Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io
https://go.soda.io/core-docs
Apache License 2.0
1.91k stars 209 forks source link

On Oracle, checks involving regular expressions are generated incorrectly. #2125

Open stelapo opened 4 months ago

stelapo commented 4 months ago

Hi, I report a recent bug introduced after version 3.3.6; it is not there in that version.

With the Oracle datasource, checks via regex are resolved by Soda generating a query that misuses the NVL function by applying it to the REGEX_LIKE function.

I explain better with an example.

The check is:

checks for BENEFICIARIO_LIGHT:
  - invalid_percent(TELEFONO) = 0:
      name: Invalid beneficiary telephone number
      valid format: phone number

Wrong query generated by Soda (version 3.3.9):

SELECT
  COUNT(*),
  COUNT(CASE WHEN NOT (TELEFONO IS NULL) AND NOT (NVL(REGEXP_LIKE(TELEFONO, '^((\+[0-9]{1,2}\s)?\(?[0-9]{3}\)?[\s.-])?[0-9]{3}[\s.-][0-9]{4}$'), 0)) THEN 1 END)
FROM BENEFICIARIO_LIGHT

Corrected query generated by Soda (version 3.3.6):

SELECT
  COUNT(*),
  COUNT(CASE WHEN NOT (TELEFONO IS NULL) AND NOT (REGEXP_LIKE(TELEFONO, '^((\+[0-9]{1,2}\s)?\(?[0-9]{3}\)?[\s.-])?[0-9]{3}[\s.-][0-9]{4}$')) THEN 1 END) 
FROM BENEFICIARIO_LIGHT

As can be seen, the use of the NVL function was mistakenly introduced in the first query.

All forms of checks involving regular expressions are generated incorrectly; another example:

checks for BENEFICIARIO_LIGHT:
  - invalid_percent(EMAIL) = 0:
      name: Invalid beneficiary email
      valid regex: ${EMAIL_REGEX}

Thank you

tools-soda commented 4 months ago

CLOUD-8028

stelapo commented 4 months ago

Update: today I tried version 3.3.10 and the problem persists.