sodadata / soda-core

:zap: Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io
https://go.soda.io/core-docs
Apache License 2.0
1.91k stars 208 forks source link

Fetch generated SQL queries for the checks configured in context #1906

Open scchengaiah opened 1 year ago

scchengaiah commented 1 year ago

Dear All,

I want to extract the SQL queries generated for every check defined in the checks.yml file via Python API.

I could able to visualize the SQL queries when I set the Verbose flag to True in the scan results JSON. However, I could not connect these generated SQL's with the checks. I could not find any id or reference information that indicates the SQL is for a particular check.

Could you please suggest if there is an approach to handle this ?

Please find below the initial analysis. Expectation is to iterate through the checks array and fetch all the failed checks. For each of the failed check, extract the SQL generated.

Unable to identify a common relation between the check object attributes and the log generated. Any standard pattern to fetch this information can be helpful as well 🙂.

checks.yml:

checks for administration$account:
  - row_count = 0

  - row_count_user_defined = 0:
      row_count_user_defined query: |
        SELECT COUNT(*)
        FROM administration$account

checks for system$userrole:
  # https://docs.soda.io/soda-cl/validity-metrics.html#specify-valid-or-invalid-values
  - invalid_count(name) = 0:
      valid values: [Administrator]

JSON Logs:

{
  "definitionName": "Scan from Soda API library",
  "defaultDataSource": "postgres",
  "dataTimestamp": "2023-07-11T09:43:20+00:00",
  "scanStartTimestamp": "2023-07-11T09:43:20+00:00",
  "scanEndTimestamp": "2023-07-11T09:43:20+00:00",
  "hasErrors": false,
  "hasWarnings": false,
  "hasFailures": true,
  "metrics": [
    {
      "identity": "metric-Scan from Soda API library-postgres-system$userrole-name-invalid_count-7a0bcc50",
      "metricName": "invalid_count",
      "value": 2,
      "dataSourceName": "postgres"
    },
    {
      "identity": "metric-Scan from Soda API library-postgres-administration$account-row_count",
      "metricName": "row_count",
      "value": 0,
      "dataSourceName": "postgres"
    },
    {
      "identity": "metric-Scan from Soda API library-postgres-row_count_user_defined = 0-1cc32be5",        
      "metricName": "row_count_user_defined = 0",
      "value": 0.0,
      "dataSourceName": "postgres"
    }
  ],
  "checks": [
    {
      "identity": "fe9ac1fb",
      "name": "row_count = 0",
      "type": "metricThreshold",
      "definition": "checks for administration$account:\n  row_count = 0",
      "resourceAttributes": [],
      "location": {
        "filePath": "d:\\gitlab\\data-lake\\data-lake\\python_scripts\\data_quality\\resources\\checks.yml",
        "line": 2,
        "col": 5
      },
      "dataSource": "postgres",
      "table": "administration$account",
      "filter": null,
      "column": null,
      "metrics": [
        "metric-Scan from Soda API library-postgres-administration$account-row_count"
      ],
      "outcome": "pass",
      "outcomeReasons": [],
      "archetype": null
    },
    {
      "identity": "59ed27b7",
      "name": "row_count_user_defined = 0",
      "type": "metricThreshold",
      "definition": "checks for administration$account:\n  - row_count_user_defined = 0:\n      row_count_user_defined query: |\n        SELECT COUNT(*)\n        FROM administration$account\n\n",
      "resourceAttributes": [],
      "location": {
        "filePath": "d:\\gitlab\\data-lake\\data-lake\\python_scripts\\data_quality\\resources\\checks.yml",
        "line": 4,
        "col": 5
      },
      "dataSource": "postgres",
      "table": "administration$account",
      "filter": null,
      "column": null,
      "metrics": [
        "metric-Scan from Soda API library-postgres-row_count_user_defined = 0-1cc32be5"
      ],
      "outcome": "pass",
      "outcomeReasons": [],
      "archetype": null
    },
    {
      "identity": "d8f910fb",
      "name": "invalid_count(name) = 0",
      "type": "metricThreshold",
      "definition": "checks for system$userrole:\n  - invalid_count(name) = 0:\n      valid values: [Administrator]\n",
      "resourceAttributes": [],
      "location": {
        "filePath": "d:\\gitlab\\data-lake\\data-lake\\python_scripts\\data_quality\\resources\\checks.yml",
        "line": 11,
        "col": 5
      },
      "dataSource": "postgres",
      "table": "system$userrole",
      "filter": null,
      "column": "name",
      "metrics": [
        "metric-Scan from Soda API library-postgres-system$userrole-name-invalid_count-7a0bcc50"
      ],
      "outcome": "fail",
      "outcomeReasons": [],
      "archetype": null
    }
  ],
  "automatedMonitoringChecks": [],
  "profiling": [],
  "metadata": [],
  "logs": [
    {
      "level": "INFO",
      "message": "Soda Core 3.0.22",
      "timestamp": "2023-07-11T09:43:20+00:00",
      "index": 1,
      "doc": null,
      "location": null
    },
    {
      "level": "DEBUG",
      "message": "Scan execution starts",
      "timestamp": "2023-07-11T09:43:20+00:00",
      "index": 2,
      "doc": null,
      "location": null
    },
    {
      "level": "DEBUG",
      "message": "Postgres connection properties: host=\"localhost\", port=\"5432\", database=\"mendix\", user=\"postgres\", options=\"-c search_path=public\", connection_timeout=\"None\"",
      "timestamp": "2023-07-11T09:43:20+00:00",
      "index": 3,
      "doc": null,
      "location": null
    },
    {
      "level": "DEBUG",
      "message": "Query postgres.administration$account.aggregation[0]:\nSELECT \n  COUNT(*) \nFROM public.administration$account",
      "timestamp": "2023-07-11T09:43:20+00:00",
      "index": 4,
      "doc": null,
      "location": null
    },
    {
      "level": "DEBUG",
      "message": "Query postgres.system$userrole.aggregation[0]:\nSELECT \n  COUNT(CASE WHEN NOT (name IS NULL) AND NOT (name IN ('Administrator')) THEN 1 END) \nFROM public.system$userrole",
      "timestamp": "2023-07-11T09:43:20+00:00",
      "index": 5,
      "doc": null,
      "location": null
    },
    {
      "level": "DEBUG",
      "message": "Query postgres.name.failed_rows[invalid_count]:\nSELECT * FROM public.system$userrole \n WHERE NOT name IS NULL AND NOT name IN ('Administrator') \n LIMIT 100",
      "timestamp": "2023-07-11T09:43:20+00:00",
      "index": 6,
      "doc": null,
      "location": null
    },
    {
      "level": "INFO",
      "message": "Using DefaultSampler",
      "timestamp": "2023-07-11T09:43:20+00:00",
      "index": 7,
      "doc": null,
      "location": null
    },
    {
      "level": "DEBUG",
      "message": "Query postgres.user_defined_query[row_count_user_defined = 0]:\nSELECT COUNT(*)\nFROM administration$account",
      "timestamp": "2023-07-11T09:43:20+00:00",
      "index": 8,
      "doc": null,
      "location": null
    },
    {
      "level": "INFO",
      "message": "Scan summary:",
      "timestamp": "2023-07-11T09:43:20+00:00",
      "index": 9,
      "doc": null,
      "location": null
    },
    {
      "level": "DEBUG",
      "message": "4/4 queries OK",
      "timestamp": "2023-07-11T09:43:20+00:00",
      "index": 10,
      "doc": null,
      "location": null
    },
    {
      "level": "DEBUG",
      "message": "  postgres.administration$account.aggregation[0] [OK] 0:00:00.015623",
      "timestamp": "2023-07-11T09:43:20+00:00",
      "index": 11,
      "doc": null,
      "location": null
    },
    {
      "level": "DEBUG",
      "message": "  postgres.system$userrole.aggregation[0] [OK] 0:00:00",
      "timestamp": "2023-07-11T09:43:20+00:00",
      "index": 12,
      "doc": null,
      "location": null
    },
    {
      "level": "DEBUG",
      "message": "  postgres.name.failed_rows[invalid_count] [OK] 0:00:00",
      "timestamp": "2023-07-11T09:43:20+00:00",
      "index": 13,
      "doc": null,
      "location": null
    },
    {
      "level": "DEBUG",
      "message": "  postgres.user_defined_query[row_count_user_defined = 0] [OK] 0:00:00",
      "timestamp": "2023-07-11T09:43:20+00:00",
      "index": 14,
      "doc": null,
      "location": null
    },
    {
      "level": "INFO",
      "message": "2/3 checks PASSED: ",
      "timestamp": "2023-07-11T09:43:20+00:00",
      "index": 15,
      "doc": null,
      "location": null
    },
    {
      "level": "INFO",
      "message": "    administration$account in postgres",
      "timestamp": "2023-07-11T09:43:20+00:00",
      "index": 16,
      "doc": null,
      "location": null
    },
    {
      "level": "INFO",
      "message": "      row_count = 0 [d:\\gitlab\\data-lake\\data-lake\\python_scripts\\data_quality\\resources\\checks.yml] [PASSED]",
      "timestamp": "2023-07-11T09:43:20+00:00",
      "index": 17,
      "doc": null,
      "location": null
    },
    {
      "level": "INFO",
      "message": "        check_value: 0",
      "timestamp": "2023-07-11T09:43:20+00:00",
      "index": 18,
      "doc": null,
      "location": null
    },
    {
      "level": "INFO",
      "message": "      row_count_user_defined = 0 [d:\\gitlab\\data-lake\\data-lake\\python_scripts\\data_quality\\resources\\checks.yml] [PASSED]",
      "timestamp": "2023-07-11T09:43:20+00:00",
      "index": 19,
      "doc": null,
      "location": null
    },
    {
      "level": "INFO",
      "message": "        check_value: 0.0",
      "timestamp": "2023-07-11T09:43:20+00:00",
      "index": 20,
      "doc": null,
      "location": null
    },
    {
      "level": "INFO",
      "message": "1/3 checks FAILED: ",
      "timestamp": "2023-07-11T09:43:20+00:00",
      "index": 21,
      "doc": null,
      "location": null
    },
    {
      "level": "INFO",
      "message": "    system$userrole in postgres",
      "timestamp": "2023-07-11T09:43:20+00:00",
      "index": 22,
      "doc": null,
      "location": null
    },
    {
      "level": "INFO",
      "message": "      invalid_count(name) = 0 [d:\\gitlab\\data-lake\\data-lake\\python_scripts\\data_quality\\resources\\checks.yml] [FAILED]",
      "timestamp": "2023-07-11T09:43:20+00:00",
      "index": 23,
      "doc": null,
      "location": null
    },
    {
      "level": "INFO",
      "message": "        check_value: 2",
      "timestamp": "2023-07-11T09:43:20+00:00",
      "index": 24,
      "doc": null,
      "location": null
    },
    {
      "level": "INFO",
      "message": "Oops! 1 failures. 0 warnings. 0 errors. 2 pass.",
      "timestamp": "2023-07-11T09:43:20+00:00",
      "index": 25,
      "doc": null,
      "location": null
    }
  ]
}
jmarien commented 1 year ago

SODA-1786

scchengaiah commented 1 year ago

SODA-1786 @jmarien Could you please share the respective link or elaborate ? Is there an issue already opened for this.

benjamin-pirotte commented 1 year ago

Hi @scchengaiah, thank you for reaching out!

The previous message was an automatic message triggered by the integration with our ticketing system.

I will reach out to the engineering team to see if there is a current workaround to achieve what you are describing.

Out of curiosity, could you let me know why you want to extract the generated SQL?

Thanks a lot!

scchengaiah commented 1 year ago

@benjamin-pirotte : Thank you for the response.

We want to use SQL parsers to detect the tables and the columns involved from the generated SQL query and link to the failed checks.

We believe that this can help our data analysts to quickly get the relevant information on a particular check in case of complex SQL statements. Also, the generated SQL shall be executed manually which can help to decide the next set of actions.

scchengaiah commented 1 year ago

@benjamin-pirotte : Do we have any update from the engineering team ?