open-metadata / OpenMetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
https://open-metadata.org
Apache License 2.0
5.45k stars 1.04k forks source link

Add ingestion of MsSQL procedures to OpenMetaData #9269

Closed stepanof closed 9 months ago

stepanof commented 1 year ago

Is your feature request related to a problem? Please describe. We use procedures on MsSQL servers in our ETL processes. If the procedure was changed so our ETL can fail and we won't know the reason. So it will be really helpfull if OM could send alert if procedures code was changed.

Describe the solution you'd like OM can ingest not only tables metadata but also metadata of procedures on MsSQL servers. Also procedures may be ingested as pipelines.

Ossy81 commented 1 year ago

Hi is the now added to a release? Or when will this feature added?

phalix commented 1 year ago

Hi @harshach, how is such a feature implementation decided? Conceptually: Is a procedure a pipeline or should it be implemented simarily to views? I see a problem with the latter, since a procedure must not have data as a result. And the former is a problem since MSSQL (or any other database with procedures) is of databaseservice and not of pipelineservice. Do you have any suggestions of how to proceed with this feature?

proverbs53 commented 1 year ago

I would certainly advocate the possibility for MSSQL (or any other database with procedures) to be (also) as of type pipelineservice. As stated by @phalix , stored procedures do not generate a (meaningful) resultset in most cases. Also for us this is a feature we're very much looking forward to.

harshach commented 1 year ago

@phalix @proverbs53 the feature is already underway for 1.2. Please look at this ticket for details https://github.com/open-metadata/OpenMetadata/issues/12998

k3agan commented 1 year ago

Let's close this, as Harsha said it's already part of 1.2 release

pmbrull commented 9 months ago

hi @stepanof I am looking into this. I have figured out how to list the stored procedures, get their definition etc. I am looking now on the best way to figure out when was each Stored Procedure executed (similar to how we get Snowflake Stored Procedure lineage, for example).

Looks like we should be able to use the sys.dm_exec_procedure_stats table, but it only stores the last execution for each procedure. This means we might be missing some info.

LMK if you have any pointers here. Will keep looking meanwhile.

UPDATE: To give some more details, we usually do:

  1. Get the list of queries ✅
  2. Get the list of EXEC calls 🔴 -> That's what I am missing right now
  3. Cross both info to check which queries the EXEC calls are triggering

Thanks

pmbrull commented 9 months ago

ok so the final query to extract Stored Procedure lineage based on the executed queries will be this so far:

WITH SP_HISTORY (start_time, end_time, procedure_name, query_text) AS (
select 
  s.last_execution_time start_time,
  DATEADD(s, s.total_elapsed_time/1000, s.last_execution_time) end_time,
  OBJECT_NAME(object_id, database_id) as procedure_name,
  text as query_text
from sys.dm_exec_procedure_stats s
CROSS APPLY sys.dm_exec_sql_text(s.plan_handle)
WHERE OBJECT_NAME(object_id, database_id) IS NOT NULL
),
Q_HISTORY (database_name, query_text, start_time, end_time, duration,query_type, schema_name, user_name) AS (
  select    
    db.NAME database_name,
    t.text query_text,
    s.last_execution_time start_time,
    DATEADD(s, s.total_elapsed_time/1000, s.last_execution_time) end_time,
    s.total_elapsed_time/1000 duration,
    case
        when t.text LIKE '%%MERGE%%' then 'MERGE'
        when t.text LIKE '%%UPDATE%%' then 'UPDATE'
        when t.text LIKE '%%SELECT%%INTO%%' then 'CREATE_TABLE_AS_SELECT'
        when t.text LIKE '%%INSERT%%' then 'INSERT'
    else 'UNKNOWN' end query_type,
    NULL schema_name,
    NULL user_name
  FROM sys.dm_exec_cached_plans AS p
  INNER JOIN sys.dm_exec_query_stats AS s
    ON p.plan_handle = s.plan_handle
  CROSS APPLY sys.dm_exec_sql_text(p.plan_handle) AS t
  INNER JOIN sys.databases db
    ON db.database_id = t.dbid
  WHERE s.last_execution_time between '2024-01-13' and '2024-01-20'
  AND t.text NOT LIKE '/* {{"app": "OpenMetadata", %%}} */%%'
  AND t.text NOT LIKE '/* {{"app": "dbt", %%}} */%%'
  AND p.objtype NOT IN ('Prepared', 'Proc')
)
select 
  Q.query_type AS QUERY_TYPE,
  Q.database_name  AS QUERY_DATABASE_NAME,
  Q.schema_name AS QUERY_SCHEMA_NAME,
  Q.query_text AS QUERY_TEXT,
  Q.user_name AS QUERY_USER_NAME,
  Q.start_time AS QUERY_START_TIME,
  Q.duration AS QUERY_DURATION,
  SP.procedure_name AS PROCEDURE_NAME,
  SP.query_text AS PROCEDURE_TEXT,
  SP.start_time AS PROCEDURE_START_TIME,
  SP.end_time AS PROCEDURE_END_TIME
from SP_HISTORY SP
JOIN Q_HISTORY Q
  ON (
    Q.start_time BETWEEN SP.start_time and SP.end_time
    OR Q.end_time BETWEEN SP.start_time and SP.end_time
    )
order by PROCEDURE_START_TIME desc
;

We're trying to join the time an SP was executed vs. the queries executed during that time as well. Other dbs helps us narrow this down further by joining via SESSION_ID, which would help us reduce false positives.

Did not find a way to get that in MSSQL, so will do this first iteration.

If anyone has a better idea to link SP <> Query so that we can parse the lineage, please let us know.

Thanks