prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
15.76k stars 5.29k forks source link

Query Audit And Governance Support in Presto #19041

Open skairali opened 1 year ago

skairali commented 1 year ago

This is a request for consideration for a new plugin type that can deal with Audit and Governance

As the title indicates this new type of plugin should be able to do

1) Query Audit 2) Governance

This plugin could for example rewrite the query based on policies

  1. By default this could be in a None mode - which be AS IS
  2. There can be a default implementation which can transform the queries based on a POLICY.JSON. This would enable anybody who wants to add some basic rules in the engine.
  3. Anybody could write their implementations to handle the governance in the way they want For example :

This plugin type could be integrated to presto just after the parser is executed and preparedQuery is available.

rschlussel commented 1 year ago

Thanks for opening this issue. We have a similar need and are working on refactoring to make the analyzer pluggable and provide a clean interface between the analyzer and planner. This works for us, as we have an internal analyzer that has those capabilities, and is already used for some other internal services (it will be open sourced eventually, but it will be a little bit until we get there).

do you have ideas about what you'd need from the plugin interface? It sounds like you're thinking to put this after the query is parsed, and before it is analyzed.

@jainxrohit will this be affected at all by the analyzer refactoring?

Also cc: @mlyublena @highker

jainxrohit commented 1 year ago

Thanks @rschlussel for cc and @skairali for creating the issue.

This is interesting use-case, and we should discuss more to understand it better. This I believe belong to QueryRewrite framework, which we are internally discussing at this point and figuring out how to build this. This should ideally be independent of pluggable analyzer work which we are doing however we will know more once we understand more about the requested feature more.

skairali commented 1 year ago

@rschlussel Yes. Exactly. after query parsed and before its analysed. I can come back with more details in few. Happy to be part of such changes/ discussions. @jainxrohit

skairali commented 1 year ago

Let me try to indulge in an effort to elaborate the requirements

What we are proposing is a new plugin type which can be called as QueryAuditAndGovernance (QAAG).

Why we need QAAG?

System access control (SAC) plugin generally provides solutions for access and control related matters ( such as can do etc) . However QAAG will provide solutions for query auditing and governance and if needed any manipulation in query in compliance with rules. So for example we will need to apply masking for column PHONENUMBER. Now this could be a rule and QAAG could change query to have masking function embedded in the query provided there are corresponding function plugins (UDFs). QAAG will have to do more than just rewriting for example it will have to handle the mappings between these functions and the policies . Also QAAG could be.right place for Query Auditing and LOGGING requirements. QAAG could send information to external auditing systems also it could even send details to a historic query DB if there is such a thing.

How will QAAG interface look like?

A rough look of interface may look like

public interface QueryAuditAndGovernance
{
    public String governquery(PreparedQuery preparedQuery, String query, String catalog, String schema, Identity identity);

    public void auditquery(PreparedQuery preparedQuery, String query, String catalog, String schema, Identity identity);
}

How could a contribution to common presto code look like?

  1. We could have a DEFAULT QAAG mode which is DO NOTHING ( similar to ALLOW ALL in SAC)

  2. We could have a STANDARD INBUILT QAAG mode which can read from STANDARD POLICY FILE which could be managed by community also anybody could change while building their engine .. Here Governance will be based on handwritten policy file

  3. We could also have a EXTERNAL QAAG PLUGIN ( every individual presto adopter can have their own implementation). They could use this and integrate with EXTERNAL POLICY STORES and EXTERNAL ADUITING SYSTEMS. A sample external plugin could also be part of this contribution6.

  4. Documentation of new plugin with examples

diagrammatic representation of the proposal

Please note that the REDs are GOV flows and GREEN are AUDIT flows

image

rschlussel commented 1 year ago

A prepared query is basically just a sql string (with any parameters replaced with appropriate values from prepared statements). Will string -> string be enough for you for QAAG? How will the governance plugin be able to understand the query? One reason we want to do these kinds of rewrites after analysis, is that having a semantic representation of the query allows rewrites to be more powerful than string to string manipulations without needing the plugin to reimplement a parser/analyzer.

skairali commented 1 year ago

@rschlussel I am open to suggestions here. If you can point to me a specific place where we can integrate this in a meaningful way - I can try something as well. However generally when I looked at the public abstract class PreparedQuery - thats giving most what we may need from an Audit and Governance perspective. Also another question in-front of us is - if we govern the query after the analysis and if the gov changes the query considerably - that means the cost of the query may be now different. Which means the analysis results might be not accurate anymore( may be my oversight but I think I should point it out).

Also @rschlussel Need your help to know whether this can be accepted as a genuine requirement ( if you and other experts think analyser refactoring is not going to give us an option to solve the problem ). I really think that we need a QAAG in presto

rschlussel commented 1 year ago

In general, yes, I think an auditing and governance framework is a great addition, we just need to make sure we get the design right.

The analysis phase doesn't compute the cost of the query, so that issue isn't a concern. Rather, it does semantic analysis of the query (e.g. checking all the types/functions, what tables you're writing to/ reading from and gets the metadata needed, expands view definitions, etc.). if you pass a prepared query to the governance plugin, you just have the AST. Analysis provides some semantic context for that AST. However, the flip side is that the analysis is kind of messy and not very consumable by anything other than the planner. That's why i'm wondering what kind of information we expect this plugin might need, so we can makes sure we create the right interfaces.

jainxrohit commented 1 year ago

I agree with Rebecca, I think the right place for this framework is after the analysis. The rewriting capabilities for a query is much enhanced once analysis is available. However we don't really have a good rewrite framework available at the moment. Can you please also share some concrete examples use-case for this framework?

skairali commented 1 year ago

@rschlussel @jainxrohit I agree with your suggestion. Few timeline related questions

1). By when do you think the pluggability in analysis phase/good rewrite framework. this may be available?

2). For my current requirement - I could develop a QAAD at the point I mentioned earlier. Do you think a capability (which can be turned off anytime) can be reviewed as a PR as an interim approach?

In general use cases are

1) Change the query according to Gov rules

2). Allow to plugin Audit and logging systems specifically for Queries (not the presto execution logs or auditing)

3). Any future use cases where can check compliance of queries ( such as checking whether the query results may be HIPPA compliant etc.. as a plug and play model based on the compliance framework the consumer has)

A very good rewrite capability will allow me to develop (1)

(2) and (3) may need just query and may be also to flag issues or send query to external system.

(3) is a distant use case. at least for me. (1) & (2) are immediate requirement.

jainxrohit commented 1 year ago

At this moment, we do not have any plans to add query rewrite framework on top of analysis in short to medium term. We have some ideas and we may have one on top of query analysis, but it wont be available any time soon.

Presto does offer some rewrite capabilities via Optimizers. Even thought it is plan rewrite, but it should work for many cases. I would like to understand if optimizers rules can work for your case.

jaystarshot commented 1 year ago

Extending @rschlussel 's point, a query plan will have even richer features and can be rewritten potentially. So can we make the interface take a PlanNode as the input instead of the Analysis? We can then allow the QAAG to be done after any set of optimizers instead of all. Basically after subtle logical plan divide between the optimizers where this rewrite could be done.

skairali commented 10 months ago

@jaystarshot what about scenarios where in

Presto may be able to send/outsource query and small details like who is the caller etc to a data security solution which can detect threat from these informations and say NO to further processing. In such cases QAAG may need to intercept even before planing phase

I am open to your suggestion as well.

Can you share a new interface example in this case?

skairali commented 9 months ago

@tdcmeehan. @jaystarshot I am talking about multiple exits in presto qiuery processing flow here

Lets assume that there is a near-real-time threat response workflows, and automated compliance auditing and reporting.

If such governance system is there which needs to

1) Know about all the queries from a particular IP or user 2) Need to be able to intercept at various stages of query processing ( raw, authenticated, parsed) etc

In this case - we need to be able to call the governance system at multiple places and that can record, block or audit or in some cases remediate risks by rewrites etc

This needs a Chained QAAG plugin type- a new plugin .

I can submit a sample of such a plugin soon - Please do let me know thoughts about this proposal.