trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.08k stars 2.91k forks source link

Extend AccessControl functionality to filter columns from SELECT * instead of throwing #7461

Closed youngchen7 closed 2 years ago

youngchen7 commented 3 years ago

The current behavior when a user doesn't have access to all columns is to throw AccessDenied on SELECT *. The user has to submit an explicit SELECT query omitting the denied column for it to succeed.

In the current implementation AccessControl#checkCanSelectFromColumns throw an AccessDenied error, while filterColumns is used to restrict the information shown on SHOW COLUMNS or DESCRIBE but does not restrict SELECT * results.

Would it be possible to extend the AccessControl interface/integration with the engine to allow configurable handling for access restricted columns?

Related discussion: https://github.com/trinodb/trino/pull/6017

findepi commented 3 years ago

FWIU, the behavior we implemented follows the SQL specification. What we're asking here about is not AccessControl behavior (it controls whether user has access), but an engine/language change, amending the meaning of SELECT *.

I am symphatizing with end-user's use-case though.

cc @martint @kokosing

A side note. Spec aside, one thing to consider could be -- if SELECT * returned only the accessible subset of columns, then user could do CREATE TABLE new_system.new_table AS SELECT * FROM old_system.old_table, they wouldn't know whether they did create a full copy of their data, or only part of it.

kokosing commented 3 years ago

CREATE TABLE new_system.new_table AS SELECT * FROM old_system.old_table, they wouldn't know whether they did create a full copy of their data, or only part of it.

I think it also applies to SELECT * FROM old_system.old_table, they would also like to know they queried all columns or not.

youngchen7 commented 3 years ago

Good point - although I think that in these cases the user is just going to end up querying or copying the full set of data available to them which shouldn't break any use cases right? They shouldn't have access to these columns, so whether it exists or not should be functionally the same to them.

On a side note, AccessControl#filterColumns hides column metadata from the user - wouldn't that behavior be in line with this proposal as well?

What if the implementation is done via extending AccessControl with, for example, "filterSelectColumns" and hooking it up in the engine - then whether this behavior is enabled/disabled can be left up to the AccessControl implementation. This should be relatively unobtrusive and easily customizable.

kokosing commented 3 years ago

The change you are proposing is more about semantic analysis, about what SELECT * means. Changes in access control and more or less an implementation detail.

electrum commented 3 years ago

The way we handle access control for SELECT * vs SHOW COLUMNS is actually inconsistent. We filter the columns in SHOW COLUMNS via AccessControl.filterColumns(), but SELECT * returns them all, and then the AccessControl.checkCanSelectFromColumns() check will expose the column names via the error message. Changing analysis to filter the column list through filterColumns() would be consistent and prevent this information leakage. There may be some downsides or implications that I'm not considering.

This change would let admins choose the behavior they want. If they simply implement checkCanSelectFromColumns(), then tables have all columns, but access is restricted (and the columns are visible via DESCRIBE and error messages). If they want to hide columns, then they implement filterColumns().

There is actually another related concept, which is hidden columns. Today, ConnectorTableMetadata always returns the full set of columns in the table, but some of the ColumnMetadata objects can have the hidden flag set, which means they do not appear in either SELECT * or DESCRIBE.

martint commented 3 years ago

From a standard SQL perspective, and as @findepi mentioned above, * represents all the columns in the underlying table. The permissions checks are based on whether the user has SELECT permissions on all the columns referenced in the query.

What you're proposing amounts to returning different metadata depending on who's querying a table. This is something connectors can already do today. ConnectorMetadata.getTableHandle() takes a ConnectorSession object which contains ConnectorIdentity getIdentity(), which, in turn, contains the information about the user. Based on that information, the connector can choose to present a subset of the columns as part of table metadata, so SELECT * would only render columns the user can see.

findepi commented 3 years ago

the connector can choose to present a subset of the columns as part of table metadata, so SELECT * would only render columns the user can see.

The SAC is engine-level, so the connector may or may not know which columns users can see. Also, i think there is a high value in making the experience consistent.

Unless re-reading the spec yields a different interpretation, I think we should add an engine level, opt-in feature to limit SELECT * to accessible columns only (including those with masks, of course).

youngchen7 commented 3 years ago

Using ConnectorMetadata.getTableHandle() to filter columns based on identity is definitely a viable approach. But I think @findepi and @electrum brought up good points regarding having this behavior be consistent with the SystemAccessControl.filterColumns that exists now. Having the same filtering be applied to both SELECT and SHOW/DESCRIBE makes a lot of sense in my opinion.

If we were to take the SAC approach, would it be better to reuse filterColumns or add another method to the interface?

Right now I'm thinking about adding the filtering in StatementAnalyzer.visitTable and StatementAnalyzer.visitInsert - basically wherever metadata.getTableHandle() is called in the analyzer. Let me know how that sounds as an initial approach.

martint commented 3 years ago

Unless re-reading the spec yields a different interpretation, I think we should add an engine level, opt-in feature to limit SELECT * to accessible columns only (including those with masks, of course).

No, the spec is very clear about this. * is, effectively, syntactic sugar for enumerating the files in a ROW type (given that the type of a table is MULTISET(ROW(...)).

If we're talking about filtering columns, we're not just talking about SELECT * -- there's nothing magic about it. Any direct column reference would be filtered out, too. It would need to be implemented as part of analysis, during table resolution (specifically, here: https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/sql/analyzer/StatementAnalyzer.java#L1278). So, for all intents and purposes, it would behave as if we had user-sensitive schemas, as I described above.

As an aside, I think there's an API design issue with having both filterColumns and checkCanSelectFromColumns methods. They both do similar, overlapping actions but can get out of sync. We should think about whether there's an opportunity to collapse them into a single method. We use filterColumns to hide columns in information_schema tables, and we're using that as a proxy to indicate "whether the user has SELECT permissions on those columns", which is how the spec describes what's visible in those tables.

findepi commented 3 years ago

I think there's an API design issue with having both filterColumns and checkCanSelectFromColumns methods. They both do similar, overlapping actions but can get out of sync.

Let's have a separate issue about this.

If we're talking about filtering columns, we're not just talking about SELECT * -- there's nothing magic about it. Any direct column reference would be filtered out, too. It would need to be implemented as part of analysis, during table resolution (specifically, here: https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/sql/analyzer/StatementAnalyzer.java#L1278). So, for all intents and purposes, it would behave as if we had user-sensitive schemas, as I described above.

Makes sense to me. This would mean the inaccessible columns are effectively not part of the table relation

I propose the toggle name hide-inaccessible-columns.

youngchen7 commented 3 years ago

(Copying discussion from slack channel for more context here)

martin 8 days ago

I’m still not convinced we want to make it configurable (vs something the connector controls). It’s yet another configuration option users have to deal with...

findepi 8 days ago

@martin which connectors would you expect this to implement this behavior? Do you see they should do this unconditionally, without a configuration toggle? What about the case when connector does not know what’s going to be available to the user (system access control) (if engine follows SQL specification out of the box, should connectors do too?)

martin 8 days ago

I’d expect this to be something an access control plugin implements. I’m wary of having too many configuration knobs. It makes the system less user friendly, harder to test, etc. in general, we should try to keep configuration options to a minimum (with an understanding that sometimes we have to because the engine is not yet smart enough to do something - eg tuning options)

Young Chen 8 days ago

I see, that makes sense. In this case what are your thoughts on exposing this as a new method in SystemAccessControl? This would allow specific implementations to toggle this behavior. However, this adds another challenge in that it complicates the contract that AccessControl implementations have to implicitly follow (like how checkCanSelectColumns should, but is not required to, be in sync with filterColumns)

findepi 6 days ago

I’d expect this to be something an access control plugin implements. I’m wary of having too many configuration knobs. @martin i agree, the fewer knobs the better. But that plugin would need to have a knob as well. Am i missing something? Also, which access control plugin(s) (connector AC / system AC) do you have in mind? A new implementation or amending an existing one(s)? (edited)

Young Chen 24 hours ago

I guess the "knob" in this case could be a new method in the AccessControl interface, and the control is just whether the plugin implements it or not or however the plugin wants to toggle this behavior. That would remove the knob from Trino and move it under the plugin. I am a little unclear on where it makes more sense to have the knob be located though - since it is a significant change in how the engine behaves would it be better off configured in Trino? I'm aiming to have a new implementation to integrate with AWS LakeFormation (probably as SystemAccessControl). I'll need to clean up the code, and we would most likely publish it as a plugin (similar to how Ranger publishes their own Trino plugin). If there's any example AccessControl implementations that would be beneficial to amend (maybe FileBased?) I'll do that as well. (edited)

findepi 19 hours ago

I am quite convinced this shouldn’t be an aspect of one chosen ConnectorAccessControl or SystemAccessControl. It seems to me Martin is quite convinced about the opposite though. I would recommend getting on the same page before any code is written, to avoid wasted work.

Young Chen 18 hours ago

Ah alright. So the two approaches we're trying to decide between are: Implement this as an engine-level option and have it be invisible from the AccessControl interface's perspective (this would most likely entail reusing filterColumns in the analyzer and adding a feature flag in Trino to control this) -or- Add another interface method to the AccessControl SPI and let AccessControl plugin implementations decide how/when to enable/disable this filtering behavior. Does that pretty much sum it up? If so I'll update the OSS issue with this context for better visibility & record keeping and hopefully we can get to a resolution there. (edited) New

kokosing 5 hours ago

Also think we should not use io.trino.security.AccessControl#filterColumns for this but introduce another method that will be only called for for filtering columns from SELECT * . Maybe io.trino.security.AccessControl#filterSelectableColumns which by default would return the same set of columns as input. For most existing access controls it will be sufficient. But that plugin would need to have a knob as well. Am i missing something? So knob would be turned on when `filterSelectableColumns is implemented.

findepi 4 hours ago

@kokosing why filterColumns is not appropriate?

kokosing 4 hours ago

Because then you need to have a knob somewhere. That method is used for metadata listing operation so you can see all accessible columns there, while not allowing to raise “access denied” in SELECT * when you try to access restricted columns.

kokosing 4 hours ago

I don’t want to change the semantic of the existing method. That would make it more complex to reason and to understand for anyone who wants to implement access control plugin. (edited)

findepi 3 hours ago

if the knob is turned, then in statement analyzer (probably) limit table columns to those visible (and filterColumns seems it can be used for this, without any changes)

Young Chen 3 minutes ago

I'm also not sure if filterSelectableColumns is the correct signature, even. As @martin mentioned in the OSS ticket it would filter not just SELECT but any reference to the column, essentially presenting it as a different schema depending on user access permissions.

Young Chen < 1 minute ago

Another point is if we add another method we will be adding to the problem of keeping these methods (filterColumns, checkCanSelectFromColumns, filterSelectableColumns) in sync. If a plugin implements these methods partially or inconsistently the end experience for the user could potentially be very confusing.

youngchen7 commented 3 years ago

Young Chen 19 days ago

I think a few core questions here @kokosing is: When we filter metadata, is there value in not filtering or denying access to the corresponding data? If there is, then separating these methods makes sense If there isn't, then the next decision to make is whether to throw or filter on data access (which suggest to me that we can consolidate these methods and then add a knob to control throwing AccessDenied vs filtering the column out)

kokosing 18 days ago

Let’s change filterSelectableColumns into filterTableSchema to address https://trinodb.slack.com/archives/CP1MUNEUX/p1619019281387300?thread_ts=1618261084.236200&cid=CP1MUNEUX Young Chen I'm also not sure if filterSelectableColumns is the correct signature, even. As @martin mentioned in the OSS ticket it would filter not just SELECT but any reference to the column, essentially presenting it as a different schema depending on user access permissions. From a thread in #dev | Apr 21st | View reply

kokosing 18 days ago

Another point is if we add another method we will be adding to the problem of keeping these methods (filterColumns, checkCanSelectFromColumns, filterSelectableColumns) in sync. If a plugin implements these methods partially or inconsistently the end experience for the user could potentially be very confusing.

No. filterColumns and checkCanSelectFromColumns should get only columns that are after filterTableSchema.

kokosing 18 days ago

When we filter metadata, is there value in not filtering or denying access to the corresponding data? If there is, then separating these methods makes sense

I think it is, because this the way it works to day. One might say that current “metadata filter” just allows all columns and then we throw access denied on restricted columns.

Young Chen 18 days ago

I'm okay with implementing this as another method in the AccessControl SPIs. @martin @findepi thoughts?

martin 3 days ago

The reasons I suggested making it part of the access control behavior are that: a global toggle doesn’t allow customization on a per-connector or per-access-control basis a configuration option requires someone to set the option — there’s no way to build a connector or access control that behaves that way unconditionally. Depending on whether we think it’s necessary to be able to customize that behavior per schema, or even per table, we need a new/updated API similar to filterColumns or a new API that returns a boolean indicating whether to filter inaccessible columns (similar to how we do things like supportsMissingColumnsOnInsert or usesLegacyTableLayouts in ConnectorMetadata) (edited)

Young Chen 3 days ago

Ah that's a good point. An engine level config would have less granularity than having this be another interface method in AccessControl. Since AccessControl interfaces are aware of schema/table level requests accesscontrol plugins could make this decision for themselves with higher granularity than if it were just a trino configuration flag. It seems to me like the two choices are: If we use an engine level configuration, we can enforce some more logical behaviors on AccessControl implementations (E.g. if filtering table schema is enabled, guarantee that it's filtered the same way as metadata). The downside here would be AccessControl plugins lose fine grained control and may have use cases that are unsupported If we delegate this to AccessControl plugins they get very fine grained control on exactly what to filter and when, but at the risk that the implementations end up doing some strange things that don't make sense (showing a different subset of columns when querying metadata vs table schema, IMO, would make very little sense) Going with 2 as martin suggests here makes sense to me. @findepi I think you were advocating for 1 more heavily - does my understanding align with what you're thinking? And what are your thoughts on this now?

findepi 3 days ago

I don’t understand this @martin’s reasoning yet:

a global toggle doesn’t allow customization on a per-connector or per-access-control basis

to me it sounds like we’re talking about “language feature” — something that is part of SQL spec and we’re deliberately departing from the spec here, as an option — i don’t see why i would want the behavior to be controlled on per-connector basis. I see it as a downside (a step against consistency from user perspective) unless per-connector configurability is really important.

a configuration option requires someone to set the option — there’s no way to build a connector or access control that

behaves that way unconditionally. if this was part of an access control implementation, it would also need to be enabled (as a property of an existing AC, or by enabling some new AC) i get the point about building a connector that would behave the desired way (unlike other connectors) — as long as this really is supposed to be a per-connector thing … that having said, it shouldn’t stop you @Young Chen from proceeding along the plan Martin set above.

martin 3 days ago

It’s a departure from the spec if we think about it as a general language feature. However, one could argue that connectors can behave in any way they want and do things that are not quite spec compliant (e.g., a connector choosing a different actual type in response to CREATE TABLE when it doesn’t support a given type, not supporting case-sensitive identifiers, on-demand schemas and tables, etc).

youngchen7 commented 3 years ago

(Slack discussion here: https://trinodb.slack.com/archives/CP1MUNEUX/p1618261084236200)

rodolfototaro commented 2 years ago

Hi all,

I has a similar need so I'm investigating on the code how to create a patch for that. I need at the same time:

I found out this open issue just now but I'm following a different approach and it seam to work, you can see the patch (just few lines) here:

https://github.com/trinodb/trino/compare/9f915ca4c2807e49c244acf828ca449a5415baaf...cuebiq:7461-select-star

My idea is to complete the feature making the behaviour configurable through the a session property like the following:

[
    {
        "sessionProperties": {
            "hide-inaccessible-columns": true
        }
    }
]

What do you think about?

Can you tell me if you see problem with my approach? In case you like it, can you suggest which existing unit tests could be a starting point for creating a test suite for this new behaviour?

thx

Rodolfo

youngchen7 commented 2 years ago

Hi,

Unfortunately I've had other work to prioritize, so I haven't gotten back to this issue recently.

One question regarding your change - looks like it's implemented in analyzeSelectAllColumns. What happens if the user specifies the columns explicitly? It seems like your solution won't hide the presence of the column in that case if I'm understanding correctly.

rodolfototaro commented 2 years ago

Hi,

Unfortunately I've had other work to prioritize, so I haven't gotten back to this issue recently.

One question regarding your change - looks like it's implemented in analyzeSelectAllColumns. What happens if the user specifies the columns explicitly? It seems like your solution won't hide the presence of the column in that case if I'm understanding correctly.

Hi, Thank you for your review, I fixed it as you can see here:

https://github.com/trinodb/trino/compare/55f9c7a9e3a143b61b1bb4b0d0bbfccaea0f8f03...cuebiq:49d5e74cdff7db3cef55e4d02228b9f4a5e8519a

Any suggestion is welcome

youngchen7 commented 2 years ago

Btw, realize I didn't link the alternative approach PR directly here (it was referenced from the original approach PR https://github.com/trinodb/trino/pull/7893) so linking now: https://github.com/trinodb/trino/pull/8752 - I think this one is similar to what you have, except the filter is only applied in analyzeTableOutputFields.

One thing I noticed in your latest change is when filtering in analyzeSelectSingleColumn, you check accessControl.checkCanSelectFromColumns - however when filtering in analyzeSelectAllColumns you check accessControl.filterColumns.

I believe per previous discussion these two SystemAccessControl methods are not guaranteed to be in sync:

The way we handle access control for SELECT vs SHOW COLUMNS is actually inconsistent. We filter the columns in SHOW COLUMNS via AccessControl.filterColumns(), but SELECT returns them all, and then the AccessControl.checkCanSelectFromColumns() check will expose the column names via the error message. Changing analysis to filter the column list through filterColumns() would be consistent and prevent this information leakage. There may be some downsides or implications that I'm not considering.

This change would let admins choose the behavior they want. If they simply implement checkCanSelectFromColumns(), then tables have all columns, but access is restricted (and the columns are visible via DESCRIBE and error messages). If they want to hide columns, then they implement filterColumns().

rodolfototaro commented 2 years ago

I created a pull request for addressing the issue https://github.com/trinodb/trino/pull/9991