trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Access control storm (performance hit) on AccessControlManager. Query queue fills up #15167

Closed. dprophet closed this issue 1 year ago

dprophet commented 1 year ago

With the Apache Ranger plugin PR, I am noticing very large performance hits when Trino is attached to data storage systems with lots of tables.

One use case is a PostgreSQL catalog where one of the schemas has 100,000 tables (yes, a real use case).

This code, https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/connector/system/jdbc/TableJdbcTable.java#L102 (public RecordCursor cursor), calls https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/security/AccessControlManager.java#L535 (public Set filterTables).

In my use case, I have 100k tables. That code is walking all 100k tables to see if they are filtered out.

This takes a long time. If any logging is turned on and you are using resource groups, the Trino query queue fills up, causing errors.

What's the purpose of the above code? It's really causing some issues at scale.

Praveen2112 commented 1 year ago

TableJdbcTable provides table information for tables across all catalogs. AccessControlManager performs a bulk check for all the tables under a given schema (if a filter on the schema is specified)... ConnectorAccessControl also gets a Set<SchemaTableName> to filter. I think Ranger's ConnectorAccessControl walks through all 100k tables individually to filter them, so we might need to fix it there to keep the queue from filling up.
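
To illustrate the difference between checking tables one by one and handling the whole set in bulk, here is a minimal sketch. It is not Trino or Ranger code: `SchemaTableName` here is a hypothetical stand-in record, `PolicyEngine` is an invented class that only counts evaluations, and the sketch assumes the policy decision depends on the schema alone (a real Ranger policy may be table-level, in which case this shortcut would not apply).

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

public class BulkFilterSketch {
    // Hypothetical stand-in for Trino's SchemaTableName; not the real class.
    record SchemaTableName(String schema, String table) {}

    // Hypothetical policy engine. Each call stands for one expensive
    // policy evaluation (plus any audit logging attached to it).
    static class PolicyEngine {
        final AtomicInteger evaluations = new AtomicInteger();
        private final Set<String> allowedSchemas;

        PolicyEngine(Set<String> allowedSchemas) {
            this.allowedSchemas = allowedSchemas;
        }

        boolean isSchemaAllowed(String schema) {
            evaluations.incrementAndGet();
            return allowedSchemas.contains(schema);
        }
    }

    // Per-table filtering: one policy evaluation for every table in the set.
    static Set<SchemaTableName> filterPerTable(PolicyEngine engine, Set<SchemaTableName> tables) {
        return tables.stream()
                .filter(t -> engine.isSchemaAllowed(t.schema()))
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }

    // Bulk filtering: one policy evaluation per distinct schema,
    // with the decision reused for every table in that schema.
    static Set<SchemaTableName> filterBulk(PolicyEngine engine, Set<SchemaTableName> tables) {
        Map<String, Boolean> decisions = new HashMap<>();
        Set<SchemaTableName> visible = new LinkedHashSet<>();
        for (SchemaTableName t : tables) {
            if (decisions.computeIfAbsent(t.schema(), engine::isSchemaAllowed)) {
                visible.add(t);
            }
        }
        return visible;
    }
}
```

Under that assumption, both methods return the same set, but the bulk version performs one evaluation per schema instead of one per table, which matters when a single schema holds 100k tables.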

dprophet commented 1 year ago

This can be closed. I changed the ranger-plugin to skip row filtering when the schema is information_schema. This is how FileBasedSystemAccessControl solves the problem.
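
For anyone hitting the same thing, the shape of that change is roughly as follows. This is only a sketch of the idea, not the actual ranger-plugin patch and not Trino's API: the class, the `rowFilterFor` method, and the `policyLookup` function are all hypothetical names made up for illustration.

```java
import java.util.Optional;
import java.util.function.Function;

public class InformationSchemaBypassSketch {
    static final String INFORMATION_SCHEMA = "information_schema";

    // Hypothetical hook: returns the row filter expression for a table, if any.
    // policyLookup stands for the expensive call into the policy engine.
    static Optional<String> rowFilterFor(
            String schema,
            String table,
            Function<String, Optional<String>> policyLookup) {
        // Metadata tables get no row filter, so skip the policy engine entirely.
        if (INFORMATION_SCHEMA.equals(schema)) {
            return Optional.empty();
        }
        return policyLookup.apply(schema + "." + table);
    }
}
```

The point of the early return is that metadata queries against information_schema never reach the policy engine, so they no longer pay the per-table evaluation and logging cost.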