trinodb / trino-python-client

Python client for Trino
Apache License 2.0
309 stars 151 forks source link

[sqlalchemy] Improve the performance of TrinoDialect.get_view_names #267

Closed john-bodley closed 1 year ago

john-bodley commented 1 year ago

Description

Querying the information_schema.views table seems to be sub-performant for determining which entities are views—I speculate this is because all the view definitions need to be extracted/compiled. A more performant approach is to use information_schema.tables with the appropriate type filter.

For example for a very large schema at Airbnb (comprising of over 100k entities) the following query,

SELECT
    table_name
FROM 
    information_schema.tables
WHERE 
    table_schema = '<schema>' AND 
    table_type = 'VIEW'

took ~ 5 seconds, whereas:

SELECT
    table_name
FROM
    information_schema.views
WHERE    
    table_schema = '<schema>'

was still running after 10 minutes.

Note there were no prior integration tests for the TrinoDialect.get_view_names method so I added some to cover this logic and threw in a few extra for free which should cover all the various scenarios.

Non-technical explanation

Release notes

( ) This is not user-visible or docs only and no release notes are required. ( ) Release notes are required, please propose a release note for me. (x) Release notes are required, with the following suggested text:

* Improve the performance of get_view_names in SQLAlchemy. ({issue}`267`)
hovaesco commented 1 year ago

@hashhar are you aware from where the performance difference comes from? It might be that query to information_schema.tables do some kind of predicate pushdown due to table_type = 'VIEW'?

john-bodley commented 1 year ago

@mdesmet thanks for the review. I've addressed your comments.

mdesmet commented 1 year ago

@hashhar : PTAL I think we should make this change based on your feedback on Slack

hashhar For views we call metadata.getViews in InformationSchemaPageSource - that methods ends up retrieving view definitions as well. For tables we call listTables + listViews (in InformationSchemaPageSource) - neither of which need view definitions. So actually the tables is faster since it just lists things.