snowflakedb / snowflake-sqlalchemy

Snowflake SQLAlchemy
https://pypi.python.org/pypi/snowflake-sqlalchemy/
Apache License 2.0
231 stars 152 forks source link

SNOW-250319: Enhancement Request: Performance improvement for SnowflakeDialect._get_schema_columns #204

Closed beherap closed 1 year ago

beherap commented 3 years ago

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using (python --version)? Python 3.7.6
  2. What operating system and processor architecture are you using (python -c 'import platform; print(platform.platform())')? Darwin-19.4.0-x86_64-i386-64bit
  3. What are the component versions in the environment (pip list)?
    
    Package                    Version
    -------------------------- ---------
    asn1crypto                 1.4.0
    azure-common               1.1.26
    azure-core                 1.9.0
    azure-storage-blob         12.6.0
    boto3                      1.16.40
    botocore                   1.19.40
    certifi                    2020.12.5
    cffi                       1.14.4
    chardet                    3.0.4
    cryptography               3.3.1
    idna                       2.10
    isodate                    0.6.0
    jmespath                   0.10.0
    msrest                     0.6.19
    oauthlib                   3.1.0
    oscrypto                   1.2.1
    pip                        20.3.3
    pycparser                  2.20
    pycryptodomex              3.9.9
    PyJWT                      1.7.1
    pyOpenSSL                  19.1.0
    python-dateutil            2.8.1
    pytz                       2020.4
    requests                   2.23.0
    requests-oauthlib          1.3.0
    s3transfer                 0.3.3
    setuptools                 51.0.0
    six                        1.15.0
    snowflake-connector-python 2.3.7
    snowflake-sqlalchemy       1.2.4
    SQLAlchemy                 1.3.22
    urllib3                    1.25.11
    wheel                      0.36.2

4. What did you do?
I am trying to the list of columns for a list of snowflake tables in a schema in a `for` loop.

5. What did you expect to see?
The current methods `SnowflakeDialect._get_table_columns` and `SnowflakeDialect._get_schema_columns` use `information_schema.columns` table to get the list of columns which is much slower as compared to `show columns in schema <db>.<schema>` and `show columns in <db>.<schema>.<table_name>`

6. What did you see instead?
The execution of queries against `information_schema.columns` takes couple seconds as compared to `show columns` which takes only few milliseconds.

7. Can you set logging to DEBUG and collect the logs?
N/A
sfc-gh-mkeller commented 3 years ago

@beherap I will reach out to the internal team that deals with information_schema to see what they say. I was under the impression that these two work the same way.

Please feel free to ping me again if it looks like I forgot about this ticket!

beherap commented 3 years ago

@sfc-gh-mkeller Thank you for looking into this one. I refactored the code a bit to use the show columns statement. I am going to create a PR for this shortly.

beherap commented 3 years ago

@sfc-gh-mkeller Is it possible for you to add me as a contributor to the repo?

sfc-gh-mkeller commented 3 years ago

I’m sorry, I cannot!

On Dec 22, 2020, at 23:55, Pb notifications@github.com wrote:

 @sfc-gh-mkeller Is it possible for you to add me as a contributor to the repo?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

beherap commented 3 years ago

I’m sorry, I cannot! On Dec 22, 2020, at 23:55, Pb @.***> wrote:  @sfc-gh-mkeller Is it possible for you to add me as a contributor to the repo? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

@sfc-gh-mkeller Alright! Did you have a chance to connect with the internal team yet? Thanks!

datGnomeLife commented 3 years ago

See this How to article posted November 2020. It states show queries use a special cache for better performance, but with a limitation of 10k records. You already have some try catch logic in the snowdialect for the information catalog. If snowflake throws some sort indication that 10k threshold has been exceeded, could probably adapt it to first try with show and if it exceeds 10k then retry with the information catalog. Another advantage is "show" does not require a running warehouse to execute.

beherap commented 3 years ago

Hi @datGnomeLife! Thank you for the update. Sounds like the preference of using information_schema.columns over show columns is due to the 10k threshold limitation. Is that somewhat correct?

bastienboutonnet commented 3 years ago

I would also be interested in knowing why that is the case as I have a similar question to this one here: https://github.com/snowflakedb/snowflake-sqlalchemy/issues/221

github-actions[bot] commented 1 year ago

To clean up and re-prioritize bugs and feature requests we are closing all issues older than 6 months as of Apr 1, 2023. If there are any issues or feature requests that you would like us to address, please re-create them. For urgent issues, opening a support case with this link Snowflake Community is the fastest way to get a response