snowflakedb / snowflake-sqlalchemy

Snowflake SQLAlchemy
https://pypi.python.org/pypi/snowflake-sqlalchemy/
Apache License 2.0

SNOW-948287: Best practice for doing cross database joins using CORE #456

Open jasheldo opened 1 year ago

jasheldo commented 1 year ago

I am literally copying this from the ORM question because it's virtually identical, but the ORM approach doesn't work for CORE.

Question

Hi there, in Snowflake we often have multiple databases, with their own set of schemas, and tables inside those schemas. Ideally data shouldn't have to move between these databases nor schemas, but in practice they often do.

One use case is when there is valuable data spread across two different databases. In my example, I'll reference databases A and B, which contain the schemas ant and baboons, holding the tables apples and bananas, respectively.

Through a direct query using Snowflake SQL, you could implement a join like

select
    *
from A.ant.apples a
join B.baboons.bananas b
    on a.id = b.id

How can you implement a cross-database join like the above example using a single session through the CORE interface? My understanding is that a session can only point to one database. I'm not creating an app and the tables already exist, so ideally this would all happen through SQLAlchemy MetaData object reflection. Below are some examples of what I've tried.

The ideal approach would look like the following:

import os

from sqlalchemy import create_engine, MetaData, text, Table
from snowflake.sqlalchemy import URL
from sqlalchemy.sql import select, and_, or_, case, func, insert

# Note: no database, schema or warehouse is being selected at this time. They're not needed yet.

engine = create_engine(URL(
    account="my_awesome_account",
    user="my_user",
    password="my_password",
    role="my_awesome_role",
    ),
    )
metadata = MetaData()

metadata.reflect(engine, schema='a.ant', only=['apples',]) # doesn't work
metadata.reflect(engine, schema='a.ant', only=['a.ant.apples',]) # doesn't work
metadata.reflect(engine, only=['a.ant.apples',]) # doesn't work

Table("apples", metadata, schema='a.ant', autoload_with=engine) # doesn't work
Table("a.ant.apples", metadata, schema='a.ant', autoload_with=engine) # doesn't work

I've tried all kinds of combinations of cases with the database name and schema names too. None of them work. What I find silly about the above is that I often get stack traces saying I have to provide a fully qualified name since I'm not declaring a database. But I AM providing a fully qualified name. And not explicitly choosing a database is the point. I need to pull data from tables in multiple databases, so the reflected tables NEED TO BE FULLY QUALIFIED.

Can this be done through the CORE interface without dropping to SQL? If so, what's the best practice in doing this?

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using?

    Python 3.9.6

  2. What operating system and processor architecture are you using?

    macOS-14.0-arm64-arm-64bit

  3. What are the component versions in the environment (pip freeze)?

    appnope==0.1.3 asn1crypto==1.5.1 asttokens==2.4.0 backcall==0.2.0 certifi==2023.7.22 cffi==1.16.0 charset-normalizer==3.3.0 comm==0.1.4 cryptography==41.0.4 debugpy==1.8.0 decorator==5.1.1 exceptiongroup==1.1.3 executing==2.0.0 filelock==3.12.4 idna==3.4 importlib-metadata==6.8.0 ipykernel==6.25.2 ipython==8.16.1 jaraco.classes==3.3.0 jedi==0.19.1 jupyter_client==8.4.0 jupyter_core==5.4.0 keyring==24.2.0 matplotlib-inline==0.1.6 more-itertools==10.1.0 nest-asyncio==1.5.8 numpy==1.26.1 oscrypto==1.3.0 packaging==23.2 pandas==2.0.3 parso==0.8.3 pexpect==4.8.0 pickleshare==0.7.5 platformdirs==3.11.0 prompt-toolkit==3.0.39 psutil==5.9.6 ptyprocess==0.7.0 pure-eval==0.2.2 pyarrow==10.0.1 pycparser==2.21 pycryptodomex==3.19.0 Pygments==2.16.1 PyJWT==2.8.0 pyOpenSSL==23.2.0 python-dateutil==2.8.2 pytz==2023.3.post1 pyzmq==25.1.1 requests==2.31.0 six==1.16.0 snowflake-connector-python==3.3.1 snowflake-sqlalchemy==1.5.0 sortedcontainers==2.4.0 SQLAlchemy==1.4.49 sqlparse==0.4.4 stack-data==0.6.3 tomlkit==0.12.1 tornado==6.3.3 traitlets==5.11.2 typing_extensions==4.8.0 tzdata==2023.3 urllib3==1.26.18 wcwidth==0.2.8 zipp==3.17.0

  4. What did you do?

    The above example should be sufficient.

  5. What did you expect to see?

    Ideally, the fully qualified table objects, including the database, would have been instantiated as SQLAlchemy Table objects. In this way, I could then create a SQLAlchemy select statement such as select(a).select_from(a.join(b, a.c.col1 == b.c.col1)) and the corresponding SQL generated would be SELECT A.ant.apple.* FROM A.ant.apple JOIN B.baboons.bananas ON A.ant.apple.col1 = B.baboons.bananas.col1;.

Instead, the reflection step fails forcing me to choose a database which then prevents me from joining tables in different databases.
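For reference, here is a minimal sketch of the Core-side join I'm after, assuming the tables are declared by hand (hypothetical columns) instead of reflected, with the database folded into the schema argument. Whether the Snowflake dialect renders that dotted schema unquoted as database.schema is exactly what's at issue here:

    from sqlalchemy import Column, Integer, MetaData, Table, select

    metadata = MetaData()

    # Hypothetical column definitions; in practice these would come from reflection.
    apples = Table("apples", metadata, Column("col1", Integer), schema="A.ant")
    bananas = Table("bananas", metadata, Column("col1", Integer), schema="B.baboons")

    stmt = select(apples).select_from(
        apples.join(bananas, apples.c.col1 == bananas.c.col1)
    )
    print(stmt)  # the schema argument carries the database qualifier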

  1. Can you set logging to DEBUG and collect the logs?

    import logging
    
    for logger_name in ['snowflake.sqlalchemy', 'snowflake.connector']:
      logger = logging.getLogger(logger_name)
      logger.setLevel(logging.DEBUG)
      ch = logging.StreamHandler()
      ch.setLevel(logging.DEBUG)
      ch.setFormatter(logging.Formatter('%(asctime)s - %(threadName)s %(filename)s:%(lineno)d - %(funcName)s() - %(levelname)s - %(message)s'))
      logger.addHandler(ch)

I'm happy to provide this info but in private. It's on a company computer with company backend stuff that's not for the public eye.

jasheldo commented 1 year ago

Has anyone had a chance to review?

sfc-gh-pgifford commented 1 year ago

Jim - Do you have access to this: https://snowflakecomputing.atlassian.net/wiki/spaces/SKE/pages/1065292364/JOIN+Across+Databases+SYNONYM

This article describes alternatives for getting around this using session variables and, optionally, IDENTIFIER().

Details: the need is to join data from two databases (which are environment specific), such as:

SELECT cd.cust_id, ff.measure_nbr, ...
FROM ent_dev_edw.bas.cust_d cd
JOIN lcl_dev_wrk.aml.fact_f ff
    ON ff.cust_sk = cd.cust_sk
WHERE ...

Use variables:

set dbEntNm = 'ent_dev_edw.';
set dbLclNm = 'lcl_dev_wrk.';
set cdNm = $dbEntNm || 'bas.cust_d';
set ffNm = $dbLclNm || 'amt.fact_f';

...
FROM $cdNm cd -- or use this syntax: FROM TABLE($cdNm), or this syntax: FROM IDENTIFIER($cdNm)
JOIN $ffNm ff
    ON ff.cust_sk = cd.cust_sk
...

Honestly, it's a pain when you want flexibility for a lot of tables.
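From Python, the same SET/IDENTIFIER pattern can be sketched by preparing the statements with SQLAlchemy's text() and running them on a single connection (the engine below is assumed to exist; table and column names are taken from the example above):

    from sqlalchemy import text

    # Sketch only: compose the same SET + IDENTIFIER() statements shown above.
    statements = [
        text("set dbEntNm = 'ent_dev_edw.'"),
        text("set dbLclNm = 'lcl_dev_wrk.'"),
        text("set cdNm = $dbEntNm || 'bas.cust_d'"),
        text("set ffNm = $dbLclNm || 'aml.fact_f'"),
        text(
            "SELECT cd.cust_id, ff.measure_nbr "
            "FROM IDENTIFIER($cdNm) cd "
            "JOIN IDENTIFIER($ffNm) ff ON ff.cust_sk = cd.cust_sk"
        ),
    ]

    # With a live engine, each statement would run on one connection so the
    # session variables stay in scope:
    # with engine.connect() as conn:
    #     for stmt in statements:
    #         result = conn.execute(stmt)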

We do have a Synonym jira ticket out there and I will add your company to it.

-Pete

jasheldo commented 11 months ago

I hacked my way around it but because of how our org is set up, I'm unable to run the test suite to verify it doesn't break anything. If someone could help out, that'd be wonderful. All edits were made to snowdialect.py, attached here. snowdialect.py.zip

Bottom line: there are two sections in this file around reflection. I added a bit of code to check whether the schema has a . in it and, if so, to split the schema into database and schema. The information_schema is also fully qualified in the code by changing the subsequent query to an f-string that includes the database. Just run git diff against the attached file to see all the changes.
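The splitting logic described above can be sketched roughly like this (a hypothetical standalone helper, not the actual patch in snowdialect.py):

    def split_schema(schema):
        """Split a possibly database-qualified schema name.

        If the schema contains a '.', treat the part before the first dot
        as the database name, as in the patch described above.
        """
        if schema and "." in schema:
            database, schema = schema.split(".", 1)
            return database, schema
        return None, schema

    # e.g. split_schema("a.ant") -> ("a", "ant"); split_schema("ant") -> (None, "ant")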