ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
4.43k stars 544 forks source link

feat(mssql): use odbc #7317

Closed cpcloud closed 8 months ago

cpcloud commented 8 months ago

This PR moves the MSSQL backend to use ODBC as the underlying driver. This is necessary to support connecting to Azure services. The main user-facing API change is that you now need to set up ODBC to use the MSSQL backend.

Closes #6640. Closes #6012.

xref https://github.com/ibis-project/ibis/discussions/7306

inigohidalgo commented 8 months ago

Wow, that was so fast. Thank you so much for prioritziing this!

I think I am slowly getting it to work, I've had to modify the code a little bit to get it to accept the parameters I needed to. These are some very rough changes I made just to brute-force get the parameters through, I'm sure there's a cleaner way to do this.

https://github.com/inigohidalgo/ibis/commit/1805dda7624cb3860331ce7c61570be4d8b2761d

conn_ib = ibis.mssql.connect(
    host=None,
    port=None,
    user=None,
    query={"odbc_connect": f"Driver=ODBC+Driver+17+for+SQL+Server;SERVER={server};DATABASE={database}"},
    **connect_args
)

where connect_args are those I defined here #7306

It seems that the odbc_connect kwarg I first deleted in the script I posted in the discussion is actually making all the strings passthrough to pyodbc directly sqlalchemy docs

EDIT: just to confirm, using the code block I posted above using the version of the code in my branch, I already successfully queried a simple table! 🙌

cpcloud commented 8 months ago

@inigohidalgo Nice!

I incorporated most of your changes, they look reasonable to me, thank you for that!

inigohidalgo commented 8 months ago

I'm not a huge fan of passing

f"SERVER={server};DATABASE={database}"

as a string in the query field when ibis' API already offers those arguments as part of the connect functions. Those arguments get passed to self._build_alchemy_url but the output of that call returns the wrong format for pyodbc connections.

cpcloud commented 8 months ago

@inigohidalgo What does your ideal call to ibis.connect (or ibis.mssql.connect) look like? Maybe there's a way we can improve it.

inigohidalgo commented 8 months ago

I'm going to have a look around because from what I see I should be able to get the default _build_alchemy_url to work with the host and database parameters: https://stackoverflow.com/a/46101255/9807171

EDIT: although my host looks like "XXX-XXX-XXX.database.windows.net" instead of just a single "host", not sure if that makes a difference.

inigohidalgo commented 8 months ago

Yeah I'm not able to connect without passing "SERVER={server};DATABASE={database}" inside the raw string.

In my ideal call, the host and database parameters would be added to that query={"odbc_connect": f"Driver=ODBC+Driver+17+for+SQL+Server;SERVER={server};DATABASE={database}"} but that seems like a very specific implementation just for my usecase, and it's adding maintenance burden to ibis instead of being handled by sqlalchemy. I'm trying to see if there's a URL builder in SQLAlchemy which handles this but I'm not seeing it anywhere.

cpcloud commented 8 months ago

I'm not sure we can special case odbc_connect usage since it seems there are many ways to connect with odbc. In CI for example, we're using a combination of odbc driver parameters and user/host/password etc.

inigohidalgo commented 8 months ago

Absolutely, it wouldn't make any sense to have such a specific implementation, especially when it seems it SHOULD work from what I'm reading online.

According to this SQLAlchemy should be converting the connection string to the "SERVER={server};DATABASE={database}" but it isn't working.

Also apparently SQLAlchemy adds a parameter Trusted_Connection=Yes which I'm supposed to remove, but I'm having issues understanding where I'm supposed to remove it

@event.listens_for(engine, "do_connect")
def provide_token(dialect, conn_rec, cargs, cparams):
    # remove the "Trusted_Connection" parameter that SQLAlchemy adds
    cargs[0] = cargs[0].replace(";Trusted_Connection=Yes", "")

    # create token credential
    raw_token = azure_credentials.get_token(TOKEN_URL).token.encode("utf-16-le")
    token_struct = struct.pack(f"<I{len(raw_token)}s", len(raw_token), raw_token)

    # apply it to keyword arguments
    cparams["attrs_before"] = {SQL_COPT_SS_ACCESS_TOKEN: token_struct}
inigohidalgo commented 8 months ago

At worst since all our connections are going to be done through my kedro wrapper, I can handle this on my side, but that would make that wrapper implementation less of a general Ibis wrapper and more of an Ibis+Azure one.

cpcloud commented 8 months ago

I think to attach that event listener you'd do this:

con = ibis.mssql.connect(...)

@event.listens_for(con.con, "do_connect")
def provide_token(dialect, conn_rec, cargs, cparams):
    ...
inigohidalgo commented 8 months ago

That did the trick!

import struct
from sqlalchemy import create_engine, event
from sqlalchemy.engine.url import URL
from azure import identity
import ibis

SQL_COPT_SS_ACCESS_TOKEN = 1256  # Connection option for access tokens, as defined in msodbcsql.h
TOKEN_URL = "https://database.windows.net/"  # The token URL for any Azure SQL database

connection_string = "mssql://XXX.database.windows.net/XXX?driver=ODBC+Driver+17+for+SQL+Server"

conn = ibis.connect(connection_string)

azure_credentials = identity.DefaultAzureCredential()

@event.listens_for(conn.con, "do_connect")
def provide_token(dialect, conn_rec, cargs, cparams):
    # remove the "Trusted_Connection" parameter that SQLAlchemy adds
    cargs[0] = cargs[0].replace(";Trusted_Connection=Yes", "")

    # create token credential
    raw_token = azure_credentials.get_token(TOKEN_URL).token.encode("utf-16-le")
    token_struct = struct.pack(f"<I{len(raw_token)}s", len(raw_token), raw_token)

    # apply it to keyword arguments
    cparams["attrs_before"] = {SQL_COPT_SS_ACCESS_TOKEN: token_struct}

conn.table(name="INPUT_VARIABLES_CONN_PT", schema="IGD").to_pandas()

One issue I have run into is that I'm having to monkeypatch query["driver"] = "ODBC Driver 17 for SQL Server" inside the mssql.Backend.do_connect because it is being lost from the connection_string at some point.

Just a heads up I'm gonna be quite busy for the next 24-48h so I'll be a little bit less responsive.

cpcloud commented 8 months ago

One issue I have run into is that I'm having to monkeypatch query["driver"] = "ODBC Driver 17 for SQL Server"

I think you should be able to write

conn = ibis.connect(
    "mssql://XXX.database.windows.net/XXX",
    query=dict(driver="ODBC+Driver+17+for+SQL+Server"),
)

Any **kwargs you pass to connect should be forwarded on to do_connect.

Can you give that a try?

cpcloud commented 8 months ago

Given the complexity of using tokens, I think it makes sense to add a token_provider parameter to the MSSQL backend's do_connect so that you can define your token provider callable without having to manage sqlalchemy events or depend on creating an ibis backend first.

That would make the connection code look like this:

import struct
from sqlalchemy import create_engine, event
from sqlalchemy.engine.url import URL
from azure import identity
import ibis

SQL_COPT_SS_ACCESS_TOKEN = 1256  # Connection option for access tokens, as defined in msodbcsql.h
TOKEN_URL = "https://database.windows.net/"  # The token URL for any Azure SQL database

azure_credentials = identity.DefaultAzureCredential()

def provide_token(dialect, conn_rec, cargs, cparams):
    # remove the "Trusted_Connection" parameter that SQLAlchemy adds
    cargs[0] = cargs[0].replace(";Trusted_Connection=Yes", "")

    # create token credential
    raw_token = azure_credentials.get_token(TOKEN_URL).token.encode("utf-16-le")
    token_struct = struct.pack(f"<I{len(raw_token)}s", len(raw_token), raw_token)

    # apply it to keyword arguments
    cparams["attrs_before"] = {SQL_COPT_SS_ACCESS_TOKEN: token_struct}

connection_string = "mssql://XXX.database.windows.net/XXX?driver=ODBC+Driver+17+for+SQL+Server"
conn = ibis.connect(connection_string, token_provider=provide_token)
conn.table(name="INPUT_VARIABLES_CONN_PT", schema="IGD").to_pandas()
inigohidalgo commented 8 months ago

Indeed that worked!

What is the more commonly-recommended API, ibis.connect(resource) or ibis.backend.connect(host...)? It seems like as things stand I should be ready to start working using the ibis.connect API.

cpcloud commented 8 months ago

Start with ibis.connect. We try to make sure that it has parity with ibis.${backend}.connect, and it doesn't require you to remember specific argument names (user vs username, etc).

If there's something you can't do with ibis.connect that can do with ibis.${backend}.connect then it's probably a bug 😄

inigohidalgo commented 8 months ago

Regarding token_provider=provide_token, would that be an implementation on Ibis' side?

cpcloud commented 8 months ago

Regarding token_provider=provide_token, would that be an implementation on Ibis' side?

Yep.

It's basically this code at the end of ibis.backends.mssql.Backend.do_connect:

        if token_provider is not None:
            sa.event.listens_for(engine, "do_connect")(token_provider)
cpcloud commented 8 months ago

The main benefit is that you can put your token providing code wherever you want. It doesn't need to be right next to the connection.

inigohidalgo commented 8 months ago

Cool! Is that something you'd be okay with adding into Ibis? I assume there'll be more ppl who have the same issue but when I was searching the issues I couldn't find anyone with the same question as me.

cpcloud commented 8 months ago

Cool! Is that something you'd be okay with adding into Ibis?

Personally, yes, but I'll put up a separate PR for it where we can discuss whether folks think it's a good idea or not.

cpcloud commented 8 months ago

My hunch is that once people are able to use Azure services with ibis then it'll be an issue :)

cpcloud commented 8 months ago

One thing to note is that this is likely not going to be released until 8.0 because it's a pretty big change for MS SQL users (unclear how many of those we have).

It seems like many of the would-be MS SQL users aren't currently using ibis because they require odbc support, so for them this is going from zero to one, but for any existing users they'll have to migrate over to PyODBC.

inigohidalgo commented 8 months ago

this is likely not going to be released until 8.0 because it's a pretty big change for MS SQL users

Not a problem for us, it'll take a while for us to migrate our pipelines to Ibis so in the meantime I'm happy to just build using either :8.x.x or your mssql-pyodbc branch if it's yet to be merged.

cpcloud commented 8 months ago

Sweet, yeah, assuming this gets merged, it'll be done in the 8.x.x branch which we'll rebase on top of master every so often. That'll run through CI like everything else.

inigohidalgo commented 8 months ago

One final kink I see is the difference between Ibis' driver kwarg which defaults to mssql+pyodbc and then the driver I pass in the query parameter. Do you think it makes sense to somehow merge those in some way, or at least distinguish between these? Because I think pyodbc usually requires you to specify the driver version, the naming might be a little bit confusing.

cpcloud commented 8 months ago

Perhaps we can remove the query parameter?

inigohidalgo commented 8 months ago

And provide a pyodbc_driver parameter which gets added to the query inside do_connect?

cpcloud commented 8 months ago

Is the current driver parameter not enough for that? For example, why wouldn't we just stuff driver into a query parameter dict inside do_connect?

inigohidalgo commented 8 months ago

I think the driver you're using atm informs the connection string mssql+pyodbc which tells SQLAlchemy which dialect to use, whereas the pyodbc driver gets added to the end of the connection string and goes through to pyodbc.

cpcloud commented 8 months ago

I think the driver you're using atm informs the connection string mssql+pyodbc which tells SQLAlchemy which dialect to use, whereas the pyodbc driver gets added to the end of the connection string and goes through to pyodbc.

Ah, no, but that's how it used to be before this PR!

In this PR, query and driver are a bit redundant: driver is being stuffed into query.

A point of confusion is that there's the driver argument to URL.create and then the driver query parameter which has nothing to do with the first 😂

cpcloud commented 8 months ago

Hm, I might need to tinker with this a bit to work out how url-based and non-url-based connection works.

inigohidalgo commented 8 months ago

But at the moment, using ibis.connect, driver is everything before the :// whereas I am currently providing the ODBC driver using the query kwarg

https://github.com/ibis-project/ibis/pull/7317#issuecomment-1755038358

Would you keep the query argument in ibis.connect but then ibis.mssql.connect the driver arg would get added into the query dict?

cpcloud commented 8 months ago

But at the moment, using ibis.connect, driver is everything before the :// whereas I am currently providing the ODBC driver using the query kwarg

Those are two different drivers, unrelated to each other.

The first (mssql://) is SQLAlchemy's notion of a driver. I'm currently hardcoding that into _build_alchemy_url as mssql+pyodbc. This driver is a fixed value.

The second is the PyODBC notion of driver, which is the thing being put into query. The driver argument to do_connect corresponds to the PyODBC notion of driver.

inigohidalgo commented 8 months ago

Ok, that makes sense. I'm not sure if it might lead to a bit of confusion down the line, but for a start, keeping the query in ibis.connect but using the driver in do_connect to inform the PyODBC driver through the query could work.

cpcloud commented 8 months ago

I think to make query and driver less confusing will require some deeper work, across backends, to differentiate query parameters from connect_args.