influxdata / flightsql-dbapi

DB API 2 interface for Flight SQL with SQLAlchemy extras.
Apache License 2.0

Using Arrow ADBC Flight SQL driver? #10

Open · lidavidm opened this issue 1 year ago

lidavidm commented 1 year ago

This project is quite exciting to see!

I noticed that there's a handrolled Flight SQL client in the repo. Have you considered using the ADBC Flight SQL driver instead? I'm one of the Arrow contributors working on it, and it'll be on PyPI soon™ (I'm working on the 0.2.0 release cycle right now). That would give you a DBAPI-compatible interface to build on for things like the SQLAlchemy integration here.

Also, it would theoretically let you avoid a dependency on PyArrow if you are only using it with things like Polars. (However, some work would be needed to make that possible in the DBAPI interface.)
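For reference, a rough sketch of what that DBAPI-compatible path could look like once the driver is published (the package name, URI, and values below are placeholders and assumptions, not a finished recipe):

```python
# Sketch only: assumes the adbc_driver_flightsql Python package and its
# DB-API 2.0 wrapper around the Go Flight SQL driver.
import adbc_driver_flightsql.dbapi as flightsql

# The URI is a placeholder for whatever Flight SQL endpoint you're targeting.
conn = flightsql.connect("grpc+tls://example.host:443")
cur = conn.cursor()
cur.execute("SELECT 1")
print(cur.fetchall())
cur.close()
conn.close()
```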

brettbuddin commented 1 year ago

Howdy, @lidavidm. Yep, I found out about it through @jacobmarble. I cut a separate project, aimed directly at the SQLAlchemy use case, since this one is mostly oriented around the DB API bits. However, I ran into some situations with the ADBC Flight SQL driver where get_objects queries were failing with "UNKNOWN ERROR" type errors. I haven't resurfaced that work yet, but I'll try to get it pushed up soon and cross-linked here.

The aim of this project was to provide a very minimal installation setup to support Apache Superset. As it stands, I think the documentation for getting the ADBC driver compiled is a bit of a stretch for less technical users. Once that's more developed, though, I could see the majority of this project being sunsetted in favor of one that only contains the SQLAlchemy dialect parts and uses the ADBC driver for DB API support.

lidavidm commented 1 year ago

Great, and that makes total sense.

Do you have a repro for the GetObjects issue? I got IOx set up (I think) but saw none of the underlying Flight SQL metadata endpoints were implemented. (Or feel free to file an issue on the ADBC repo.)

There are nightly wheels available for the driver, FWIW. But yes, I don't expect most people will want to deal with that or building from source. Once it's on PyPI, hopefully we can see what's missing - I'll be working to extend the driver to cover these use cases.

brettbuddin commented 1 year ago

Excellent. I'll go back through where I left off on the ADBC driver work a few weeks ago and post the errors I see. Stay tuned...

brettbuddin commented 1 year ago

> Do you have a repro for the GetObjects issue? I got IOx set up (I think) but saw none of the underlying Flight SQL metadata endpoints were implemented. (Or feel free to file an issue on the ADBC repo.)

The IOx project itself doesn't have much of the Flight SQL protocol implemented at the moment, but we do support metadata calls in InfluxDB Cloud. We intercept those calls and issue ad-hoc queries down to IOx until it implements them natively.

Here's a quick snippet of the output I'm getting when running it against IOx in InfluxDB Cloud:

```
tests/test_example.py:50:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
venv/lib/python3.10/site-packages/sqlalchemy/engine/reflection.py:266: in get_table_names
    return self.dialect.get_table_names(
adbc_flightsql_sqlalchemy/__init__.py:39: in get_table_names
    return connection.connection.adbc_get_objects(
venv/lib/python3.10/site-packages/adbc_driver_manager/dbapi.py:381: in adbc_get_objects
    handle = self._conn.get_objects(
adbc_driver_manager/_lib.pyx:627: in adbc_driver_manager._lib.AdbcConnection.get_objects
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   adbc_driver_manager._lib.Error: ADBC_STATUS_UNKNOWN (1): [FlightSQL] Internal: SqlState:

adbc_driver_manager/_lib.pyx:371: Error
========================================================================== warnings summary ==========================================================================
tests/test_example.py::test_example2
  /Users/brett/Code/influxdata/adbc-flightsql-sqlalchemy/venv/lib/python3.10/site-packages/adbc_driver_manager/dbapi.py:271: Warning: Cannot disable autocommit; conn will not be DB-API 2.0 compliant
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================================== short test summary info =======================================================================
FAILED tests/test_example.py::test_example2 - adbc_driver_manager._lib.Error: ADBC_STATUS_UNKNOWN (1): [FlightSQL] Internal: SqlState:
==================================================================== 1 failed, 1 warning in 0.60s ====================================================================
gmake: *** [Makefile:49: test] Error 1
```

You can disregard the ceremony around the call; underneath it's just an adbc_get_objects(depth="tables") call. In the scenario above I had vendored your Flight SQL driver code, because the PyPI package wasn't published at the time.

Issuing the same call to the SQLite example server doesn't have this problem, though, so I'm sure it's something we're doing that looks odd to the driver. Unfortunately, the error message isn't all that helpful, so it'll take some time to sort out what's going on here.
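For reference, the failing call boils down to roughly the following (a stripped-down sketch, not the vendored code itself; the host and token are placeholders, and I'm assuming the call hands back a pyarrow RecordBatchReader):

```python
# Minimal repro sketch of the call that fails against IOx in InfluxDB Cloud.
import adbc_driver_flightsql.dbapi as flightsql

conn = flightsql.connect(
    "grpc+tls://<cloud-host>:443",
    db_kwargs={"adbc.flight.sql.authorization_header": "Bearer <token>"},
)

# Same GetObjects call the SQLAlchemy dialect makes in the traceback above.
reader = conn.adbc_get_objects(depth="tables")  # raises ADBC_STATUS_UNKNOWN here
print(reader.read_all())  # assuming a pyarrow RecordBatchReader is returned
conn.close()
```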

lidavidm commented 1 year ago

Ah, ok, thanks. I'll see if I can test against the demo of the Cloud version sometime. (Agreed that the error message is worthless.)

zeroshade commented 1 year ago

@lidavidm I guess once we determine the cause of this error, we can try to beef up the error reporting in that spot so the message isn't so worthless.

lidavidm commented 1 year ago

Yes for sure (frankly that's the first thing I'm concerned about more than the incompatibility itself :sweat_smile:)

brettbuddin commented 1 year ago

It's just as likely that we're doing something odd in our responses from Cloud. It feels like Flight SQL is a bit of the Wild West at the edges. This could be a case where we interpreted the specification differently than some clients are expecting. This library and the Grafana plugin we're building seem to function for our limited requirements when aimed at the SQLite reference implementation or InfluxDB Cloud. However, there's still work to be done to ensure that our responses are compatible with JDBC and ODBC clients paired with things like Power BI and Tableau.

Today, instead of using the vendored version of the ADBC driver I'll make sure I use the PyPI package directly. I'll also add some logging into the Go client to see if I can get some insight into what's going on. Will post my findings.

lidavidm commented 1 year ago

Thanks! The PyPI package is not quite ready, so hold tight for a few more days there.

> It feels like Flight SQL is a bit of the Wild West at the edges.

Dremio is still (AFAIK) the main vendor implementing it, but I'm happy to see InfluxData also investigating it. This is not quite the right place for this discussion, but I think we should start looking at a Flight SQL integration or compatibility suite - one that actually tries to test some semantics (the integration tests in Arrow are mostly just basic smoke tests against the Flight SQL library, not the implementation on top) as well as the JDBC/ODBC/ADBC drivers. @zeroshade what do you think?

brettbuddin commented 1 year ago

@lidavidm We've been having discussions like this internally, too. I would love to see it exist. We ran into some incongruence between how some clients handle Handshake and what the Go server interceptor assumes. Authn is one of the places where the specification can't make too many assumptions about the implementation, and it purposely leaves some doors open to interpretation. Hopefully a testing strategy like the one you describe would shine a light on some of these things.

lidavidm commented 1 year ago

Auth will always be a little weird, unfortunately. That's why the ADBC driver has an 'authorization_header' escape hatch: you can do whatever auth workflow you need externally and pass the driver a bearer token to use instead. But I think it behooves us to at least describe the existing workflow better. (That applies to Flight in general too, and I know there's a ticket buried deep in the Arrow backlog about that.)
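A sketch of that escape hatch, assuming the driver's 'authorization_header' database option key as passed through db_kwargs (the auth helper below is hypothetical; the point is that the token exchange happens entirely outside the driver):

```python
# Do whatever auth workflow you need externally, then hand the driver a
# ready-made Authorization header value to attach to its gRPC calls.
import adbc_driver_flightsql.dbapi as flightsql

token = external_auth_workflow()  # hypothetical helper: OAuth, SSO, etc.

conn = flightsql.connect(
    "grpc+tls://example.host:443",
    db_kwargs={"adbc.flight.sql.authorization_header": f"Bearer {token}"},
)
```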

zeroshade commented 1 year ago

@lidavidm I agree completely. It would be great to have a full compatibility suite that users/consumers could use to confirm their implementations are compatible (though obviously it would be better and easier if they just used the Flight SQL libraries provided in Arrow... but that's not always possible or expected).

lidavidm commented 1 year ago

Well, even if you use the Arrow Flight SQL libraries, you still need to make sure the higher-level semantics conform. I think that's the concern here - e.g. whatever is off between the actual InfluxDB implementation of getCatalogs/getDbSchemas/getTables and what the ADBC/JDBC drivers expect.

lidavidm commented 1 year ago

FWIW, I get this with GetObjects:

```
adbc_driver_manager._lib.Error: ADBC_STATUS_UNKNOWN (1): [FlightSQL] endpoint 0 returned inconsistent schema: expected schema:
  fields: 2
    - catalog_name: type=utf8, nullable
    - db_schema_name: type=utf8 but got schema:
  fields: 2
    - catalog_name: type=utf8
    - db_schema_name: type=utf8
```

which is maybe overly strict on the part of the driver

zeroshade commented 1 year ago

That would explain it. We've had issues in the past with implementations having different nullability in their schemas. Unfortunately, all of our schema comparison mechanisms enforce that nullability be the same for a field.
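To illustrate in pyarrow (the driver itself is Go, but Arrow schema equality is similarly strict about nullability there), here's roughly the mismatch from the error above:

```python
import pyarrow as pa

# What the driver expects: catalog_name may be null.
expected = pa.schema([
    pa.field("catalog_name", pa.string(), nullable=True),
    pa.field("db_schema_name", pa.string(), nullable=False),
])

# What the server's stream declares: catalog_name marked non-nullable.
actual = pa.schema([
    pa.field("catalog_name", pa.string(), nullable=False),
    pa.field("db_schema_name", pa.string(), nullable=False),
])

# Schema/field equality includes nullability, so these compare unequal even
# though the same data would be valid under both.
print(expected.equals(actual))  # -> False
```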

lidavidm commented 1 year ago

Yeah, I think we can relax it here (it should be OK for the expected schema to specify nullable while the physical data does not). I think we probably want to avoid anything more than that for now, though (we could just cast...).