ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.08k stars 585 forks source link

Change API to connect to a backend #2532

Closed datapythonista closed 2 years ago

datapythonista commented 3 years ago

Currently, the API to connect to a backend, with a backend specific option, is as follows:

import ibis

ibis.options.impala.temp_db = 'foo'

conn = ibis.impala.connect(host='impala',
                           database='ibis_testing',
                           hdfs_client=ibis.hdfs_connect(host='impala', port=50070))

I think it has few drawbacks, in particular:

An alternative API that I personally find easier is:

import ibis

conn = ibis.connect(engine='impala,
                    host='impala',
                    database='ibis_testing',
                    hdfs_client=ibis.hdfs_connect(host='impala', port=50070),
                    temp_db='foo')

I think this API reduces the magic significantly. There is a function that once called will look for the backend and connect to it. If the backend doesn't exist, we can provide a clear error message.

The only drawback I see is that the signature of connect won't be very specific ibis.connect(engine, **kwargs). The backend connect function will have the parameters clearly documented, so I don't think it makes a big difference for the documentation. But for introspection we would have **kwargs.

Something that could make things simpler is to use a connection string instead of **kwargs:

conn = ibis.connect(conn_str='impala://user@impala/ibis_testing',
                    hdfs_client=ibis.hdfs_connect(host='impala', port=50070),
                    temp_db='foo')

This should work directly with SQLAlchemy backends and OmniSciDB, I guess some backends may require to parse the url in the Ibis backend, but that doesn't seem like a big deal.

Somehow unrelated, I'd move backend specific options to the connection object, instead of having them as options. I think this will make things easier and clear, both for Ibis maintainers and for users.

I'd add the ibis.connect method for 2.0, and a FutureWarning for ibis.<backend>, and remove the latter in Ibis 3.0.

@jreback happy with this?

matthewmturner commented 3 years ago

I can give this a shot.

@jreback were you ok with it?

jreback commented 3 years ago

i will have a look at the proposal soon

jreback commented 3 years ago

conceptually this is a good idea. However I also don't like the kwargs passing. I don't see any easy way around this, except putting all the named args in a biggish signature that unites the backends, but i don't think this is a good idea for other reasons.

I am not sure this is suffering from issues @datapythonista as you describe. in particular

It's not very intuitive, since ibis.impala seems to be a impala module in ibis, but it was originally ibis/impala/api.py, now it's ibis/backends/impala, and potentially it can be a module in any other package

a user doesn't know / care about this, it either works or it doesn't

The code becomes trickier, since we need to use getattr, load the backends dynamically, but ideally keeping the attributes f ibis introspectable, including backends

not so concerned about this we already have the code

The ibis namespace is already huge, mixing special attributes for backends makes things more complex

not so concerned about this

When a backend is misspelled or inexistent, the error message is not very intuitive, since we can't know if the user was trying to call a backend, or a regular attribute

THIS is the reason to do something.

so another possibility is to do

ibis.connect(engine='imapala').config() where the connect is generic and dispatches to a config function (or better name) in each backend to get specific arguments.

matthewmturner commented 3 years ago

@datapythonista any thoughts on the above? conceptually makes sense to me - in particular to not have kwargs.

I can take a stab if youre both in agreement on the proposal from @jreback .

As it relates to the naming the method after connect an alternative to config could be params

datapythonista commented 3 years ago

ibis.connect(engine='imapala').config() where the connect is generic and dispatches to a config function (or better name) in each backend to get specific arguments.

I personally find this syntax as complicated and counter-intuitive as the existing one. ibis.connect(engine='impala') won't be connecting, but returning a class or a module, similar as we are doing now with ibis.impala. I think it does help fix the ambiguity of thinking that ibis.impala is a module itself, but I find it unnecessarily cumbersome compared to ibis.connect('impala://user@impala/ibis_testing').

If you prefer the two calls approach, I think we should go with SQLAlchemy way:

engine = ibis.create_engine('impala://user@impala/ibis_testing')
connection = engine.connect()

Or a variant, moving the parameters to connect, if you want to avoid the connection string:

engine = ibis.create_engine('impala')
connection = engine.connect(host='impala', user='me')

Feels like it's more verbose than needed and a bit cumbersome, but at least all objects are conceptually clear. The second option of moving the parameters to connect will be confusing for SQLAlchemy users, and while introspection will work, the documentation will still be split by backend.

So, my preference would be simply:

ibis.connect('impala://user@impala/ibis_testing')

But fine with the two previous approaches too, if you a good reason for a two call approach.

tswast commented 3 years ago

I could see this being a problem for the "credentials" argument to the BigQuery backend, but I imagine most people are using default credentials.

For custom authentication, we could potentially pass in URL-encoded JSON as a URL parameter in the connection string. Or perhaps string arguments for common authentication mechanisms like a path to service account key.

datapythonista commented 3 years ago

Thanks for the heads up @tswast, that's useful to know. I'm not so familiar with BigQuery, would something like this make sense?

ibis.connect('bigquery://project-id', credentials='credentials.json')

Or maybe credentials being a file descriptor, or the json string, or whatever.

Having **kwargs in connect specific to each backend is something that I think we surely need to have. So, providing the credentials doesn't seem an issue to me even if they can't be easily added to the connection string.

What would be required, would be to have a connection string. I think you need to provide a project id, so I guess that the one in the connection string in the example could make sense. But what would feel a bit strange I think (not a technical problem, but a counter-intuitive API) would be if the project id or whatever is also in **kwargs, since the syntax then would need to be ibis.connect('bigquery://', project_id='whatever', ...).

tswast commented 3 years ago

Yes, bigquery://project-id makes sense to me. Honestly, I'd probably end up doing some refactoring to maybe share some connection string logic with the SQLAlchemy connector https://github.com/mxmzdlv/pybigquery#connection-string-parameters

    'bigquery://some-project/some-dataset' '?'
    'credentials_path=/some/path/to.json' '&'
    'location=some-location' '&'
    'arraysize=1000' '&'
    'clustering_fields=a,b,c' '&'
    'create_disposition=CREATE_IF_NEEDED' '&'
    'destination=different-project.different-dataset.table' '&'
    'destination_encryption_configuration=some-configuration' '&'
    'dry_run=true' '&'
    'labels=a:b,c:d' '&'
    'maximum_bytes_billed=1000' '&'
    'priority=INTERACTIVE' '&'
    'schema_update_options=ALLOW_FIELD_ADDITION,ALLOW_FIELD_RELAXATION' '&'
    'use_query_cache=true' '&'
    'write_disposition=WRITE_APPEND'
datapythonista commented 3 years ago

That sounds great. And standardizing with SQLAlchemy sounds like a great idea.

datapythonista commented 3 years ago

Next are the parameters of the connect method of each backend, to have them handy to see if it makes sense to use connection strings, or what a better API could be.

Connection string (host, user, database...)

Backend connection_string kwargs
bigquery bigquery://project-id/dataset-id credentials
application_name
auth_local_webserver
auth_external_data
auth_cache
clickhouse clickhouse+driver://user:password@host:port/database client_name
compression
impala impala+driver://user:password@host:port/database timeout
use_ssl
ca_cert
auth_mechanism
kerberos_service_name
pool_size
hdf_client
mysql mysql+driver://user:password@host:port/database
postgres postgres+driver://user:password@host:port/database
mssql mssql+driver://user:password@host:port/database odbc_driver
omniscidb omniscidb+driver://user:password@host:port/database protocol
session_id
ipc
gpu_device

Connection string (path)

Backend connection_string kwargs
csv csv+driver://path
hdf5 hdf5+driver://path
parquet parquet+driver://path
sqlite sqlite+driver://path create
Python object Backend params
dask dict of dask dataframes
pandas dict of pandas dataframes
arrow dict of arrow recordbatches
spark spark_session
pyspark spark_session

bigquery

clcikhouse

csv

dask

hdf5

impala

mysql

pandas

parquet

postgres

pyspark

spark

sqlite

mssql

omniscidb

tswast commented 3 years ago

I tried adding the needed entry_points to the BigQuery backend in https://github.com/ibis-project/ibis-bigquery/pull/38, but I'm having the hardest time avoiding circular dependencies with the current discovery implementation.

A benefit of a connect() method is that is a little more natural to do lazy evaluation for discovery. That is, don't try to discover all the backends at import time, try to discover them the first time connect() is called.

datapythonista commented 3 years ago

Agree on that. But technically we can also evaluate lazily on __getattr__ of ibis, and load the backends on ibis.bigquery or equivalent. The reason we're not doing that are options. The next would fail, since the backend is not loaded and the option doesn't exist if the backends hasn't been loaded. That's the main reason I don't think we should have backend options.

ibis.options.bigquery.parition_col = 'foo'

ibis.bigquery.connect(connection_string)

On the circular imports, I fail to see how having the backend as a separate repo is affecting it. Everything should be the same, backends are being loaded as entrypoints if they are in this repo too. So, nothing should have changed I'd say. Maybe I'm missing something, but seems to me like the cause should be something else.

tswast commented 3 years ago

Lazy evaluation would definitely help. I think it's that in some of the tests I import ibis_bigquery before I import ibis. I'll look into that further next week

cpcloud commented 2 years ago

Connection string logic seems like a good approach here. There's a lot discussion on this issue, which I appreciate, but it's better to work out the details for something this complex against real code so folks can try it out and collaborate.