Change API to connect to a backend

datapythonista commented 3 years ago

Currently, the API to connect to a backend, with a backend specific option, is as follows:

import ibis

ibis.options.impala.temp_db = 'foo'

conn = ibis.impala.connect(host='impala',
                           database='ibis_testing',
                           hdfs_client=ibis.hdfs_connect(host='impala', port=50070))

I think it has few drawbacks, in particular:

It's not very intuitive, since ibis.impala seems to be a impala module in ibis, but it was originally ibis/impala/api.py, now it's ibis/backends/impala, and potentially it can be a module in any other package
The code becomes trickier, since we need to use __getattr__, load the backends dynamically, but ideally keeping the attributes f ibis introspectable, including backends
The ibis namespace is already huge, mixing special attributes for backends makes things more complex
When a backend is misspelled or inexistent, the error message is not very intuitive, since we can't know if the user was trying to call a backend, or a regular attribute

An alternative API that I personally find easier is:

import ibis

conn = ibis.connect(engine='impala,
                    host='impala',
                    database='ibis_testing',
                    hdfs_client=ibis.hdfs_connect(host='impala', port=50070),
                    temp_db='foo')

I think this API reduces the magic significantly. There is a function that once called will look for the backend and connect to it. If the backend doesn't exist, we can provide a clear error message.

The only drawback I see is that the signature of connect won't be very specific ibis.connect(engine, **kwargs). The backend connect function will have the parameters clearly documented, so I don't think it makes a big difference for the documentation. But for introspection we would have **kwargs.

Something that could make things simpler is to use a connection string instead of **kwargs:

conn = ibis.connect(conn_str='impala://user@impala/ibis_testing',
                    hdfs_client=ibis.hdfs_connect(host='impala', port=50070),
                    temp_db='foo')

This should work directly with SQLAlchemy backends and OmniSciDB, I guess some backends may require to parse the url in the Ibis backend, but that doesn't seem like a big deal.

Somehow unrelated, I'd move backend specific options to the connection object, instead of having them as options. I think this will make things easier and clear, both for Ibis maintainers and for users.

I'd add the ibis.connect method for 2.0, and a FutureWarning for ibis.<backend>, and remove the latter in Ibis 3.0.

@jreback happy with this?

matthewmturner commented 3 years ago

I can give this a shot.

@jreback were you ok with it?

jreback commented 3 years ago

i will have a look at the proposal soon

jreback commented 3 years ago

conceptually this is a good idea. However I also don't like the kwargs passing. I don't see any easy way around this, except putting all the named args in a biggish signature that unites the backends, but i don't think this is a good idea for other reasons.

I am not sure this is suffering from issues @datapythonista as you describe. in particular

It's not very intuitive, since ibis.impala seems to be a impala module in ibis, but it was originally ibis/impala/api.py, now it's ibis/backends/impala, and potentially it can be a module in any other package

a user doesn't know / care about this, it either works or it doesn't

The code becomes trickier, since we need to use getattr, load the backends dynamically, but ideally keeping the attributes f ibis introspectable, including backends

not so concerned about this we already have the code

The ibis namespace is already huge, mixing special attributes for backends makes things more complex

not so concerned about this

When a backend is misspelled or inexistent, the error message is not very intuitive, since we can't know if the user was trying to call a backend, or a regular attribute

THIS is the reason to do something.

so another possibility is to do

ibis.connect(engine='imapala').config() where the connect is generic and dispatches to a config function (or better name) in each backend to get specific arguments.

matthewmturner commented 3 years ago

@datapythonista any thoughts on the above? conceptually makes sense to me - in particular to not have kwargs.

I can take a stab if youre both in agreement on the proposal from @jreback .

As it relates to the naming the method after connect an alternative to config could be params

datapythonista commented 3 years ago

ibis.connect(engine='imapala').config() where the connect is generic and dispatches to a config function (or better name) in each backend to get specific arguments.

I personally find this syntax as complicated and counter-intuitive as the existing one. ibis.connect(engine='impala') won't be connecting, but returning a class or a module, similar as we are doing now with ibis.impala. I think it does help fix the ambiguity of thinking that ibis.impala is a module itself, but I find it unnecessarily cumbersome compared to ibis.connect('impala://user@impala/ibis_testing').

If you prefer the two calls approach, I think we should go with SQLAlchemy way:

engine = ibis.create_engine('impala://user@impala/ibis_testing')
connection = engine.connect()

Or a variant, moving the parameters to connect, if you want to avoid the connection string:

engine = ibis.create_engine('impala')
connection = engine.connect(host='impala', user='me')

Feels like it's more verbose than needed and a bit cumbersome, but at least all objects are conceptually clear. The second option of moving the parameters to connect will be confusing for SQLAlchemy users, and while introspection will work, the documentation will still be split by backend.

So, my preference would be simply:

ibis.connect('impala://user@impala/ibis_testing')

But fine with the two previous approaches too, if you a good reason for a two call approach.

tswast commented 3 years ago

I could see this being a problem for the "credentials" argument to the BigQuery backend, but I imagine most people are using default credentials.

For custom authentication, we could potentially pass in URL-encoded JSON as a URL parameter in the connection string. Or perhaps string arguments for common authentication mechanisms like a path to service account key.

datapythonista commented 3 years ago

Thanks for the heads up @tswast, that's useful to know. I'm not so familiar with BigQuery, would something like this make sense?

ibis.connect('bigquery://project-id', credentials='credentials.json')

Or maybe credentials being a file descriptor, or the json string, or whatever.

Having **kwargs in connect specific to each backend is something that I think we surely need to have. So, providing the credentials doesn't seem an issue to me even if they can't be easily added to the connection string.

What would be required, would be to have a connection string. I think you need to provide a project id, so I guess that the one in the connection string in the example could make sense. But what would feel a bit strange I think (not a technical problem, but a counter-intuitive API) would be if the project id or whatever is also in **kwargs, since the syntax then would need to be ibis.connect('bigquery://', project_id='whatever', ...).

tswast commented 3 years ago

Yes, bigquery://project-id makes sense to me. Honestly, I'd probably end up doing some refactoring to maybe share some connection string logic with the SQLAlchemy connector https://github.com/mxmzdlv/pybigquery#connection-string-parameters

    'bigquery://some-project/some-dataset' '?'
    'credentials_path=/some/path/to.json' '&'
    'location=some-location' '&'
    'arraysize=1000' '&'
    'clustering_fields=a,b,c' '&'
    'create_disposition=CREATE_IF_NEEDED' '&'
    'destination=different-project.different-dataset.table' '&'
    'destination_encryption_configuration=some-configuration' '&'
    'dry_run=true' '&'
    'labels=a:b,c:d' '&'
    'maximum_bytes_billed=1000' '&'
    'priority=INTERACTIVE' '&'
    'schema_update_options=ALLOW_FIELD_ADDITION,ALLOW_FIELD_RELAXATION' '&'
    'use_query_cache=true' '&'
    'write_disposition=WRITE_APPEND'

datapythonista commented 3 years ago

That sounds great. And standardizing with SQLAlchemy sounds like a great idea.

datapythonista commented 3 years ago

Next are the parameters of the connect method of each backend, to have them handy to see if it makes sense to use connection strings, or what a better API could be.

Connection string (host, user, database...)

Backend	connection_string	kwargs
bigquery	bigquery://project-id/dataset-id	credentials application_name auth_local_webserver auth_external_data auth_cache
clickhouse	clickhouse+driver://user:password@host:port/database	client_name compression
impala	impala+driver://user:password@host:port/database	timeout use_ssl ca_cert auth_mechanism kerberos_service_name pool_size hdf_client
mysql	mysql+driver://user:password@host:port/database
postgres	postgres+driver://user:password@host:port/database
mssql	mssql+driver://user:password@host:port/database	odbc_driver
omniscidb	omniscidb+driver://user:password@host:port/database	protocol session_id ipc gpu_device

Connection string (path)

Backend	connection_string	kwargs
csv	csv+driver://path
hdf5	hdf5+driver://path
parquet	parquet+driver://path
sqlite	sqlite+driver://path	create

Python object Backend	params
dask	dict of dask dataframes
pandas	dict of pandas dataframes
arrow	dict of arrow recordbatches
spark	spark_session
pyspark	spark_session

bigquery

project_id
dataset_id
credentials
application_name
auth_local_webserver
auth_external_data
auth_cache

clcikhouse

host
port
database
user
password
client_name
compression

csv

path

dask

dictionary

hdf5

path

impala

host
port
database
timeout
use_ssl
ca_cert
user
password
auth_mechanism
kerberos_service_name
pool_size
hdfs_client

mysql

host
user
password
port
database
url
driver

pandas

dictionary

parquet

dictionary

postgres

host
user
password
port
database
url
driver

pyspark

session

spark

spark_session

sqlite

path
create

mssql

host
user
password
port
database
driver
odbc_driver
url

omniscidb

uri
user
password
host
port
database
protocol
session_id
ipc
gpu_device

tswast commented 3 years ago

I tried adding the needed entry_points to the BigQuery backend in https://github.com/ibis-project/ibis-bigquery/pull/38, but I'm having the hardest time avoiding circular dependencies with the current discovery implementation.

A benefit of a connect() method is that is a little more natural to do lazy evaluation for discovery. That is, don't try to discover all the backends at import time, try to discover them the first time connect() is called.

datapythonista commented 3 years ago

Agree on that. But technically we can also evaluate lazily on __getattr__ of ibis, and load the backends on ibis.bigquery or equivalent. The reason we're not doing that are options. The next would fail, since the backend is not loaded and the option doesn't exist if the backends hasn't been loaded. That's the main reason I don't think we should have backend options.

ibis.options.bigquery.parition_col = 'foo'

ibis.bigquery.connect(connection_string)

On the circular imports, I fail to see how having the backend as a separate repo is affecting it. Everything should be the same, backends are being loaded as entrypoints if they are in this repo too. So, nothing should have changed I'd say. Maybe I'm missing something, but seems to me like the cause should be something else.

tswast commented 3 years ago

Lazy evaluation would definitely help. I think it's that in some of the tests I import ibis_bigquery before I import ibis. I'll look into that further next week

cpcloud commented 2 years ago

Connection string logic seems like a good approach here. There's a lot discussion on this issue, which I appreciate, but it's better to work out the details for something this complex against real code so folks can try it out and collaborate.

ibis-project / ibis

Change API to connect to a backend #2532