Closed datapythonista closed 2 years ago
I can give this a shot.
@jreback were you ok with it?
i will have a look at the proposal soon
conceptually this is a good idea. However I also don't like the kwargs passing. I don't see any easy way around this, except putting all the named args in a biggish signature that unites the backends, but i don't think this is a good idea for other reasons.
I am not sure this is suffering from issues @datapythonista as you describe. in particular
It's not very intuitive, since ibis.impala seems to be a impala module in ibis, but it was originally ibis/impala/api.py, now it's ibis/backends/impala, and potentially it can be a module in any other package
a user doesn't know / care about this, it either works or it doesn't
The code becomes trickier, since we need to use getattr, load the backends dynamically, but ideally keeping the attributes f ibis introspectable, including backends
not so concerned about this we already have the code
The ibis namespace is already huge, mixing special attributes for backends makes things more complex
not so concerned about this
When a backend is misspelled or inexistent, the error message is not very intuitive, since we can't know if the user was trying to call a backend, or a regular attribute
THIS is the reason to do something.
so another possibility is to do
ibis.connect(engine='imapala').config()
where the connect is generic and dispatches to a config function (or better name) in each backend to get specific arguments.
@datapythonista any thoughts on the above? conceptually makes sense to me - in particular to not have kwargs.
I can take a stab if youre both in agreement on the proposal from @jreback .
As it relates to the naming the method after connect
an alternative to config
could be params
ibis.connect(engine='imapala').config()
where the connect is generic and dispatches to a config function (or better name) in each backend to get specific arguments.
I personally find this syntax as complicated and counter-intuitive as the existing one. ibis.connect(engine='impala')
won't be connecting, but returning a class or a module, similar as we are doing now with ibis.impala
. I think it does help fix the ambiguity of thinking that ibis.impala
is a module itself, but I find it unnecessarily cumbersome compared to ibis.connect('impala://user@impala/ibis_testing')
.
If you prefer the two calls approach, I think we should go with SQLAlchemy way:
engine = ibis.create_engine('impala://user@impala/ibis_testing')
connection = engine.connect()
Or a variant, moving the parameters to connect
, if you want to avoid the connection string:
engine = ibis.create_engine('impala')
connection = engine.connect(host='impala', user='me')
Feels like it's more verbose than needed and a bit cumbersome, but at least all objects are conceptually clear. The second option of moving the parameters to connect will be confusing for SQLAlchemy users, and while introspection will work, the documentation will still be split by backend.
So, my preference would be simply:
ibis.connect('impala://user@impala/ibis_testing')
But fine with the two previous approaches too, if you a good reason for a two call approach.
I could see this being a problem for the "credentials" argument to the BigQuery backend, but I imagine most people are using default credentials.
For custom authentication, we could potentially pass in URL-encoded JSON as a URL parameter in the connection string. Or perhaps string arguments for common authentication mechanisms like a path to service account key.
Thanks for the heads up @tswast, that's useful to know. I'm not so familiar with BigQuery, would something like this make sense?
ibis.connect('bigquery://project-id', credentials='credentials.json')
Or maybe credentials being a file descriptor, or the json string, or whatever.
Having **kwargs
in connect specific to each backend is something that I think we surely need to have. So, providing the credentials doesn't seem an issue to me even if they can't be easily added to the connection string.
What would be required, would be to have a connection string. I think you need to provide a project id, so I guess that the one in the connection string in the example could make sense. But what would feel a bit strange I think (not a technical problem, but a counter-intuitive API) would be if the project id or whatever is also in **kwargs
, since the syntax then would need to be ibis.connect('bigquery://', project_id='whatever', ...)
.
Yes, bigquery://project-id
makes sense to me. Honestly, I'd probably end up doing some refactoring to maybe share some connection string logic with the SQLAlchemy connector https://github.com/mxmzdlv/pybigquery#connection-string-parameters
'bigquery://some-project/some-dataset' '?'
'credentials_path=/some/path/to.json' '&'
'location=some-location' '&'
'arraysize=1000' '&'
'clustering_fields=a,b,c' '&'
'create_disposition=CREATE_IF_NEEDED' '&'
'destination=different-project.different-dataset.table' '&'
'destination_encryption_configuration=some-configuration' '&'
'dry_run=true' '&'
'labels=a:b,c:d' '&'
'maximum_bytes_billed=1000' '&'
'priority=INTERACTIVE' '&'
'schema_update_options=ALLOW_FIELD_ADDITION,ALLOW_FIELD_RELAXATION' '&'
'use_query_cache=true' '&'
'write_disposition=WRITE_APPEND'
That sounds great. And standardizing with SQLAlchemy sounds like a great idea.
Next are the parameters of the connect
method of each backend, to have them handy to see if it makes sense to use connection strings, or what a better API could be.
Connection string (host, user, database...)
Backend | connection_string | kwargs |
---|---|---|
bigquery | bigquery://project-id/dataset-id | credentials application_name auth_local_webserver auth_external_data auth_cache |
clickhouse | clickhouse+driver://user:password@host:port/database | client_name compression |
impala | impala+driver://user:password@host:port/database | timeout use_ssl ca_cert auth_mechanism kerberos_service_name pool_size hdf_client |
mysql | mysql+driver://user:password@host:port/database | |
postgres | postgres+driver://user:password@host:port/database | |
mssql | mssql+driver://user:password@host:port/database | odbc_driver |
omniscidb | omniscidb+driver://user:password@host:port/database | protocol session_id ipc gpu_device |
Connection string (path)
Backend | connection_string | kwargs |
---|---|---|
csv | csv+driver://path | |
hdf5 | hdf5+driver://path | |
parquet | parquet+driver://path | |
sqlite | sqlite+driver://path | create |
Python object Backend | params |
---|---|
dask | dict of dask dataframes |
pandas | dict of pandas dataframes |
arrow | dict of arrow recordbatches |
spark | spark_session |
pyspark | spark_session |
bigquery
clcikhouse
csv
dask
hdf5
impala
mysql
pandas
parquet
postgres
pyspark
spark
sqlite
mssql
omniscidb
I tried adding the needed entry_points
to the BigQuery backend in https://github.com/ibis-project/ibis-bigquery/pull/38, but I'm having the hardest time avoiding circular dependencies with the current discovery implementation.
A benefit of a connect()
method is that is a little more natural to do lazy evaluation for discovery. That is, don't try to discover all the backends at import time, try to discover them the first time connect()
is called.
Agree on that. But technically we can also evaluate lazily on __getattr__
of ibis
, and load the backends on ibis.bigquery
or equivalent. The reason we're not doing that are options. The next would fail, since the backend is not loaded and the option doesn't exist if the backends hasn't been loaded. That's the main reason I don't think we should have backend options.
ibis.options.bigquery.parition_col = 'foo'
ibis.bigquery.connect(connection_string)
On the circular imports, I fail to see how having the backend as a separate repo is affecting it. Everything should be the same, backends are being loaded as entrypoints if they are in this repo too. So, nothing should have changed I'd say. Maybe I'm missing something, but seems to me like the cause should be something else.
Lazy evaluation would definitely help. I think it's that in some of the tests I import ibis_bigquery
before I import ibis
. I'll look into that further next week
Connection string logic seems like a good approach here. There's a lot discussion on this issue, which I appreciate, but it's better to work out the details for something this complex against real code so folks can try it out and collaborate.
Currently, the API to connect to a backend, with a backend specific option, is as follows:
I think it has few drawbacks, in particular:
ibis.impala
seems to be a impala module in ibis, but it was originallyibis/impala/api.py
, now it'sibis/backends/impala
, and potentially it can be a module in any other package__getattr__
, load the backends dynamically, but ideally keeping the attributes fibis
introspectable, including backendsibis
namespace is already huge, mixing special attributes for backends makes things more complexAn alternative API that I personally find easier is:
I think this API reduces the magic significantly. There is a function that once called will look for the backend and connect to it. If the backend doesn't exist, we can provide a clear error message.
The only drawback I see is that the signature of
connect
won't be very specificibis.connect(engine, **kwargs)
. The backendconnect
function will have the parameters clearly documented, so I don't think it makes a big difference for the documentation. But for introspection we would have**kwargs
.Something that could make things simpler is to use a connection string instead of
**kwargs
:This should work directly with SQLAlchemy backends and OmniSciDB, I guess some backends may require to parse the url in the Ibis backend, but that doesn't seem like a big deal.
Somehow unrelated, I'd move backend specific options to the connection object, instead of having them as options. I think this will make things easier and clear, both for Ibis maintainers and for users.
I'd add the
ibis.connect
method for 2.0, and aFutureWarning
foribis.<backend>
, and remove the latter in Ibis 3.0.@jreback happy with this?