[ENH] Allow passing MetaData (or meta kwargs) to high-level SQL functions

pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

https://pandas.pydata.org

BSD 3-Clause "New" or "Revised" License

43.19k stars 17.77k forks source link

[ENH] Allow passing MetaData (or meta kwargs) to high-level SQL functions #7441

Open aldanor opened 10 years ago

aldanor commented 10 years ago

Use case: I want to query an Oracle table but I'm not the owner, so meta.reflect(engine) would do nothing and always return an empty list of tables rendering read_sql_table useless. This works: meta.reflect(engine, schema='the_real_owner'), however there's no way of passing the metadata / schema to the high-level pandas.io.sql functions. One may also want to use a meta.reflect(engine, only=[...]) or meta.reflect(engine, oracle_resolve_synonyms=True) or any other dialect-specific argument.

PandasSQLAlchemy class already support passing meta in the constructor, so it's just a matter of a adding an extra keyword argument to the three io.sql.read_sql* functions. Or maybe allow passing through arbitrary kwargs to meta's reflect method, like read_sql_table(table, engine, reflect=dict(schema='my_schema', oracle_resolve_synonyms=True)) -- ugly but functional.

jorisvandenbossche commented 10 years ago

@aldanor on the only keyword you mentioned, this is already filed as #7396.

For passing a MetaData object to the read_sql_query/table functions, this was implemented partially in the past, but removed from the functional API (see discussion here: https://github.com/pydata/pandas/issues/6300#issuecomment-43001653 and commit https://github.com/pydata/pandas/commit/ad97d2740201f6b7687dfccebd85cc348e8bfbe3) to keep that as simple as possible for a first release (easier to add than to remove keywords). In any case, it is available in the OO API with PandasSQLAlchemy.

So basically, we just have to decide if we want to provide this functionality in both (read_* functions and object-interface), or only via the object (so that when you need this functionality, you should use the OO API).

aldanor commented 10 years ago

@jorisvandenbossche It may be just me, but I think it might make sense to allow the metadata object to be passed to the higher-level functions (or a set of arbitrary kwargs for reflect since there may be dialect-specific options) for those functions to be useful in setups more involved than the most basic ones.

In some environments (e.g. a typical corporate Oracle setup), you would absolutely have to pass the schema argument for reflect to work. And you may want to pass only in case your schema hosts ten thousand tables.

jorisvandenbossche commented 10 years ago

What would be the most common things you would want to adapt in the meta object? Of course the schema, but are there also other things? Can you give some examples?

Just asking as if the schema argument would be used a lot, I think we should provide it as a real keyword and not only through passing a MetaData object (most users should not start adapting a MetaData object themselves, so if a lot of users need the schema argument, we should provide it seperately of meta=..).

@mangecoeur @danielballan @hayd

aldanor commented 10 years ago

@jorisvandenbossche Do we consider the only argument being a special case here (as in, don't query the entire schema in read_sql functions)? As for the schema being a separate keyword, should the users be able to pass meta as well, so the function signature would contain schema=None, meta=None? If that doesn't clutter the API too much, that'd be nice.

As for myself, I sometimes have to use oracle_resolve_synonyms=True in MetaData#reflect or Table#__init__ when working with dblinks. I guess there exist a bunch of other useful dialect-specific options that I'm not aware of.

jorisvandenbossche commented 10 years ago

As I said above, the only keyword is already discussed in #7396, and it will be used by default (reflect only the database you are reading, and none for read_sql_query) so no seperate keyword needed I think? If you have further comments or remarks on that, you can comment there.

For the API, that seems a good option. Another one would be to provide a way to pass options directly to reflect with something like reflect_kwds={...}, but that seems a bit more ugly (and limits it to reflect). Passing things to Table I think is too advanced for read_sql, but maybe we can see how we can do this in the OO API.

Would you be interested to do a PR to add this?

aldanor commented 10 years ago

I'm by no means sqlalchemy expert, so if someone more skilled in this would do it, it would be a better idea :) Elsewise I can give it a try on the weekend, will have to figure out how to set up test environment etc for the sql test suite then.

Re: API, schema and meta keywords look fine, what about the only keyword when you pass the meta in? In this case the user is fully responsible for reflecting whatever he needs, right?

aldanor commented 10 years ago

@jorisvandenbossche Would someone help me out in setting up a test environment so that I can run all tests in test_sql.py without skipping? I mean usernames, passwords, databases, schemas for both mysql and postgresql. It's quite confusing by just looking at the code and it may make sense to put intitialization scripts somewhere in the test suite to make the whole thing more accessible to other developers (kind of like this: https://bitbucket.org/anthony_tuininga/cx_oracle/src/dcc80926aa86d524f36500a27a658d3b4c14cbb1/test/SetupTest.sql?at=default).

jreback commented 10 years ago

@aldanor that would close #7091

jorisvandenbossche commented 10 years ago

An overview:

sqlite: built-in and using in memory database -> nothing to set-up
mysql: root@localhost/pandas_nosetest -> database 'pandas_nosetest' with username 'root' and blank password, on host 'localhost'
postgresql: postgres@localhost/pandas_nosetest -> database 'pandas_nosetest' with username 'postgres' and blank password, on host 'localhost'

So no special schema or something, I think for both default values when you would create such a database locally. They are created here https://github.com/pydata/pandas/blob/master/.travis.yml#L99 with mysql -e 'create database pandas_nosetest;'

If you want tro try to set-up something like #7091 or what you linked to, certainly try! If we support specifying schema's, will will also have to test this, so the database set-up will have to be more advanced.

jorisvandenbossche commented 10 years ago

@aldanor Did it work out to look at this?

jreback commented 10 years ago

@jorisvandenbossche status?

jorisvandenbossche commented 10 years ago

I have unfortunately no time this week to do it myself, so pushing to 0.15

jorisvandenbossche commented 10 years ago

Example usages:

http://stackoverflow.com/questions/24189150/pandas-writing-dataframe-to-other-postgresql-schema
http://stackoverflow.com/questions/25173887/pandas-to-sql-cant-write-to-schema-besids-public-on-postgresql -> problem with appending/replacing with my desribed workaround

jorisvandenbossche commented 10 years ago

@aldanor I started implementing this, see #7952. For now only a schema support. Could you try this out to see if this solves your basic problems reported here?

jorisvandenbossche commented 9 years ago

This is partly implemented (schema kwarg, possibility to pass meta to SQLDatabase), but pushing to 0.15.1 to see if we want to provide more.

Homesteady commented 4 years ago

I'm resurrecting an ancient issue here, but is there any interest in including schema as an optional kwarg in read_sql? The read_sql wrapper is a convenient one, but the inability to pass a schema value to be used by read_sql_table limits the usefulness of this method in practice. I have a WIP PR adding this in, but it looks like this has been debated already...?

Anyways, it's an addition that I would find very useful, but I am new to contributing to Pandas, so someone feel free to please set me straight here. 🙏