jorisvandenbossche opened this issue 10 years ago
If we want to tackle this for 0.15, we should start with this.
@hayd @mangecoeur @danielballan @JoergRittinger @artemyk @maxgrenderjones @aldanor @jreback
Some input would be useful. What do you think of the above questions? What would you want to be able to do with the object interface?
@jorisvandenbossche Let me throw out another possibility that occurred to me after some hacking away on SQL code:
- Create backend classes that abstract away differences between sqlalchemy and legacy mode (implementing methods like execute, drop_table, creation of tables, insert statements, appropriate representation of tables & metadata, etc.).
- Merge PandasSQLAlchemy and PandasSQLTable into a single class --- it's not really clear to me how useful the distinction between these two is. Drop the legacy versions of these classes. Let's just call this class PandasSQL for now.
- Give PandasSQL a property that points to the proper backend object (either sqlalchemy or legacy); have it call methods on this backend when it needs to actually hit the database. (A rough sketch of what I mean follows below.)
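For concreteness, a minimal sketch of the kind of split described above -- every class name and method body below is illustrative, not existing pandas code:

```python
# Illustrative sketch only: a single front-end class plus pluggable backends.

class SQLAlchemyBackend:
    """Backend for an sqlalchemy Engine."""

    def __init__(self, engine):
        self.engine = engine

    def execute(self, sql):
        return self.engine.execute(sql)


class FallbackBackend:
    """Backend for a plain DBAPI connection (the current 'legacy' mode)."""

    def __init__(self, con):
        self.con = con

    def execute(self, sql):
        cur = self.con.cursor()
        cur.execute(sql)
        return cur


class PandasSQL:
    """Single public object; picks the appropriate backend at construction."""

    def __init__(self, con):
        # crude duck-typing: sqlalchemy engines expose a `dialect` attribute
        if hasattr(con, "dialect"):
            self.backend = SQLAlchemyBackend(con)
        else:
            self.backend = FallbackBackend(con)

    def execute(self, sql):
        # delegate all actual database access to the backend
        return self.backend.execute(sql)
```

The point of the backend property is that PandasSQL itself stays backend-agnostic, and only the two small backend classes know whether they are talking to an sqlalchemy engine or a raw DBAPI connection.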
@artemyk unifying PandasSQLAlchemy and PandasSQLTable is not a good idea. I originally wrote some of that code and started with a unified model, and quickly found it was better to split so that you keep table-related state and methods with the table, and engine-related state and methods separately. (Actually, if memory serves, there was some parallel development by other people who also converged on the same solution.)
Additionally, it is more flexible if you want to add features or create subclasses. For instance, you might want to manipulate more than one table at a time with the same engine.
@jorisvandenbossche this is my input (I'm slowly getting back to dealing with this after a long period of other work).
Personally I'm happy with the architecture, especially with respect to being able to create subclasses to add custom functionality.
Being able to provide custom MetaData and Table definitions is important, I think, as well as being able to override the various type conversion routines. However, it would be nice to avoid cramming in too many keyword arguments, so it would be better to let people do this by subclassing rather than by passing optional parameters. We should make sure that the architecture is truly modular and that you can swap in your own classes as needed. The exception may be schema handling, where you currently need to pass in MetaData - but from recent commits it seems this will be handled through a schema parameter (I personally use this and so far have had to create a custom metadata object).
With respect to HDF5Store-like getitem, and in general HDF5-store-like patterns: I think we need to be wary of excessive consistency. There are major differences between how SQL and HDF work and what their features are, so while it makes sense that they should work in broadly familiar ways, we shouldn't force SQL to have a certain API just because HDF5 does.
So for example, I find that pandas_sql['table_name'] doesn't give the impression that you are starting a potentially heavy and slow IO task, while pandas_sql.read_table('table_name') makes it clear that you are triggering a read process.
As to how we see PandasSQLAlchemy and PandasSQLTable - both should be considered public for maximum flexibility, with the understanding that only advanced users should need them. PandasSQLAlchemy holds state and data relating to the database as a whole - so Engine + MetaData (since there is always a 1:1 relation between engine and metadata). PandasSQLTable holds state and data relating to a given data table. This might be extended (by us or by users) to also cover things like SQL Views.
Finally, for naming - I'm not great with names, but I like simply SQLDatabase and SQLTable, with corresponding SQLiteDatabase and SQLiteTable to replace Legacy... since legacy is now only SQLite anyway.
@mangecoeur good to have your input! (I wanted to ask you, as you wrote part of the current code base (also @danielballan I think?), but you already did.) I will try to comment later more in depth, but a question: regarding the subclassing/adapting functionality, can you give some concrete use cases? (maybe some code you have lying around that uses PandasSQLTable?) To see if we can improve things to make this easier/more modular. Some customizations you could want to do that I think of now:
- Providing a custom meta object is easy -> just create a PandasSQLEngine(engine, meta=meta).
- Providing a custom Table object: you can do this by overwriting the automatically generated table property, PandasSQLTable.table = my_custom_table, I suppose?
- Adapting the type conversion: overriding _sqlalchemy_type, I suppose (see the sketch below). However, this is not really a type conversion, only a way to specify the SQLAlchemy type in the Table. And this is also not that user friendly if you just want to adapt one type (you have to redefine the whole function).

Do you have other use cases?
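For the last point, a rough sketch of the subclassing pattern -- the exact signature of _sqlalchemy_type is an internal detail that may differ between versions, so treat this as the idea rather than guaranteed-working code:

```python
# Sketch only: override the dtype -> SQLAlchemy type mapping in a subclass,
# deferring to the default implementation for everything else.  The signature
# of _sqlalchemy_type shown here is an assumption.
from sqlalchemy import Text
from pandas.io.sql import PandasSQLTable


class TextColumnsTable(PandasSQLTable):
    """Force object (string) columns to TEXT, defer to the default otherwise."""

    def _sqlalchemy_type(self, col):
        if getattr(col, "dtype", None) == object:
            return Text
        return super(TextColumnsTable, self)._sqlalchemy_type(col)
```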
I think we should try to add an 'advanced' section in the docs to give some examples of these use cases.
And indeed, for solely specifying a schema, you don't need to specify a custom meta object anymore; this is handled by the schema keyword (as this is probably the biggest use-case for having to modify the meta).
@jorisvandenbossche some use cases for altering the internals of SQLEngine or SQLTable:
I'm sure there are more, and we don't have to implement any of this; just make sure that you can swap out one or the other of Engine or Table without it blowing up in your face, and that overriding in sub-classes is reasonably straightforward.
@mangecoeur +1 on all of those conversions, except the first/second. The user should simply cast the frame as appropriate. There are many ways to do this, and it should not be specific to SQL.
@jreback I think someone will find the need for both the first and second; never underestimate how weird someone's use case might be.
The point however is that we don't need to care whether they do or not, just that we make the system reasonably easy to subclass and customize. If people want to make something completely ridiculous with it afterwards, it's up to them :P
@mangecoeur yes, people do have odd things they try to do!
@mangecoeur The problem is, I think we have to care what people would do with it, as otherwise we could easily break things.
So I think we will have to outline some typical advanced usage cases, and test for them (and document them). Because along the way of adding features/modifying the internals of SQLTable, we can certainly break such usage cases if they rely on the specific implementation of the internals.
E.g. the example I gave of providing a custom Table in the way I said above (SQLTable.table = my_custom_table): I think this will not work anymore with current master, because of the way the meta object is handled.
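To make that concrete, the pattern under discussion is roughly the following sketch (the names and the final assignment are illustrative; whether that assignment still has any effect on master is exactly the open question):

```python
# Sketch of the "provide your own Table" pattern: build an explicit
# sqlalchemy Table instead of relying on the automatically generated one.
from sqlalchemy import Table, Column, Integer, String, MetaData

meta = MetaData()
my_custom_table = Table(
    'my_table', meta,
    Column('id', Integer, primary_key=True),
    Column('name', String(50)),
)

# then, on an already constructed SQLTable instance (call it sql_table),
# override the auto-generated definition -- this is the step that may no
# longer be honoured on current master:
# sql_table.table = my_custom_table
```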
@jorisvandenbossche true, so we need to make an internal API that will be modular, stable, and easy to replace. Can't say what that might be without diving in; no time just now, hopefully soon...
I think, for 0.15, I will try to do some of the renaming (SQLDatabase, SQLTable, ..), as I think we agree on this (and then the old names don't linger longer than needed). The other things discussed here will then be for a next release.
@artemyk I also think the distinction between Database and Table is good, but trying to abstract the differences between sqlalchemy and sqlite away could be a good idea, as it is now sometimes difficult to add something without duplicating a lot of code. Using something like a backend class would possibly be a way to do this; you can always try something and see how it works out. It is however not yet fully clear what the difference would be with the existing PandasSQLAlchemy and PandasSQLLegacy (or would it split the current class into mode-specific (sqlalchemy/sqlite) things and more general methods like to_sql/read_sql?).
This sounds good to me, @jorisvandenbossche. Sorry for offering only minimal input -- defending my thesis this fall!
@danielballan good luck! knock 'em dead!
@danielballan good luck! I put up #8440 with some of the renaming. Should there be deprecations of the old names? I did not do that now, as these were not yet publicized functions in the docs, but if someone thinks it is needed, I can add this.
@jreback @jorisvandenbossche I made an initial stab at reorganizing the backend code in #8562. It eliminates some redundancy, and SQLTable no longer depends on the backend type (instead, SQLDatabase and SQLiteDatabase handle all the dependencies). Thoughts?
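Very roughly, the structure this moves towards looks like the outline below (my paraphrase of the direction, not the actual code in the PR):

```python
# Outline only (not the code in #8562): the table object is backend-agnostic
# and talks to whichever database object it is handed.

class SQLDatabase:
    """sqlalchemy mode: owns the Engine (and MetaData)."""

    def __init__(self, engine, meta=None):
        self.engine = engine
        self.meta = meta

    def has_table(self, name):
        return self.engine.has_table(name)


class SQLiteDatabase:
    """Fallback mode: owns a plain DBAPI/sqlite connection."""

    def __init__(self, con):
        self.con = con

    def has_table(self, name):
        cur = self.con.execute(
            "SELECT name FROM sqlite_master WHERE type='table' AND name=?",
            (name,))
        return cur.fetchone() is not None


class SQLTable:
    """Backend-agnostic: all database access goes through `pd_sql`."""

    def __init__(self, name, pd_sql, frame=None):
        self.name = name
        self.pd_sql = pd_sql   # an SQLDatabase or SQLiteDatabase
        self.frame = frame

    def exists(self):
        # the table object never touches the engine/connection directly
        return self.pd_sql.has_table(self.name)
```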
Follow-up of #6300 to discuss the object-oriented API for the sql functions.
So there are two ways to interact with SQL databases:
- the high-level functions: read_sql_query, read_sql_table, to_sql
- the object-oriented API

The high-level functions are already quite polished (but there are still some issues with NaN, datetime, schema, ..), but the OO API should still be discussed and is not yet really public.
Current design:
- a PandasSQLAlchemy object (with execute, read_sql, read_table, to_sql, has_table, drop_table methods)
- a PandasSQLTable object to do the actual table reading/writing

Some questions remain:
- Should it be possible to provide your own MetaData / your own Table description instead of generating it automatically?
- Should the interface be more like HDF5Store? (getitem/setitem support? e.g. pandas_sql['table_name'] to read a table?)
- Should only PandasSQLAlchemy be public? Or also PandasSQLTable?
- Which methods of PandasSQLAlchemy should be public?
- PandasSQLAlchemy -> do we need a better name? ('SQLDatabase', 'SQLdb', 'SQLConnection', ...)
- (read_sql -> read_query?)
- What should be passed to construct a PandasSQLAlchemy? An Engine? Or a wrapper around Engine + MetaData?
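For reference, a rough sketch of how the OO usage could end up looking with the names floated above (SQLDatabase, read_query) -- purely illustrative, since settling this API is the point of this issue:

```python
# Purely illustrative: assumes the proposed renames (SQLDatabase, read_query)
# actually land; none of this is a settled public API.
import pandas as pd
from sqlalchemy import create_engine
from pandas.io.sql import SQLDatabase   # assuming the rename in #8440 lands

engine = create_engine('sqlite:///:memory:')
db = SQLDatabase(engine)                 # wraps the Engine (+ MetaData)

df = pd.DataFrame({'a': [1, 2, 3]})
db.to_sql(df, 'my_table')                # write a frame to a table
full = db.read_table('my_table')         # read a whole table back
part = db.read_query('SELECT a FROM my_table WHERE a > 1')
```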