jorisvandenbossche opened this issue 10 years ago
If we want to tackle this for 0.15, we should start with this.
@hayd @mangecoeur @danielballan @JoergRittinger @artemyk @maxgrenderjones @aldanor @jreback
Some input would be useful. What do you think of the above questions? What would you want to be able to do with the object interface?
@jorisvandenbossche Let me throw out another possibility that occurred to me after some hacking away on SQL code:
- Create backend classes that abstract away differences between sqlalchemy and legacy mode (implementing methods like execute, drop_table, creation of tables, insert statements, appropriate representation of tables & metadata, etc.).
- Merge PandasSQLAlchemy and PandasSQLTable into a single class --- it's not really clear to me how useful the distinction between these two is. Drop the legacy versions of these classes. Let's just call this class PandasSQL for now.
- Give PandasSQL a property that points to the proper backend object (either sqlalchemy or legacy); have it call methods on this backend when it needs to actually hit the database. (A rough sketch of what I mean follows below.)
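For concreteness, a minimal sketch of the kind of split described above -- every class name and method body below is illustrative, not existing pandas code:

```python
# Illustrative sketch only: a single front-end class plus pluggable backends.

class SQLAlchemyBackend:
    """Backend for an sqlalchemy Engine."""

    def __init__(self, engine):
        self.engine = engine

    def execute(self, sql):
        return self.engine.execute(sql)


class FallbackBackend:
    """Backend for a plain DBAPI connection (the current 'legacy' mode)."""

    def __init__(self, con):
        self.con = con

    def execute(self, sql):
        cur = self.con.cursor()
        cur.execute(sql)
        return cur


class PandasSQL:
    """Single public object; picks the appropriate backend at construction."""

    def __init__(self, con):
        # crude duck-typing: sqlalchemy engines expose a `dialect` attribute
        if hasattr(con, "dialect"):
            self.backend = SQLAlchemyBackend(con)
        else:
            self.backend = FallbackBackend(con)

    def execute(self, sql):
        # delegate all actual database access to the backend
        return self.backend.execute(sql)
```

The point of the backend property is that PandasSQL itself stays backend-agnostic, and only the two small backend classes know whether they are talking to an sqlalchemy engine or a raw DBAPI connection.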
@artemyk unifying PandasSQLAlchemy and PandasSQLTable is not a good idea. I originally wrote some of that code and started with a unified model, and quickly found it was better to split so that you keep table-related state and methods with the table, and engine-related state and methods separately. (Actually, if memory serves, there was some parallel development by other people who also converged on the same solution.)
Additionally, it is more flexible if you want to add features or create subclasses. For instance, you might want to manipulate more than one table at a time with the same engine.
@jorisvandenbossche this is my input (I'm slowly getting back to dealing with this after a long period of other work).
Personally I'm happy with the architecture, especially with respect to being able to create subclasses to add custom functionality.
Being able to provide custom MetaData and Table definitions is important, I think, as well as being able to override the various type conversion routines. However, it would be nice to avoid cramming in too many keyword arguments, so it would be better to let people do this by subclassing rather than by passing optional parameters. We should make sure that the architecture is truly modular and that you can swap in your own classes as needed. The exception may be schema handling, where you currently need to pass in MetaData - but from recent commits it seems this will be handled through a schema parameter (I personally use this and so far have had to create a custom metadata object).
With respect to HDF5Store-like getitem, and in general HDF5-store-like patterns: I think we need to be wary of excessive consistency. There are major differences between how SQL and HDF work and what their features are, so while it makes sense that they should work in broadly familiar ways, we shouldn't force SQL to have a certain API just because HDF5 does.
So for example, I find that pandas_sql['table_name'] doesn't give the impression that you are starting a potentially heavy and slow IO task, while pandas_sql.read_table('table_name') makes it clear that you are triggering a read process.
As to how we see PandasSQLAlchemy and PandasSQLTable - both should be considered public for maximum flexibility, with the understanding that only advanced users should need them. PandasSQLAlchemy holds state and data relating to the database as a whole - so Engine + MetaData (since there is always a 1:1 relation between engine and metadata). PandasSQLTable holds state and data relating to a given data table. This might be extended (by us or by users) to also cover things like SQL Views.
Finally, for naming - I'm not great with names, but I like simply SQLDatabase and SQLTable, with corresponding SQLiteDatabase and SQLiteTable to replace Legacy... since legacy is now only SQLite anyway.
@mangecoeur good to have your input! (I wanted to ask you, as you wrote part of the current code base (also @danielballan I think?), but you already did.) I will try to comment later more in depth, but a question: regarding the subclassing/adapting functionality, can you give some concrete use cases? (maybe some code you have lying around that uses PandasSQLTable?) To see if we can improve things to make this easier/more modular. Some customizations you could want to do that I think of now:
- Providing a custom meta object is easy -> just create a PandasSQLEngine(engine, meta=meta).
- Providing a custom Table object: you can do this by overwriting the automatically generated table property, PandasSQLTable.table = my_custom_table, I suppose?
- Adapting the type conversion: overriding _sqlalchemy_type, I suppose (see the sketch below). However, this is not really a type conversion, only a way to specify the SQLAlchemy type in the Table. And this is also not that user friendly if you just want to adapt one type (you have to redefine the whole function).

Do you have other use cases?
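For the last point, a rough sketch of the subclassing pattern -- the exact signature of _sqlalchemy_type is an internal detail that may differ between versions, so treat this as the idea rather than guaranteed-working code:

```python
# Sketch only: override the dtype -> SQLAlchemy type mapping in a subclass,
# deferring to the default implementation for everything else.  The signature
# of _sqlalchemy_type shown here is an assumption.
from sqlalchemy import Text
from pandas.io.sql import PandasSQLTable


class TextColumnsTable(PandasSQLTable):
    """Force object (string) columns to TEXT, defer to the default otherwise."""

    def _sqlalchemy_type(self, col):
        if getattr(col, "dtype", None) == object:
            return Text
        return super(TextColumnsTable, self)._sqlalchemy_type(col)
```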
I think we should try to add an 'advanced' section in the docs to give some examples of these use cases.
And indeed, for solely specifying a schema, you don't need to specify a custom meta object anymore; this is handled by the schema keyword (as this is probably the biggest use-case for having to modify the meta).
@jorisvandenbossche some use cases for altering the internals of SQLEngine or SQLTable:
I'm sure there are more, and we don't have to implement any of this; just make sure that you can swap out one or the other of Engine or Table without it blowing up in your face, and that overriding in sub-classes is reasonably straightforward.
@mangecoeur +1 on all of those conversions, except the first/second. The user should simply cast the frame as appropriate. There are many ways to do this, and it should not be specific to SQL.
@jreback I think someone will find the need for both the first and second; never underestimate how weird someone's use case might be.
The point however is that we don't need to care whether they do or not, just that we make the system reasonably easy to subclass and customize. If people want to make something completely ridiculous with it afterwards, it's up to them :P
@mangecoeur yes, people do have odd things they try to do!
@mangecoeur The problem is, I think we have to care what people would do with it, as otherwise we could easily break things.
So I think we will have to outline some typical advanced usage cases, and test for them (and document them). Because along the way of adding features/modifying the internals of SQLTable, we can certainly break such usage cases if they rely on the specific implementation of the internals.
E.g. the example I gave of providing a custom Table in the way I said above (SQLTable.table = my_custom_table): I think this will not work anymore with current master, because of the way the meta object is handled.
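To make that concrete, the pattern under discussion is roughly the following sketch (the names and the final assignment are illustrative; whether that assignment still has any effect on master is exactly the open question):

```python
# Sketch of the "provide your own Table" pattern: build an explicit
# sqlalchemy Table instead of relying on the automatically generated one.
from sqlalchemy import Table, Column, Integer, String, MetaData

meta = MetaData()
my_custom_table = Table(
    'my_table', meta,
    Column('id', Integer, primary_key=True),
    Column('name', String(50)),
)

# then, on an already constructed SQLTable instance (call it sql_table),
# override the auto-generated definition -- this is the step that may no
# longer be honoured on current master:
# sql_table.table = my_custom_table
```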
@jorisvandenbossche true, so we need to make an internal API that will be modular, stable, and easy to replace. Can't say what that might be without diving in; no time just now, hopefully soon...
I think, for 0.15, I will try to do some of the renaming (SQLDatabase, SQLTable, ..), as I think we agree on this (and then the old names don't linger longer than needed). The other things discussed here will then be for a next release.
@artemyk I also think the distinction between Database and Table is good, but trying to abstract the differences between sqlalchemy and sqlite away could be a good idea, as it is now sometimes difficult to add something without duplicating a lot of code. Using something like a backend class would possibly be a way to do this; you can always try something and see how it works out. It is however not yet fully clear what the difference would be with the existing PandasSQLAlchemy and PandasSQLLegacy (or would it split the current class into mode-specific (sqlalchemy/sqlite) things and more general methods like to_sql/read_sql?).
This sounds good to me, @jorisvandenbossche. Sorry for offering only minimal input -- defending my thesis this fall!
@danielballan good luck! knock 'em dead!
@danielballan good luck! I put up #8440 with some of the renaming. Should there be deprecations of the old names? I did not do that now, as these were not yet publicized functions in the docs, but if someone thinks it is needed, I can add this.
@jreback @jorisvandenbossche I made an initial stab at reorganizing the backend code in #8562. It eliminates some redundancy, and SQLTable no longer depends on the backend type (instead, SQLDatabase and SQLiteDatabase handle all the dependencies). Thoughts?
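Very roughly, the structure this moves towards looks like the outline below (my paraphrase of the direction, not the actual code in the PR):

```python
# Outline only (not the code in #8562): the table object is backend-agnostic
# and talks to whichever database object it is handed.

class SQLDatabase:
    """sqlalchemy mode: owns the Engine (and MetaData)."""

    def __init__(self, engine, meta=None):
        self.engine = engine
        self.meta = meta

    def has_table(self, name):
        return self.engine.has_table(name)


class SQLiteDatabase:
    """Fallback mode: owns a plain DBAPI/sqlite connection."""

    def __init__(self, con):
        self.con = con

    def has_table(self, name):
        cur = self.con.execute(
            "SELECT name FROM sqlite_master WHERE type='table' AND name=?",
            (name,))
        return cur.fetchone() is not None


class SQLTable:
    """Backend-agnostic: all database access goes through `pd_sql`."""

    def __init__(self, name, pd_sql, frame=None):
        self.name = name
        self.pd_sql = pd_sql   # an SQLDatabase or SQLiteDatabase
        self.frame = frame

    def exists(self):
        # the table object never touches the engine/connection directly
        return self.pd_sql.has_table(self.name)
```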
Follow-up of #6300 to discuss the object-oriented API for the sql functions.
So there are two ways to interact with SQL databases:
- the high-level functions: read_sql_query, read_sql_table, to_sql
- the object-oriented API

The high-level functions are already quite polished (but there are still some issues with NaN, datetime, schema, ..), but the OO API should still be discussed and is not yet really public.
Current design:
- a PandasSQLAlchemy object (with execute, read_sql, read_table, to_sql, has_table, drop_table methods)
- a PandasSQLTable object to do the actual table reading/writing

Some questions remain:
- Should it be possible to provide your own MetaData / your own Table description instead of generating it automatically?
- Should the interface be more like HDF5Store? (getitem/setitem support? e.g. pandas_sql['table_name'] to read a table?)
- Should only PandasSQLAlchemy be public? Or also PandasSQLTable?
- Which methods of PandasSQLAlchemy should be public?
- PandasSQLAlchemy -> do we need a better name? ('SQLDatabase', 'SQLdb', 'SQLConnection', ...)
- (read_sql -> read_query?)
- What should be passed to construct a PandasSQLAlchemy? An Engine? Or a wrapper around Engine + MetaData?
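For reference, a rough sketch of how the OO usage could end up looking with the names floated above (SQLDatabase, read_query) -- purely illustrative, since settling this API is the point of this issue:

```python
# Purely illustrative: assumes the proposed renames (SQLDatabase, read_query)
# actually land; none of this is a settled public API.
import pandas as pd
from sqlalchemy import create_engine
from pandas.io.sql import SQLDatabase   # assuming the rename in #8440 lands

engine = create_engine('sqlite:///:memory:')
db = SQLDatabase(engine)                 # wraps the Engine (+ MetaData)

df = pd.DataFrame({'a': [1, 2, 3]})
db.to_sql(df, 'my_table')                # write a frame to a table
full = db.read_table('my_table')         # read a whole table back
part = db.read_query('SELECT a FROM my_table WHERE a > 1')
```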