modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0
9.91k stars 653 forks source link

Make Modin's distributed SQL connection object compatible with pandas. #4180

Open mvashishtha opened 2 years ago

mvashishtha commented 2 years ago

We can set the con parameter in Modin's read_sql to Modin's distributed SQL connection object, ModinDatabaseConnection. However, we can't do the same in pandas' read_sql, which requires the connection to be a "SQLAlchemy connectable, str, or sqlite3 connection".

read_sql in pandas has tested support for SQLAlchemy and sqlite3 connection objects. If read_sql gets a connection that is neither of those, it falls back to trying to use the object like it's a sqlite3 object, and warns

/usr/local/lib/python3.9/site-packages/pandas/io/sql.py:758: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy

I considered making Modin's connection class implement sqlalchemy.engine.Connectable, but that interface is deprecated. I can't find its source code in the sqlalchemy source code on github, but I used inspect to examine it and found:

Show `Connectable` interface ```python @util.deprecated_20_cls( ":class:`.Connectable`", alternative=( "The :class:`_engine.Engine` will be the only Core " "object that features a .connect() method, and the " ":class:`_engine.Connection` will be the only object that features " "an .execute() method." ), constructor=None, ) class Connectable(object): """Interface for an object which supports execution of SQL constructs. The two implementations of :class:`.Connectable` are :class:`_engine.Connection` and :class:`_engine.Engine`. Connectable must also implement the 'dialect' member which references a :class:`.Dialect` instance. """ def connect(self, **kwargs): """Return a :class:`_engine.Connection` object. Depending on context, this may be ``self`` if this object is already an instance of :class:`_engine.Connection`, or a newly procured :class:`_engine.Connection` if this object is an instance of :class:`_engine.Engine`. """ engine = None """The :class:`_engine.Engine` instance referred to by this :class:`.Connectable`. May be ``self`` if this is already an :class:`_engine.Engine`. """ def execute(self, object_, *multiparams, **params): """Executes the given construct and returns a :class:`_engine.CursorResult`. """ raise NotImplementedError() def scalar(self, object_, *multiparams, **params): """Executes and returns the first column of the first row. The underlying cursor is closed after execution. """ raise NotImplementedError() def _run_visitor(self, visitorcallable, element, **kwargs): raise NotImplementedError() def _execute_clauseelement(self, elem, multiparams=None, params=None): raise NotImplementedError() ```

Instead, sqlalchemy _engine.Connection will be the only object that features an execute() method, and _engine.Engine will be the only object that features connect() method.

We probably should provide a create_engine factory method that forwards all args and kwargs to the appropriate constructor. The resulting engine object should have a connect method that can also take args and kwargs. Our class should probably implement Python Database API Specification v2.0, i.e. DB-API 2.0. According to this StackOverflow answer, we just need to implement conn.cursor(), cursor.execute(), and eithercursor.fetchmany() or cursor.fetchall().

devin-petersohn commented 2 years ago

Ideally for a lot of these implementations, we can just pass them through.

vnlitvinov commented 2 years ago

Cross-reference: https://github.com/modin-project/modin/issues/4048