pacman82 / arrow-odbc-py

Read Apache Arrow batches from ODBC data sources in Python
MIT License

Explicit connection reuse #92

Open timrburnham opened 7 months ago

timrburnham commented 7 months ago

It would be convenient to be able to re-use specific database connections for explicitly sequential tasks. For instance, declare a temporary table, insert some rows from an Arrow table, and join against that session table in a subsequent select.

Things I have tried unsuccessfully:

I assume there are Rust ownership issues that make this difficult, but it would be amazing for ETL jobs which compute intermediate data sets. Right now I'm creating tables dynamically and dropping them after.

pacman82 commented 7 months ago

Hello @timrburnham ,

It would be convenient to be able to re-use specific database connections for explicitly sequential tasks. For instance, declare a temporary table, insert some rows from an Arrow table, and join against that session table in a subsequent select.

If this is about saving the time needed to reconnect to the database, connection pooling already enables that. This sounds more like you want everything to happen in the same transaction? Also, while declaring temporary tables is currently possible in arrow-odbc because it executes arbitrary SQL statements, the scope and intention of the package is currently only moving data in and out of tables. With my current day job, I am a bit anxious about whether I could tackle the scope of another pyodbc. I am already maintaining the odbc-api bindings for Rust.

So I would need to understand very precisely why you want these commands on the same connection. Maybe there is something here which still fits in the scope of the arrow-odbc Python bindings. It could very well be.

Alternatively you could also try:

I assume there are Rust ownership issues that make this difficult

As I see it, every piece of software has ownership issues. It is just that in Rust the compiler tells you about them.

timrburnham commented 7 months ago

Thanks for the quick analysis! I don't think I need transaction control, just the same session--I really thought it might work with connection pooling. I'm not clear what's being reset.

In this test case, I'm using the DuckDB driver under UnixODBC, since it's simple to set up and use. For reference, my ~/.odbc.ini contains:

[DuckDB]
Driver=DuckDB Driver
Database=:memory:

and my ~/.odbcinst.ini:

[DuckDB Driver]
driver = /home/tim/libduckdb_odbc.so

Here's a non-working case, using arrow-odbc:

import arrow_odbc
import pyarrow as pa

arrow_odbc.enable_odbc_connection_pooling()

# create temp table, DuckDB syntax
sql = """\
create temporary table temp_keys (
  k varchar(128)
)
on commit preserve rows
;"""
arrow_odbc.read_arrow_batches_from_odbc(sql, 'DSN=DuckDB;')

# insert a bunch of rows from Arrow Table
filter = pa.Table.from_pydict({"k": ["key1", "key2", "key3"]})
reader = pa.RecordBatchReader.from_batches(filter.schema, filter.to_batches())
arrow_odbc.insert_into_table(
    reader=reader,
    chunk_size=1000,
    table="temp_keys",
    connection_string='DSN=DuckDB;',
)

sql = """\
select v from
  values ('key1', 1), ('key6', 6) as fake_big_table(k, v)
inner join temp_keys
  using (k)
;"""
results = arrow_odbc.read_arrow_batches_from_odbc(sql, 'DSN=DuckDB;')

Results:

>>> results = arrow_odbc.read_arrow_batches_from_odbc(sql, 'DSN=DuckDB;')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/tim/venv/lib64/python3.11/site-packages/arrow_odbc/reader.py", line 493, in read_arrow_batches_from_odbc
    reader.query(
  File "/home/tim/venv/lib64/python3.11/site-packages/arrow_odbc/reader.py", line 127, in query
    raise_on_error(error)
  File "/home/tim/venv/lib64/python3.11/site-packages/arrow_odbc/error.py", line 30, in raise_on_error
    raise Error(error_out)
arrow_odbc.error.Error: ODBC emitted an error calling 'SQLExecDirect':
State: 42000, Native error: 0, Message: ODBC_DuckDB->PrepareStmt
Catalog Error: Table with name temp_keys does not exist!
Did you mean "pg_views"?
LINE 3: inner join temp_keys
                   ^
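
The error above is consistent with temporary tables being scoped to the session that created them: each arrow-odbc call opens its own connection, and nothing guarantees that the later call lands in the session holding temp_keys. As a minimal sketch of that effect, assuming Python's built-in sqlite3 as a stand-in (this is neither DuckDB nor arrow-odbc, and the helper name temp_table_visibility is made up for illustration), a temporary table created on one connection is invisible to a second connection to the same database:

```python
import os
import sqlite3
import tempfile


def temp_table_visibility(db_path):
    """Create a temp table on one connection and probe it from another."""
    first = sqlite3.connect(db_path)
    second = sqlite3.connect(db_path)
    try:
        first.execute("create temporary table temp_keys (k varchar(128))")
        first.executemany(
            "insert into temp_keys(k) values (?)", [("key1",), ("key2",)]
        )
        # Same session: the temporary table is visible.
        same_session = first.execute(
            "select k from temp_keys order by k"
        ).fetchall()
        # Different session: the temporary table does not exist.
        try:
            second.execute("select k from temp_keys").fetchall()
            other_session_error = None
        except sqlite3.OperationalError as exc:
            other_session_error = str(exc)
        return same_session, other_session_error
    finally:
        first.close()
        second.close()


with tempfile.TemporaryDirectory() as tmp:
    same_session, other_session_error = temp_table_visibility(
        os.path.join(tmp, "demo.db")
    )
```

Whether a pooled ODBC connection keeps or resets session state depends on the driver and driver manager, so this only illustrates the scoping rule, not what DuckDB's ODBC driver does internally.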

And after pip installing the native DuckDB driver, here's a working version where we explicitly reuse the same connection:

import duckdb
import pyarrow as pa

duck = duckdb.connect(':memory:')
sql = """\
create temporary table temp_keys (
  k varchar(128)
)
on commit preserve rows
;"""
duck.sql(sql)

# DuckDB's replacement scan resolves `filter` in the SQL below to this local pyarrow Table
filter = pa.Table.from_pydict({"k": ["key1", "key2", "key3"]})
duck.sql("insert into temp_keys(k) select k from filter;")

sql = """\
select *
from values ('key1', 1), ('key6', 6) as fake_big_table(k, v)
inner join temp_keys
  using (k)
;"""
results = duck.sql(sql).to_arrow_table()

Results:

>>> results
pyarrow.Table
k: string
v: int32
----
k: [["key1"]]
v: [[1]]

The general idea: if I have a million rows in a table and only want a thousand of them, and I already know the keys, it's easy to upload them into a temp table and join against my big table. You can get pretty far binding parameters to where clauses, but joins are better in some cases.

pacman82 commented 7 months ago

Hello @timrburnham ,

Thanks for the detailed response and sorry for the delayed answer. The use case is legit; however, it is quite removed from where arrow-odbc currently is in terms of interface. Your first statement already illustrates this well:

# create temp table, DuckDB syntax
sql = """\
create temporary table temp_keys (
  k varchar(128)
)
on commit preserve rows
;"""
arrow_odbc.read_arrow_batches_from_odbc(sql, 'DSN=DuckDB;')

Although it works, this is of course not the intended use of read_arrow_batches_from_odbc, as you are only interested in the "side effect" of temporary table creation.

I'll give it some thought. However, please note that, at least during the month of May, I won't have any time to act on this, so you will need a workaround / different solution in the meantime.

Best, Markus

timrburnham commented 7 months ago

Thanks very much, Markus! Using permanent tables is fine as a workaround, of course, I'm only thinking of the ergonomics. I really really like using arrow-odbc, thanks for a fantastic package!
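
The permanent-table workaround mentioned here can be sketched as follows. This is an illustration under stated assumptions, not arrow-odbc code: Python's built-in sqlite3 stands in for the ODBC data source, run() opens a fresh connection per call to mimic one connection per arrow-odbc call, and the names run, join_via_scratch_table, big_table and the scratch_keys_ prefix are all hypothetical.

```python
import os
import sqlite3
import tempfile
import uuid
from contextlib import closing


def run(db_path, sql, rows=None):
    """Run one statement on a fresh connection, committing on success."""
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:
            if rows is not None:
                conn.executemany(sql, rows)
                return []
            return conn.execute(sql).fetchall()


def join_via_scratch_table(db_path, keys):
    """Upload keys into a uniquely named permanent table, join, then drop it."""
    table = f"scratch_keys_{uuid.uuid4().hex}"
    run(db_path, f"create table {table} (k varchar(128))")
    try:
        run(db_path, f"insert into {table}(k) values (?)", [(k,) for k in keys])
        return run(
            db_path,
            f"select b.k, b.v from big_table as b "
            f"inner join {table} as s on s.k = b.k order by b.k",
        )
    finally:
        # Drop the scratch table even if the insert or the join fails.
        run(db_path, f"drop table {table}")


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "demo.db")
    run(db_path, "create table big_table (k varchar(128), v integer)")
    run(
        db_path,
        "insert into big_table(k, v) values (?, ?)",
        [("key1", 1), ("key6", 6)],
    )
    joined = join_via_scratch_table(db_path, ["key1", "key2", "key3"])
    leftovers = run(
        db_path,
        "select name from sqlite_master where name like 'scratch_keys_%'",
    )
```

The random suffix keeps concurrent jobs from colliding on the table name, and the finally block is what replaces the automatic cleanup a session-scoped temporary table would provide.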

pacman82 commented 5 months ago

This is mostly a note to myself:

I have not decided yet whether to take this into the scope of the arrow-odbc Python bindings. If so, however, I can see two designs working.

  1. Connections are explicitly instantiated by the user. They have their own representation in Python code and could be passed explicitly in the creation of readers and writers. On the Rust side this would imply storing them in an Arc<Mutex<_>> instead of taking full ownership of them. The Mutex would be required explicitly to ensure error messages end up correctly on the thread that caused them. This design would not only allow using the same connection for multiple statements in succession, but multiple statements might be active at the same time utilizing the same connection. This is supported according to ODBC, but I am fearful this would create hard-to-debug errors with some drivers.

  2. Connections are not represented explicitly in user code. Instead there is only one handle to an object representing all the possible states of arrow-odbc. On the Rust side this would imply unifying the Reader and Writer into a single enum. In addition, we must not close the Connection as soon as the Reader is consumed, but rather would only fall back to the Connected state. On the Python side the user would interact with an object offering all the high-level functionality and returning errors if they are called from an invalid state.
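
The second design can be pictured as a small state machine. The following is only a rough Python sketch of that idea, not the actual arrow-odbc implementation; the names Session, State, InvalidState, start_query and next_batch are hypothetical, and plain strings stand in for Arrow batches.

```python
from enum import Enum, auto


class State(Enum):
    CONNECTED = auto()  # idle, ready for a new statement
    READING = auto()    # a result set is being consumed


class InvalidState(RuntimeError):
    """Raised when an operation is called from the wrong state."""


class Session:
    """Hypothetical single handle covering all states (design 2)."""

    def __init__(self):
        self.state = State.CONNECTED
        self._pending = []

    def start_query(self, batches):
        if self.state is not State.CONNECTED:
            raise InvalidState("a result set is still being read")
        self._pending = list(batches)
        self.state = State.READING

    def next_batch(self):
        if self.state is not State.READING:
            raise InvalidState("no query is active")
        if self._pending:
            return self._pending.pop(0)
        # Reader consumed: fall back to CONNECTED instead of closing.
        self.state = State.CONNECTED
        return None


session = Session()
session.start_query(["batch-1", "batch-2"])
try:
    session.start_query(["other"])
    second_query_rejected = False
except InvalidState:
    second_query_rejected = True
batches = [session.next_batch(), session.next_batch(), session.next_batch()]
state_after_reading = session.state
```

The point of the sketch is the fallback at the end of next_batch: consuming the reader returns the handle to the connected state, so a following statement runs in the same session.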

pacman82 commented 4 months ago

Note to myself:

Going forward we aim to combine designs 1 and 2, i.e. we offer an explicit connection object. However, internally it would be modeled as Arc<Mutex<Option<Connection>>> instead of Arc<Mutex<Connection>>, allowing the reader or writer to take full ownership of the connection. Reusing the connection would be implemented by giving it back to the original Arc pointer.
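
A loose Python analogue of the Arc<Mutex<Option<Connection>>> idea, for readers who do not write Rust. Everything here is hypothetical (FakeConnection, ConnectionSlot, take, give_back, read_batches); it is a sketch of the ownership hand-off under those assumptions, not code from arrow-odbc. A lock-protected slot holds the connection or None, a reader takes the connection out of the slot, and puts it back once it is consumed.

```python
import threading


class FakeConnection:
    """Stand-in for an ODBC connection; returns two fake batches per query."""

    def run(self, sql):
        return [f"{sql} / batch 1", f"{sql} / batch 2"]


class ConnectionSlot:
    """Lock-protected slot that holds the connection or None while it is lent out."""

    def __init__(self, connection):
        self._lock = threading.Lock()
        self._connection = connection

    def take(self):
        with self._lock:
            if self._connection is None:
                raise RuntimeError("connection is in use by a reader or writer")
            connection, self._connection = self._connection, None
            return connection

    def give_back(self, connection):
        with self._lock:
            self._connection = connection

    def is_available(self):
        with self._lock:
            return self._connection is not None


def read_batches(slot, sql):
    """Generator that owns the connection from its first batch until it finishes."""
    connection = slot.take()
    try:
        yield from connection.run(sql)
    finally:
        # Hand the connection back so the next statement reuses the session.
        slot.give_back(connection)


slot = ConnectionSlot(FakeConnection())
reader = read_batches(slot, "select 1")
first_batch = next(reader)
busy_while_reading = not slot.is_available()
remaining = list(reader)
free_after_reading = slot.is_available()
```

Because the slot is empty while a reader is active, a second statement on the same connection fails fast instead of interleaving with the first, which sidesteps the concurrent-statement concern raised for design 1.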

timrburnham commented 4 months ago

This sounds ideal! Would it be practical for the Python Connection object to directly offer an execute() method, for performing DDL or other 0-result statements? As you pointed out above, I was abusing read_arrow_batches_from_odbc() for side-effects, without reading any Arrow data.