Open timrburnham opened 7 months ago
Hello @timrburnham ,
> It would be convenient to be able to re-use specific database connections for explicitly sequential tasks. For instance, declare a temporary table, insert some rows from an Arrow table, and join against that session table in a subsequent select.
If it is about saving the time to reconnect to the database, connection pooling does enable this. This sounds more like you want everything to happen in the same transaction? Also, while declaring temporary tables is currently possible in arrow-odbc, because it executes arbitrary SQL statements, the scope and intention of the package is currently only moving data in and out of tables. With my current day job, I am a bit anxious about whether I could tackle the scope of another pyodbc. I am already maintaining the odbc-api bindings for Rust.
So I would need to understand very precisely why you want these commands on the same connection. Maybe there is something here which still fits in the scope of the arrow-odbc Python bindings. It could very well be.
Alternatively you could also try:
- arrow-odbc directly from Rust. You can do everything you want in conjunction with odbc-api, even fancier stuff like reusing Connection and Statement handles.
- turbodbc: it is potentially harder to install and behaves differently in some situations, but it does allow fast bulk fetch into Arrow arrays and it is broader in scope, including explicit connections in Python code.
> I assume there are Rust ownership issues that make this difficult
As I see it every piece of Software has ownership issues. It is just in Rust the compiler tells you about them.
Thanks for the quick analysis! I don't think I need transaction control, just the same session. I really thought it might work with connection pooling. I'm not clear what's being reset.
In this test case, I'm using the DuckDB driver under UnixODBC, since it's simple to set up and use. For reference, my ~/.odbc.ini contains:
[DuckDB]
Driver=DuckDB Driver
Database=:memory:
and my ~/.odbcinst.ini:
[DuckDB Driver]
driver = /home/tim/libduckdb_odbc.so
Here's a non-working case, using arrow-odbc:
import arrow_odbc
import pyarrow as pa

arrow_odbc.enable_odbc_connection_pooling()
# create temp table, DuckDB syntax
sql = """\
create temporary table temp_keys (
k varchar(128)
)
on commit preserve rows
;"""
arrow_odbc.read_arrow_batches_from_odbc(sql, 'DSN=DuckDB;')
# insert a bunch of rows from Arrow Table
filter = pa.Table.from_pydict({"k": ["key1", "key2", "key3"]})
reader = pa.RecordBatchReader.from_batches(filter.schema, filter.to_batches())
arrow_odbc.insert_into_table(
    reader=reader,
    chunk_size=1000,
    table="temp_keys",
    connection_string='DSN=DuckDB;',
)
sql = """\
select v from
values ('key1', 1), ('key6', 6) as fake_big_table(k, v)
inner join temp_keys
using (k)
;"""
results = arrow_odbc.read_arrow_batches_from_odbc(sql, 'DSN=DuckDB;')
Results:
>>> results = arrow_odbc.read_arrow_batches_from_odbc(sql, 'DSN=DuckDB;')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/tim/venv/lib64/python3.11/site-packages/arrow_odbc/reader.py", line 493, in read_arrow_batches_from_odbc
reader.query(
File "/home/tim/venv/lib64/python3.11/site-packages/arrow_odbc/reader.py", line 127, in query
raise_on_error(error)
File "/home/tim/venv/lib64/python3.11/site-packages/arrow_odbc/error.py", line 30, in raise_on_error
raise Error(error_out)
arrow_odbc.error.Error: ODBC emitted an error calling 'SQLExecDirect':
State: 42000, Native error: 0, Message: ODBC_DuckDB->PrepareStmt
Catalog Error: Table with name temp_keys does not exist!
Did you mean "pg_views"?
LINE 3: inner join temp_keys
^
And after pip installing the native DuckDB driver, here's a working version where we explicitly reuse the same connection:
import duckdb
import pyarrow as pa

duck = duckdb.connect(':memory:')
sql = """\
create temporary table temp_keys (
k varchar(128)
)
on commit preserve rows
;"""
duck.sql(sql)
filter = pa.Table.from_pydict({"k": ["key1", "key2", "key3"]})
duck.sql("insert into temp_keys(k) select k from filter;")
sql = """\
select *
from values ('key1', 1), ('key6', 6) as fake_big_table(k, v)
inner join temp_keys
using (k)
;"""
results = duck.sql(sql).to_arrow_table()
Results:
>>> results
pyarrow.Table
k: string
v: int32
----
k: [["key1"]]
v: [[1]]
The general idea is, if I have a million rows in a table, and I only want a thousand of them, if I already know the keys it's easy to upload them into a temp table and join against my big table. You can get pretty far binding parameters to where clauses, but joins are better in some cases.
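To make the session dependence concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in (an assumption for illustration only; it is not DuckDB, ODBC, or arrow-odbc, and the file name demo.db is made up). It shows the same pattern as above: a temporary table is visible to the connection that created it, so the join works there, while a second connection to the same database cannot see it.

```python
import os
import sqlite3
import tempfile


def temp_table_visible_on(conn: sqlite3.Connection) -> bool:
    """Return True if temp_keys can be queried on this connection."""
    try:
        conn.execute("select k from temp_keys").fetchall()
        return True
    except sqlite3.OperationalError:
        # The table does not exist in this session.
        return False


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.db")  # hypothetical scratch database
    first = sqlite3.connect(path)
    second = sqlite3.connect(path)

    # Session-scoped table, created and filled on the first connection only.
    first.execute("create temporary table temp_keys (k text)")
    first.executemany(
        "insert into temp_keys values (?)",
        [("key1",), ("key2",), ("key3",)],
    )
    first.commit()

    # Join against the temporary table on the SAME connection.
    joined = first.execute(
        """
        select v
        from (select 'key1' as k, 1 as v union all select 'key6', 6) as fake_big_table
        inner join temp_keys using (k)
        """
    ).fetchall()

    same_session = temp_table_visible_on(first)
    other_session = temp_table_visible_on(second)

    first.close()
    second.close()
```

Whether a pooled ODBC connection preserves such session state depends on the driver and driver manager, which is why explicitly holding on to one connection is the more predictable route.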
Hello @timrburnham ,
thanks for the detailed response and sorry for the delayed answer. The use case is legit; however, it is quite removed from where arrow-odbc is currently in terms of interface. Your first statement already illustrates this well:
# create temp table, DuckDB syntax
sql = """\
create temporary table temp_keys (
k varchar(128)
)
on commit preserve rows
;"""
arrow_odbc.read_arrow_batches_from_odbc(sql, 'DSN=DuckDB;')
Although it works, this is of course not the intended use of read_arrow_batches_from_odbc, as you are only interested in the "side effect" of temporary table creation.
I'll give it some thought. However, please note that, at least during the month of May, I won't have any time to act on this, so you will need a workaround or a different solution in the meantime.
Best, Markus
Thanks very much, Markus! Using permanent tables is fine as a workaround, of course, I'm only thinking of the ergonomics. I really really like using arrow-odbc, thanks for a fantastic package!
This is mostly a note to myself:
I have not decided yet whether to take this into the scope of the arrow-odbc Python bindings. If so, however, I can see two designs working.
1. Connections are explicitly instantiated by the user. They have their own representation in Python code and could be passed explicitly in the creation of readers and writers. On the Rust side this would imply storing them in an Arc<Mutex<_>> instead of taking full ownership of them. The Mutex would be required explicitly to ensure error messages end up correctly on the thread that caused them. This design would not only allow using the same connection for multiple statements in succession; multiple statements could also be active at the same time on the same connection. This is supported according to ODBC, but I am fearful this would create hard-to-debug errors with some drivers.
2. Connections are not represented explicitly in user code. Instead there is only one handle to an object representing all the possible states of arrow-odbc. On the Rust side this would imply unifying the Reader and Writer into a single enum. In addition, we must not close the Connection as soon as the Reader is consumed, but rather would only fall back to the Connected state. On the Python side the user would interact with an object offering all the high-level functionality and returning errors if methods are called from an invalid state.
odbc-api allowing StatementConnection to own e.g. Arc connection types.
Approach 1: Supporting Arc would likely imply relying more on the native thread safety of the drivers. At least if we want to keep concurrent fetching, we would need to also implement Send for StatementConnection<Arc<_>>. See also: https://docs.rs/odbc-api/latest/odbc_api/struct.StatementConnection.html#impl-Send-for-StatementConnection%3C'c%3E
Approach 2: Schema member can likely be updated on state changes. I dislike that writer and reader would have a "merged" interface.
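As a rough illustration of what approach 2 could feel like from Python, here is a minimal sketch of a single handle that tracks its state and rejects calls made from the wrong one. Everything in it (Session, State, InvalidStateError, and the method names) is hypothetical and not part of arrow-odbc; the "database" is faked with in-memory lists.

```python
from enum import Enum, auto


class State(Enum):
    CONNECTED = auto()  # idle, ready for the next statement
    READING = auto()    # a query result is being fetched
    WRITING = auto()    # a bulk insert is in progress


class InvalidStateError(RuntimeError):
    """Raised when a method is called from a state that does not allow it."""


class Session:
    """Hypothetical single handle; illustrative only, not arrow-odbc API."""

    def __init__(self):
        self.state = State.CONNECTED
        self._pending = []  # stand-in for batches still to be fetched
        self.written = []   # stand-in for batches sent to the database

    def _require(self, expected, action):
        if self.state is not expected:
            raise InvalidStateError(
                f"cannot {action} while {self.state.name}; expected {expected.name}"
            )

    def start_query(self, batches):
        # `batches` fakes the result set a real query would produce.
        self._require(State.CONNECTED, "start a query")
        self._pending = list(batches)
        self.state = State.READING

    def next_batch(self):
        self._require(State.READING, "fetch a batch")
        if self._pending:
            return self._pending.pop(0)
        # Reader consumed: keep the connection and fall back to CONNECTED.
        self.state = State.CONNECTED
        return None

    def start_insert(self):
        self._require(State.CONNECTED, "start an insert")
        self.state = State.WRITING

    def write_batch(self, batch):
        self._require(State.WRITING, "write a batch")
        self.written.append(batch)

    def finish_insert(self):
        self._require(State.WRITING, "finish an insert")
        self.state = State.CONNECTED


session = Session()
session.start_insert()
session.write_batch(["key1", "key2", "key3"])
session.finish_insert()

session.start_query([["key1"]])
try:
    session.start_insert()  # invalid: a read is still in progress
except InvalidStateError as err:
    rejected = str(err)

fetched = []
while (batch := session.next_batch()) is not None:
    fetched.append(batch)
```

The sketch also shows the downside mentioned above: reading and writing methods live on one merged interface, and misuse only surfaces as a runtime error.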
Note to myself:
Going forward we aim to combine approaches 1 and 2, i.e. we offer an explicit connection object. However, internally it would be modeled as Arc<Mutex<Option<Connection>>> instead of Arc<Mutex<Connection>>, allowing the reader or writer to take full ownership of the connection. Reusing the connection would be implemented by giving it back to the original Arc pointer.
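The take-and-give-back idea behind Arc<Mutex<Option<Connection>>> can be sketched in Python, with a lock-guarded optional slot playing the role of the Mutex<Option<_>>. This is only an analogy under assumed names (SharedConnection, FakeConnection, Reader, take, give_back are all hypothetical); it is not how arrow-odbc is implemented.

```python
import threading
from typing import Optional


class FakeConnection:
    """Stand-in for a database connection; it only records statements."""

    def __init__(self):
        self.executed = []


class SharedConnection:
    """Lock-guarded slot that either holds the connection or is empty."""

    def __init__(self, conn: FakeConnection):
        self._lock = threading.Lock()
        self._slot: Optional[FakeConnection] = conn

    def take(self) -> FakeConnection:
        # Move the connection out of the slot, leaving it empty.
        with self._lock:
            if self._slot is None:
                raise RuntimeError("connection is in use by a reader or writer")
            conn, self._slot = self._slot, None
            return conn

    def give_back(self, conn: FakeConnection) -> None:
        # Return the connection so the next reader or writer can take it.
        with self._lock:
            self._slot = conn


class Reader:
    """Owns the connection exclusively until close() hands it back."""

    def __init__(self, shared: SharedConnection, sql: str):
        self._shared = shared
        self._conn = shared.take()
        self._conn.executed.append(sql)

    def close(self) -> None:
        if self._conn is not None:
            self._shared.give_back(self._conn)
            self._conn = None


raw = FakeConnection()
shared = SharedConnection(raw)

first = Reader(shared, "create temporary table temp_keys (k varchar(128))")
try:
    Reader(shared, "select 1")  # slot is empty while `first` is open
    overlapped = True
except RuntimeError:
    overlapped = False
first.close()

# Sequential reuse: the second statement runs on the very same connection.
second = Reader(shared, "select k from temp_keys")
second.close()
```

Because the slot is empty while a reader or writer is active, two statements can never be live on one connection at once, which sidesteps the driver concerns raised for approach 1 while still allowing sequential reuse.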
This sounds ideal! Would it be practical for the Python Connection object to directly offer an execute() method, for performing DDL or other 0-result statements? As you pointed out above, I was abusing read_arrow_batches_from_odbc() for side effects, without reading any Arrow data.
It would be convenient to be able to re-use specific database connections for explicitly sequential tasks. For instance, declare a temporary table, insert some rows from an Arrow table, and join against that session table in a subsequent select.
Things I have tried unsuccessfully:
I assume there are Rust ownership issues that make this difficult, but it would be amazing for ETL jobs which compute intermediate data sets. Right now I'm creating tables dynamically and dropping them after.