Thanks, @mauropagano. We will look into this and get back to you.
Tagging @aosingh who is exploring the space (e.g. he created https://github.com/apache/arrow-nanoarrow/pull/502) and working with us on what python-oracledb could do.
@mauropagano if you have pseudocode of end-to-end uses of this datatype, and how it would plug in to Pandas, SQLAlchemy, etc (or suggestions of what would need to be done to such libraries), please share.
Thanks @cjbj
These are a couple of quick pseudo-code examples that come to mind, but really this feature would save time and memory in almost all use cases where data starts or lands in Oracle and any of the modern Python data processing / data science libraries are in the mix.
The main goal of this ER, in my mind, is to bypass going through Python types just to convert into Arrow immediately afterwards (or vice versa); that round trip takes a significant toll on performance and memory for no real benefit.
Some examples below will look a bit clunky because, as of now, I believe there is not yet a zero-copy way to create a pandas DataFrame from Arrow (1).
(1) lots of ways have already been added in pandas 2 to go from at-rest formats to pandas with an Arrow backend; you just can't yet "map" a DataFrame on top of that Arrow table.
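For context, a rough sketch of the row-based path this ER would bypass (the table name is made up): every fetched value is materialized as a Python object first, and pandas then converts those objects back into columnar arrays.

```python
import pandas as pd

# today's path: rows -> Python objects -> pandas (two conversions, roughly double the memory)
cursor.execute("select * from big_table")  # hypothetical table
cols = [d[0].lower() for d in cursor.description]
df = pd.DataFrame(cursor.fetchall(), columns=cols)
```

A native Arrow fetch would skip the intermediate tuples entirely.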
In batches, via pyarrow.RecordBatch
```python
cursor.execute("....")

# this would return a pyarrow.RecordBatch, could work like fetchmany()
for rb in cursor.fetch_record_batch(n=100_000):
    # the to_pandas() is because of (1), but this is DRASTICALLY faster than tuple/dict -> pandas
    fancy_lib_output = fancy_lib.do_fancy_stuff(rb.to_pandas())
```
All at once, via pyarrow.Table
```python
cursor.execute("....")

# same (1) applies; once to_pandas() can be skipped for some pd.DataFrame(arrow_obj),
# you'd save the temporary 2x memory
df = cursor.fetch_arrow_table().to_pandas(self_destruct=True)
fancy_lib_output = fancy_lib.do_fancy_stuff(df)
```
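As a possible partial workaround for (1), depending on pandas/pyarrow versions, the Arrow table can be turned into an Arrow-backed DataFrame via types_mapper, which avoids most of the copy into NumPy blocks; whether that is close enough to zero-copy for a given workload is worth checking.

```python
import pandas as pd

# assuming `tbl` is the pyarrow.Table returned by the proposed fetch_arrow_table()
df = tbl.to_pandas(types_mapper=pd.ArrowDtype)  # Arrow-backed dtypes instead of NumPy blocks
```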
In batches, from a Parquet file
```python
import pyarrow.parquet

pf = pyarrow.parquet.ParquetFile(f)

# there are a lot of ways to do this; reading a single row group at a time
# (vs iter_batches()) just to minimize waste
for rg_num in range(pf.num_row_groups):
    # executemany() maybe? turbodbc calls it executemanycolumns()
    cursor.execute("insert into ...", pf.read_row_group(rg_num))
```
All at once, via pandas.read_csv(..., dtype_backend="pyarrow")
(somewhat an example of (1) above, though this one is quite silly; if the goal is just to load a CSV there are better ways)
```python
import pandas as pd

df = pd.read_csv(f, dtype_backend="pyarrow")
cursor.execute("insert into ...", df)
```
From other libraries with an Arrow backend, e.g. polars
```python
# df here is a polars.DataFrame
for df_slice in df.iter_slices(n_rows=100_000):
    cursor.execute("insert into ...", df_slice.to_arrow())  # to_arrow() is mostly zero-copy
```
Hope this helps; if there is anything else I can provide, just let me know.
Note: Arrow is not the only columnar format for storing big tables. There is also Apache Parquet, but Oracle already supports reading and writing Parquet files on the DB side, see https://blogs.oracle.com/datawarehousing/post/parquet-files-oracle-database.
I think what would generally be more useful is a SQL*Plus (and/or SQLcl) command to upload or download files as BLOB data between the client and the DB.
It's hard to deny that Arrow has become the de-facto standard for in-memory representation of tabular data. Multiple competing products (Snowflake, BigQuery, etc.) as well as libraries (ADBC, turbodbc) enable extracting data as Arrow, unlocking better performance, lower memory usage and increased consistency / interoperability between platforms.
In the Python space, there are a number of libraries, especially around data processing, data science and ML (pandas, polars, duckdb, etc.), that work on Arrow data natively, normally via zero-copy.
It would be extremely beneficial to have python-oracledb be able to offer:

- cursor.fetch_as_arrow(), which could return either an Arrow Table or RecordBatch from the query. This method could bypass the Python representation, speed up data extraction and ultimately keep Oracle closer to where some processing occurs.
- cursor.executemany("...", arrow_object), which could skip the Python representation, use less memory and ultimately entice users to rely more on Oracle, both for the processing that works better in-db and for storing data produced elsewhere.
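For prior art on what such an API could look like, the ADBC DB-API bindings already expose this pattern; a rough sketch, assuming the adbc_driver_sqlite package is installed (any ADBC driver exposes the same surface, connection details would differ for an Oracle driver):

```python
from adbc_driver_sqlite import dbapi  # example driver; defaults to an in-memory database

with dbapi.connect() as conn:
    with conn.cursor() as cur:
        cur.execute("select 1 as x")
        tbl = cur.fetch_arrow_table()                      # results as a pyarrow.Table, no Python rows
        cur.adbc_ingest("copy_of_x", tbl, mode="create")   # bulk-load an Arrow table back into the DB
    conn.commit()
```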