oracle / python-oracledb

Python driver for Oracle Database conforming to the Python DB API 2.0 specification. This is the renamed, new major release of cx_Oracle
https://oracle.github.io/python-oracledb

Add support for fetching and ingesting Arrow Table/RecordBatch data #375

Open mauropagano opened 1 month ago

mauropagano commented 1 month ago

It's hard to deny that Arrow has become the de-facto standard for in-memory representation of tabular data. Multiple competing products (Snowflake, BigQuery, etc.) as well as libraries (ADBC, TurboODBC) enable extracting data as Arrow, unlocking better performance, lower memory usage and increased consistency / interoperability between platforms.

In the Python space, there are a number of libraries, especially around data processing, data science and ML (pandas, polars, duckdb, etc.), that work on Arrow data natively, normally via zero-copy.
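
As a quick illustration of that native interoperability (a minimal sketch, assuming pyarrow, polars and duckdb are installed; the table contents are made up), the same Arrow table can be handed around without round-tripping through Python objects:

import pyarrow as pa
import polars as pl
import duckdb

# an Arrow table, e.g. as it could come back from a driver
tbl = pa.table({"id": [1, 2, 3], "amount": [10.5, 20.0, 7.25]})

# polars wraps the Arrow buffers, mostly zero-copy
pl_df = pl.from_arrow(tbl)

# duckdb can scan the in-memory Arrow table directly (replacement scan on the local variable)
total = duckdb.sql("select sum(amount) from tbl").fetchone()[0]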

It would be extremely beneficial to have python-oracledb be able to:

  1. Extract data as Arrow, via something like cursor.fetch_as_arrow() that returns either an Arrow Table or a RecordBatch for the query. This method could bypass the Python representation, speed up data extraction and ultimately keep Oracle closer to where some of the processing occurs (a sketch of the current, non-Arrow path follows this list).
  2. In the opposite direction, ingest an Arrow Table/RecordBatch into the database: something like cursor.executemany("...", arrow_object) could skip the Python representation, use less memory and ultimately entice users to rely more on Oracle, whether for processing that works better in-db or for storing data produced elsewhere.
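
For contrast, a minimal sketch of the fetch side as it has to be written today with the existing API (connection details and table name are placeholders); every value is first materialized as a Python object and then copied again into the DataFrame:

import oracledb
import pandas as pd

connection = oracledb.connect(user="scott", password="tiger", dsn="localhost/orclpdb1")  # placeholder credentials
cursor = connection.cursor()
cursor.execute("select id, amount from sales")  # hypothetical table

# every cell becomes a Python object here...
rows = cursor.fetchall()
columns = [d[0] for d in cursor.description]

# ...and gets copied again while building the DataFrame
df = pd.DataFrame(rows, columns=columns)
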
anthony-tuininga commented 1 month ago

Thanks, @mauropagano. We will look into this and get back to you.

cjbj commented 1 month ago

Tagging @aosingh who is exploring the space (e.g. he created https://github.com/apache/arrow-nanoarrow/pull/502) and working with us on what python-oracledb could do.

@mauropagano if you have pseudocode of end-to-end uses of this datatype, and how it would plug into Pandas, SQLAlchemy etc (or suggestions of what would need to be done to such libraries), please share.

mauropagano commented 1 month ago

Thanks @cjbj

These are a couple of quick pseudo-code examples that come to mind, but really this feature would save time / memory in almost all use cases where data starts or lands in Oracle and any of the modern Python data processing/science libraries are in the mix.

The main goal of this ER, in my mind, is to avoid going through Python types just to immediately convert into Arrow (or vice versa); that round trip takes a significant toll on performance and memory for no real benefit.

Some examples below will look a bit clunky because, as of now, there is not yet a zero-copy way to create a pandas DataFrame from Arrow (1), I believe.

(1) Lots of ways have already been added in pandas 2 to go from at-rest formats to pandas with an Arrow backend; you just can't yet "map" a DataFrame on top of an existing in-memory Arrow table.
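
To make (1) concrete, a minimal sketch of the at-rest paths pandas 2 already offers (assuming pandas >= 2.0 with pyarrow installed; file names are placeholders); what is still missing is a constructor that simply wraps an in-memory Arrow table without copying:

import pandas as pd

# at-rest formats can already land in Arrow-backed dtypes
df_pq = pd.read_parquet("sales.parquet", dtype_backend="pyarrow")
df_csv = pd.read_csv("sales.csv", dtype_backend="pyarrow")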

Extracting data from Oracle to pandas because you need to apply fancy lib X that requires pandas

In batches, via pyarrow.RecordBatch

cursor.execute("....")
 # this would return a pyarrow.RecordBatch, could work like fetchmany()
for rb in cursor.fetch_record_batch(n=100_000):
     # the to_pandas() is because of (1) , but this is DRASTICALLY faster than tuple/dict -> pandas
     fancy_lib_output = fancy_lib.do_fancy_stuff(rb.to_pandas()) 

All at once, via pyarrow.Table

cursor.execute("....")
# same (1) applies: once to_pandas() can be skipped in favor of something like pd.DataFrame(arrow_obj), you'd save the temporary 2x memory
df = cursor.fetch_arrow_table().to_pandas(self_destruct=True)
fancy_lib_output = fancy_lib.do_fancy_stuff(df) 

Ingesting data into Oracle

In batches, from a Parquet file

import pyarrow.parquet

pf = pyarrow.parquet.ParquetFile(f)
# there are a lot of ways to do this; reading a single row_group at a time (vs iter_batches()) just to minimize waste
for rg_num in range(pf.num_row_groups):
    cursor.execute("insert into ...", pf.read_row_group(rg_num))  # executemany maybe? turboodbc calls it executemanycolumns()

All at once, via pandas.read_csv(..., dtype_backend="pyarrow") (somewhat an example of (1) above, though this is quite silly: if the goal is just to load a CSV there are better ways)

df = pd.read_csv(f, dtype_backend="pyarrow") 
cursor.execute("insert into ...", df)

From other libraries with arrow backend, e.g. polars

for df_slice in df.iter_slices(n_rows=100_000):  # df here is polars.DataFrame
    cursor.execute("insert into ...", df_slice.to_arrow()) # to_arrow() is mostly zero-copy 

Hope this helps; if there's anything else I can provide, just let me know.

hvbtup commented 3 weeks ago

Note: Arrow is not the only format for storing big tables in a columnar format. There is also the Apache Parquet format, but Oracle already supports reading and writing parquet files on the DB side, see https://blogs.oracle.com/datawarehousing/post/parquet-files-oracle-database.
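
For reference, a minimal sketch of the DB-side Parquet path described in that post (assuming an environment where DBMS_CLOUD is available, e.g. Autonomous Database; the credential name and object-store URI are placeholders), driven from python-oracledb:

cursor.execute("""
    begin
        dbms_cloud.create_external_table(
            table_name      => 'SALES_PARQUET_EXT',
            credential_name => 'OBJ_STORE_CRED',
            file_uri_list   => 'https://objectstorage.example-region.oraclecloud.com/n/ns/b/bucket/o/sales.parquet',
            format          => '{"type": "parquet", "schema": "first"}'
        );
    end;""")
# the Parquet data can then be queried in place
cursor.execute("select count(*) from SALES_PARQUET_EXT")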

I think what would generally be more useful is a SQL*Plus (and/or SQLcl) command to upload or download files as BLOB data between the client and the DB.
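
Until such a command exists, a minimal sketch of how a file can already be uploaded as BLOB data from a client with python-oracledb (table and column names are hypothetical; suitable for files small enough to read into memory):

with open("report.pdf", "rb") as f:
    payload = f.read()
cursor.execute(
    "insert into file_store (name, payload) values (:1, :2)",  # hypothetical table
    ["report.pdf", payload],
)
cursor.connection.commit()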