avhz opened this issue 2 months ago
Is it the exact same query being run against the DB between R and Python? Since `dbplyr` converts `dplyr` queries to SQL and executes them on the DB, it is likely to be faster than something like `SELECT * FROM foo`, which transfers all data locally.
Thanks for the comment @eitsupi :) Yes it's the same query, getting the full table.
I don't see anything Polars-specific there? It seems that all you're observing is that different Oracle drivers have different performance 🤔
There may be ways to optimise your connection/driver settings, but our only overhead versus executing the query natively on the given connection comes from the final "and then load the results into a DataFrame" step (if not using an Arrow-aware driver).
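On the Arrow-aware point: Polars can hand the whole fetch to ConnectorX via `read_database_uri`, which receives the result set in Arrow format rather than as Python tuples. A minimal sketch, assuming ConnectorX's Oracle support; the URI, credentials, and table name here are all hypothetical placeholders:

```python
# Hypothetical Oracle connection URI; substitute real credentials/host/service.
uri = "oracle://user:password@host:1521/service_name"
query = "SELECT * FROM my_table"  # hypothetical table name


def load_table(uri: str, query: str):
    # Imported lazily so this sketch is importable even without polars installed.
    import polars as pl

    # ConnectorX fetches the result set in Arrow format, so Polars can build
    # the DataFrame without a list[tuple] intermediate.
    return pl.read_database_uri(query, uri, engine="connectorx")
```

Whether this beats `oracledb` here depends on the driver and network, so it is worth benchmarking rather than assuming.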
Hi @alexander-beedie :)
I realized this after creating my issue, so I tried the following:

```python
import time

import polars

# Fetch the raw rows via the oracledb cursor.
start = time.time()
cursor = self._connection_oracle.cursor()
cursor.execute(f"SELECT * FROM {table_name}")
data = cursor.fetchall()
print(f"Got data in {time.time() - start} (seconds)")

# Build the DataFrame from the row tuples.
start = time.time()
names = [desc[0] for desc in cursor.description]
table = polars.DataFrame(data, schema=names, infer_schema_length=None, orient="row")
print(f"Created DataFrame in {time.time() - start} (seconds)")
```
The resulting times were:

- `DataFrame` creation: 9.98 s

So the data fetch itself is relatively quick, albeit in `list[tuple]` form, and creating the `DataFrame` then takes roughly the same time again.
I have not timed the `ROracle` method in a similar fashion (the package does not provide the same interface, and is more an extension of `DBI` from what I can tell). But since both `oracledb` and `ROracle` use the same Oracle client library under the hood, I expect the data fetch time to be very similar between the two. So my assumption is that the creation of the `DataFrame` itself is the bottleneck.
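If row-oriented construction is the bottleneck, one thing worth trying (an assumption on my part, not something profiled against this table): `orient="row"` makes Polars walk the Python `list[tuple]` row by row with per-row type inference. Transposing the rows into per-column lists first, e.g. with `zip`, lets the `DataFrame` be built column-wise from homogeneous lists, which may be cheaper:

```python
# Hypothetical stand-ins for cursor.fetchall() output and cursor.description names.
rows = [(1, "a"), (2, "b"), (3, "c")]
names = ["id", "val"]

# Transpose row tuples into per-column lists.
columns = {name: list(col) for name, col in zip(names, zip(*rows))}
# columns == {"id": [1, 2, 3], "val": ["a", "b", "c"]}

# The dict can then be passed to polars.DataFrame(columns) directly,
# avoiding orient="row" and row-by-row schema inference.
```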
### Checks

### Reproducible example

### Log output

No response

### Issue description
I am trying to query an Oracle DB for a table with ~1.5 million rows. I have tried the four methods shown above, which all take quite some time to complete. The reason I say they are slow is that I query the same table from the same database using R (`ROracle` and `dplyr`), and that takes ~13 seconds (including the connection time itself). I would expect the `oracledb` connection to be the fastest of the Polars/Python methods (and it is), but it still takes ~23 seconds (close to double the R/`dplyr`/`ROracle` time).

The timings are:
- `oracledb`
- `sqlalchemy`
- `ROracle`
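For completeness, the per-method numbers above can be collected with a small helper like the following (a sketch; `fn` stands in for whichever of the four query methods is being measured):

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Usage (hypothetical): df, secs = timed(pl.read_database, query, connection)
```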
Am I doing something stupid?
### Expected behavior

I would expect that using Polars with a Python `oracledb` connection would be at least on par with R (`dplyr` and an `ROracle` connection).

### Installed versions