pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`read_database*` methods are slow (Oracle database). #18738

Open avhz opened 2 months ago

avhz commented 2 months ago

Checks

Reproducible example

```python
## ============================================================================
## IMPORTS
## ============================================================================

import polars
import sqlalchemy
import time
import oracledb
import pydantic
import os

## ============================================================================
## Oracle DB Connector
## ============================================================================

class Oracle(pydantic.BaseModel):
    usr: str = os.environ["ORACLE_USR"]
    pwd: str = os.environ["ORACLE_PWD"]
    host: str = os.environ["ORACLE_HST"]
    port: str = os.environ["ORACLE_PRT"]
    service: str = os.environ["ORACLE_SRV"]

    _uri: str
    _odbc: str
    _connection_oracle: oracledb.Connection
    _connection_sqlalchemy: sqlalchemy.engine.Connection

    def model_post_init(self, __context) -> None:
        self._uri = f"oracle://{self.usr}:{self.pwd}@{self.host}:{self.port}/{self.service}"
        self._odbc = f"DRIVER={{Oracle}};UID={self.usr};PWD={self.pwd};DBQ={self.host}:{self.port}/{self.service}"
        self._connection_oracle = oracledb.connect(dsn=f"{self.host}:{self.port}/{self.service}", user=self.usr, password=self.pwd)
        self._connection_sqlalchemy = sqlalchemy.create_engine(self._uri).connect()

    def get_table_uri(self, table_name: str) -> polars.DataFrame:
        start = time.time()
        table = polars.read_database_uri(
            uri=self._uri,
            query=f"SELECT * FROM {table_name}",
        )
        self.print_time(start)
        return table

    def get_table_odbc(self, table_name: str) -> polars.DataFrame:
        start = time.time()
        table = polars.read_database(
            connection=self._odbc, query=f"SELECT * FROM {table_name}"
        )
        self.print_time(start)
        return table

    def get_table_oracle(self, table_name: str) -> polars.DataFrame:
        start = time.time()
        table = polars.read_database(
            connection=self._connection_oracle,
            query=f"SELECT * FROM {table_name}",
            infer_schema_length=None,
        )
        self.print_time(start)
        return table

    def get_table_sqlalchemy(self, table_name: str) -> polars.DataFrame:
        start = time.time()
        table = polars.read_database(
            connection=self._connection_sqlalchemy,
            query=f"SELECT * FROM {table_name}",
            infer_schema_length=None,
        )
        self.print_time(start)
        return table

    @staticmethod
    def print_time(start: float) -> None:
        print(f"Elapsed time: {time.time() - start} (seconds)")

## ============================================================================
## TESTING
## ============================================================================

oracle = Oracle()

table = "SOME_TABLE"

t1 = oracle.get_table_uri(table)
t2 = oracle.get_table_odbc(table)
t3 = oracle.get_table_oracle(table)
t4 = oracle.get_table_sqlalchemy(table)
```

Log output

No response

Issue description

I am trying to query an Oracle DB for a table with ~1.5 million rows.

I have tried the four methods shown above, which all take quite some time to complete.

The reason I say they are slow is that I query the same table from the same database using R (ROracle and dplyr) and this takes ~13 seconds (including the connection time itself).

I would expect the oracledb connection to be the fastest of the Polars/Python methods (and it is); however, it still takes ~23 seconds, close to double the time R/dplyr/ROracle takes.

The timings are:

| Language | Method     | Time |
|----------|------------|------|
| Python   | URI        | 104s |
| Python   | ODBC       | 273s |
| Python   | oracledb   | 23s  |
| Python   | sqlalchemy | 63s  |
| R        | ROracle    | 13s  |

Am I doing something stupid?

Expected behavior

I would expect using Polars with a Python oracledb connection to be at least on par with R (dplyr with an ROracle connection).

Installed versions

```
--------Version info---------
Polars: 1.7.0
Index type: UInt32
Platform: Linux-5.4.0-190-generic-x86_64-with-glibc2.29
Python: 3.8.10 (default, Jul 29 2024, 17:02:10) [GCC 9.4.0]
```
eitsupi commented 2 months ago

Is the exact same query being run against the DB in both R and Python? Since dbplyr converts dplyr queries to SQL and executes them on the DB, it is likely to be faster than something like `SELECT * FROM foo`, which transfers all the data locally.
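
To make that point concrete, here is a minimal Polars-side sketch of the difference between pulling the whole table and pushing the projection/filter into the SQL (the kind of query dbplyr would generate). The connection details, table and column names are hypothetical.

```python
import oracledb
import polars

# Hypothetical connection details, for illustration only.
conn = oracledb.connect(dsn="host:1521/service", user="usr", password="pwd")

# Transfers every row and column to the client before building the DataFrame.
full = polars.read_database(connection=conn, query="SELECT * FROM SOME_TABLE")

# The projection and filter run on the database server, so far less data
# crosses the network before Polars ever sees it.
narrow = polars.read_database(
    connection=conn,
    query="SELECT COL_A, COL_B FROM SOME_TABLE WHERE COL_A > 100",
)
```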

avhz commented 2 months ago

Thanks for the comment @eitsupi :) Yes, it's the same query, fetching the full table.

alexander-beedie commented 2 months ago

I don't see anything Polars-specific there? It seems that all you're observing is that different Oracle drivers have different performance 🤔

There may be ways to optimise your connection/driver settings, but our only overhead vs executing the query natively on the given connection comes from the final "and then load the results into a DataFrame" step (if not using an Arrow-aware driver).
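
One driver-level setting that may be worth trying (a sketch, not something tested in this thread): python-oracledb exposes `oracledb.defaults.arraysize` and `oracledb.defaults.prefetchrows`, which control how many rows each network round trip fetches; raising them can reduce round trips for large full-table reads. The values and connection details below are illustrative.

```python
import oracledb
import polars

# Raise the fetch batch sizes before any cursor is created; 10_000 is an
# illustrative value, not a recommendation from the Polars docs.
oracledb.defaults.arraysize = 10_000
oracledb.defaults.prefetchrows = 10_000

# Hypothetical connection details.
conn = oracledb.connect(dsn="host:1521/service", user="usr", password="pwd")

df = polars.read_database(
    connection=conn,
    query="SELECT * FROM SOME_TABLE",
    infer_schema_length=None,
)
```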

avhz commented 2 months ago

Hi @alexander-beedie :)

I realized this after creating my issue, so I tried the following:

```python
start = time.time()
cursor = self._connection_oracle.cursor()
cursor.execute(f"SELECT * FROM {table_name}")
data = cursor.fetchall()
print(f"Got data in {time.time() - start} (seconds)")

start = time.time()
names = [desc[0] for desc in cursor.description]
table = polars.DataFrame(data, schema=names, infer_schema_length=None, orient="row")
print(f"Created DataFrame in {time.time() - start} (seconds)")
```

The resulting times were:

So fetching the data itself is relatively quick (albeit as a `list[tuple]`), and then creating the DataFrame takes roughly the same amount of time again.

I have not timed the ROracle method in a similar fashion (the package does not provide the same interface, and is more of an extension of DBI from what I can tell). But since both oracledb and ROracle use the same Oracle client library under the hood, I expect the data fetch time to be very similar between the two.

So my assumption is that the creation of the DataFrame itself is the bottleneck.
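
If that is the case, one way to probe it (a sketch under the assumption that the column dtypes are known up front; `SOME_TABLE`, the schema, and the `conn` name are hypothetical) is to pass an explicit schema so Polars does not have to scan all ~1.5 million Python tuples to infer dtypes, which is what `infer_schema_length=None` forces. Whether this actually closes the gap is not measured in this thread; it only separates the fetch cost from the construction cost.

```python
import time
import polars

cursor = conn.cursor()  # `conn` is an open oracledb connection (hypothetical)
cursor.execute("SELECT * FROM SOME_TABLE")
rows = cursor.fetchall()

# Hypothetical explicit schema mapping column names to Polars dtypes.
schema = {"ID": polars.Int64, "NAME": polars.String, "VALUE": polars.Float64}

start = time.time()
df = polars.DataFrame(rows, schema=schema, orient="row")
print(f"Built DataFrame in {time.time() - start:.1f}s")
```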