oracle / python-oracledb

Python driver for Oracle Database conforming to the Python DB API 2.0 specification. This is the renamed, new major release of cx_Oracle
https://oracle.github.io/python-oracledb

Poor performance when running a large statement. sqlalchemy + oracledb #172

Closed kryvokhyzha closed 1 year ago

kryvokhyzha commented 1 year ago

Hi, I have the following code:

import pandas as pd
from sqlalchemy.orm.session import Session

def compute(engine: Session) -> pd.DataFrame:
        # some steps
        # ....
        df = engine.query(...)

        with engine.get_bind().connect() as conn:
                df = pd.read_sql(sql=df.statement, con=conn)

        return df
  1. If I compile the SQL from the df variable and run it directly in DBeaver, the statement completes in 17-20 seconds, which is fine. The compiled SQL contains 347k characters.
  2. If I run the same query through pd.read_sql, the code takes about 16 minutes. I noticed that the delay happens in the _prepare method (see trace below).
    File "/opt/homebrew/Caskroom/miniforge/base/envs/venv/lib/python3.10/site-packages/oracledb/cursor.py", line 137, in _prepare
    self._impl.prepare(statement, tag, cache_statement)
    File "src/oracledb/impl/thin/cursor.pyx", line 213, in oracledb.thin_impl.ThinCursorImpl.prepare
    File "src/oracledb/impl/thin/connection.pyx", line 235, in oracledb.thin_impl.ThinConnImpl._get_statement
    File "src/oracledb/impl/thin/connection.pyx", line 239, in oracledb.thin_impl.ThinConnImpl._get_statement
    File "src/oracledb/impl/thin/statement.pyx", line 184, in oracledb.thin_impl.Statement._prepare
    File "/opt/homebrew/Caskroom/miniforge/base/envs/venv/lib/python3.10/re.py", line 209, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  3. At the same time, I can run simpler scripts via pd.read_sql and they execute normally.
  4. I also build the same statement as in the df variable, but for PostgreSQL, and it runs in about 35 seconds.
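Hotspots like the _prepare call above can be confirmed from Python itself before digging into the driver. A minimal, hypothetical sketch using the stdlib profiler; the regex substitution here is only a stand-in for the slow call, and all names are illustrative:

```python
import cProfile
import io
import pstats
import re

def run_query() -> str:
    # Stand-in for the slow pd.read_sql call: exercise a regex substitution
    # over a large generated SQL string.
    sql = "SELECT col FROM t -- comment\n" * 1000
    return re.sub(r"--.*(\n|$)", "\n", sql)

profiler = cProfile.Profile()
profiler.enable()
run_query()
profiler.disable()

# Print the five functions with the highest cumulative time; a slow
# prepare/parse step shows up at the top of this list.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Running the real reproduction under cProfile in the same way is what surfaces the `re.sub` call inside `oracledb.thin_impl.Statement._prepare`.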

Issue

  1. Is this expected behavior when working with large scripts?
  2. Is there any way I can bypass this prepare step?

requirements.txt

# Apple M1 macOS Ventura 13.0
# python 3.10
SQLAlchemy==2.0.9
numpy==1.23.5
pandas==1.5.2
oracledb==1.3.0

Thank you!

cjbj commented 1 year ago

Triple check you are connecting to the same DB in both scenarios!

Can you add a call to init_oracle_client() (to run in Thick mode), set the environment variable export DPI_DEBUG_LEVEL=16 (see here) before starting, and then check that the generated SQL statement being run is the same as you are running directly? Make sure you are using the same bind variables, if any, in both scenarios.

I don't know what has changed in Pandas 2, but it may be worth checking that version too.
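For reference, the tracing setup suggested above looks like this (the script name is a placeholder; Thick mode additionally requires Oracle Client libraries and a call to oracledb.init_oracle_client() before connecting):

```shell
# ODPI-C debug tracing is only emitted in Thick mode.
export DPI_DEBUG_LEVEL=16   # log the SQL text of prepared statements
python reproduce.py         # placeholder for the reproduction script
```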

kryvokhyzha commented 1 year ago

> Triple check you are connecting to the same DB in both scenarios!
>
> Can you add a call to init_oracle_client() (to run in Thick mode), set the environment variable export DPI_DEBUG_LEVEL=16 (see here) before starting, and then check that the generated SQL statement being run is the same as you are running directly? Make sure you are using the same bind variables, if any, in both scenarios.
>
> I don't know what has changed in Pandas 2, but it may be worth checking that version too.

  1. The script launched directly in DBeaver and the one run through sqlalchemy+oracledb execute against the same DB as the same user; I checked this.
  2. The results of the script when run directly and via `sqlalchemy+oracledb` are the same; only the execution time differs.
  3. These results also match a similar script on PostgreSQL, which is likewise generated via sqlalchemy.
  4. Unfortunately, I cannot test in Thick mode, because my machine does not have an Oracle Client installed.
  5. I tested my code with pandas==2.0.0; nothing changed.
  6. I decided to check whether the problem really lies in the regular expressions used in the oracledb.thin_impl.Statement._prepare method (src/oracledb/impl/thin/statement.pyx). I got the following results:
    
    import re
    import time

    sql = """ -- large sql script """

    print(len(sql))  # 351449

    start = time.time()
    sql = re.sub(r"/\*[\S\n ]+?\*/", "", sql)
    print(time.time() - start)  # 0.0004279613494873047

    start = time.time()
    sql = re.sub(r"--.*(\n|$)", "", sql)
    print(time.time() - start)  # 0.0004191398620605469

    start = time.time()
    sql = re.sub(r"'[^']*'(?=(?:[^']*[^']*')*[^']*$)", "", sql, flags=re.MULTILINE)
    print(time.time() - start)  # 3571.5018050670624

    start = time.time()
    sql = re.sub(r'(:\s*)?("([^"]*)")', lambda m: m.group(0) if sql[m.start(0)] == ":" else "", sql)
    print(time.time() - start)  # 0.007905006408691406

  7. So the problem is precisely the third regular expression, which takes an extremely long time to execute.
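For context, that third pattern is prone to catastrophic backtracking: its nested quantifiers (a repeated group that itself contains `[^']*` terms) let the regex engine split the same text in exponentially many ways before failing. A robust alternative to such a regex is a single character-level scan. Purely as an illustration (this is a sketch, not the actual patch that landed in oracledb), stripping single-quoted SQL literals in one linear pass can look like this:

```python
def strip_single_quoted(sql: str) -> str:
    """Remove single-quoted literals in one left-to-right O(n) pass."""
    out = []
    i, n = 0, len(sql)
    while i < n:
        if sql[i] == "'":
            # Skip to the closing quote, honoring '' escapes inside literals.
            i += 1
            while i < n:
                if sql[i] == "'":
                    if i + 1 < n and sql[i + 1] == "'":
                        i += 2  # escaped quote, still inside the literal
                        continue
                    i += 1  # closing quote
                    break
                i += 1
        else:
            out.append(sql[i])
            i += 1
    return "".join(out)

print(strip_single_quoted("SELECT 'a''b', col FROM t WHERE x = 'y'"))
```

Unlike the backtracking lookahead, the runtime here is linear in the statement length, so a 350k-character script is processed in milliseconds.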
anthony-tuininga commented 1 year ago

That's helpful. Are you able to share the large SQL script? You can e-mail it to me if you prefer (anthony.tuininga@gmail.com). Some adjustments to the regular expressions are planned to avoid these problems!

anthony-tuininga commented 1 year ago

I have pushed a patch that should correct this issue. If you are able to build from source you can verify that it corrects your issue as well.

kryvokhyzha commented 1 year ago

> I have pushed a patch that should correct this issue. If you are able to build from source you can verify that it corrects your issue as well.

I have tested the patch that you pushed. The code now runs much faster. Thank you!

anthony-tuininga commented 1 year ago

This has been included in version 1.3.1 which was just released!
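For anyone hitting the same slowdown, the fix is picked up by upgrading the driver:

```shell
pip install --upgrade "oracledb>=1.3.1"
```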