python / cpython

The Python programming language
https://www.python.org/
Other
60.06k stars 29.09k forks source link

sqlite3 - 35,000x slower create table when table name begins with a number #118448

Closed MLpranav closed 2 weeks ago

MLpranav commented 3 weeks ago

Bug report

Bug description:

import sqlite3, datetime

sqlite3.enable_callback_tracebacks(True)

conn = sqlite3.connect('test.db')
conn.set_trace_callback(print)
conn.execute("PRAGMA journal_mode=DELETE")

print(datetime.datetime.now())
conn.execute('CREATE TABLE "t09" ("time" TEXT);') # this takes ~2 milliseconds
print(datetime.datetime.now())

print(datetime.datetime.now())
conn.execute('CREATE TABLE "09" ("time" TEXT);') # this takes 2/70 seconds on a 1/15 GB database
print(datetime.datetime.now())

Output when test.db is ~15 GB with ~3500 tables with 100k rows each:

PRAGMA journal_mode=DELETE
2024-05-01 01:04:34.281414
CREATE TABLE "t09" ("time" TEXT);
2024-05-01 01:04:34.282706
2024-05-01 01:04:34.282715
CREATE TABLE "09" ("time" TEXT);
2024-05-01 01:05:43.068583

Output when test.db is ~1 GB with ~1000 tables with 100k rows each:

PRAGMA journal_mode=DELETE
2024-05-01 01:09:44.522383
CREATE TABLE "t09" ("time" TEXT);
2024-05-01 01:09:44.546076
2024-05-01 01:09:44.546085
CREATE TABLE "09" ("time" TEXT);
2024-05-01 01:09:46.693444

sqlite3 on macOS is taking 1000-35000x more time to create a new table when the table name begins with a number.

This issue is not present on Windows using the exact same code and versions.

This issue is not present in the sqlite CLI on the same macOS machine, so it is a sqlite3 module issue.

The delay is a lot more prominent on large databases, but it exists on small databases as well.

CPython versions tested on:

3.11

Operating systems tested on:

macOS, Windows

erlend-aasland commented 3 weeks ago

sqlite3 on macOS is taking 1000-35000x more time to create a new table when the table name begins with a number.

The sqlite3 extension module does not create the table; it only passes the query as it is on to the SQLite library, which then executes the query. As such, I don't see how this could be a problem with the sqlite3 extension module at all.

erlend-aasland commented 3 weeks ago

Since the sqlite3 extension module does not interfere with the query that is being passed to the SQLite library, there is nothing we can do to mitigate your problem. Suggesting to close this as not-a-bug.

zware commented 3 weeks ago

It's possible that this is a bug in how we built sqlite3 for macOS, but we'll need rather more information (how Python was installed, how libsqlite3 was sourced, exact versions of each, etc.) before that could be confirmed, much less fixed.

h4zielh commented 3 weeks ago

I think that's because sqlite3 needs to check if the number preceding the text is part of a numerical expression, a reserved keyword, or an error and decide what is true, and to do that it needs to perform an exhausting verification which consumes more resources from the machine, that's why in various languages you can't declare a variable name beggining with a number (the majority of them will just prompt an error, python by example, returns a SyntaxError exception) and given that your using a MacOs, the operational system might be more strict about allowing resources to processes or might have a different escalation system so it just adds with the syntax check. In general, i strongly recommend you using number after letters, not only because of syntax check, but also, for readbility. If you really need using numbers before letters in this case, then i'm sorry, i don't know what to recommend, maybe reinstalling sqlite3, or formatting the whole OS, but i don't think these will solve the issue.

MLpranav commented 2 weeks ago

The sqlite3 extension module does not create the table; it only passes the query as it is on to the SQLite library

This delay is not present in the official sqlite shell. So it is highly probable that the issue is on Python's end.

Some preprocessing is being done in the extension module before sending the queries to the official sqlite library, this might be the culprit.

I am trying to build Python with a modified pysqlite_cursor_execute_impl function located in cpython/Modules/_sqlite/cursor.c to print timestamps every few lines so that I can isolate the exact part of code causing the delay.

In general, i strongly recommend you using number after letters, not only because of syntax check, but also, for readbility.

I am using YYYYMMDD table names to store time series data. Using a letter before the numbers here might not be ideal.


Thanks for the responses. I hope that we can get to the bottom of this soon.

ronaldoussoren commented 2 weeks ago

The sqlite3 extension module does not create the table; it only passes the query as it is on to the SQLite library

This delay is not present in the official sqlite shell. So it is highly probable that the issue is on Python's end.

How did you install Python (homebrew, python.org installer, Apple's devtools, ...)? Did you check that the official sqlite3 shell is linked to the same sqlite library as the _sqlite3 extension module?

Anyway, the code below ends up using a very small amount of time for the first table (0.000085 seconds) and a significant amount of time for the second table (1.397151 seconds) when using Python 3.12 installed using the python.org installer.


import sqlite3, datetime, time, contextlib

@contextlib.contextmanager
def timed():
    before = time.time()
    yield
    after = time.time()
    print("took %fs"%(after-before))

sqlite3.enable_callback_tracebacks(True)

conn = sqlite3.connect('test.db')

for i in range(200):
    print(f"TAB {i}")
    conn.execute(f'CREATE TABLE tab{i} ("value" INT)')
    for j in range(100_000):
        conn.execute(f"INSERT INTO tab{i} (value) VALUES ({j})")

conn.set_trace_callback(print)
conn.execute("PRAGMA journal_mode=DELETE")

with timed():
    conn.execute('CREATE TABLE "t09" ("time" TEXT);')

with timed():
    conn.execute('CREATE TABLE "09" ("time" TEXT);')
MLpranav commented 2 weeks ago

I am using Python 3.11.6 installed via homebrew.

Did you check that the official sqlite3 shell is linked to the same sqlite library as the _sqlite3 extension module?

Most probably they both include their source for the same library and the exact version may not be the same. Will look into this.

erlend-aasland commented 2 weeks ago

Thanks for the repro, Ronald. Some observations:

Some preprocessing is being done in the extension module before sending the queries to the official sqlite library, this might be the culprit.

No preprocessing is done; we only detect the type of query. The code is here:

https://github.com/python/cpython/blob/21c09d9f8195433f34b72ddfa25dd1bda3019ed7/Modules/_sqlite/statement.c#L73-L82

erlend-aasland commented 2 weeks ago

FTR, I'm trying to reproduce using the SQLite CLI. It takes some time though, because as I already know, using a pre-created database will not work; you'll have to first create an SQL script that creates the database, then .read that into the SQLite shell, and then execute the "offending queries", all in the same CLI session.

ronaldoussoren commented 2 weeks ago

I get this:

% time /path/to/bin/sqlite3 test.db 'CREATE TABLE "011" ("time" TEXT);'
/path/to/bin/sqlite3 test.db   0.89s user 0.07s system 99% cpu 0.970 total

% time /path/to/cpython-dev/build_deps/bin/sqlite3 test.db 'CREATE TABLE "t011" ("time" TEXT);'
/path/to/build_deps/bin/sqlite3 test.db   0.00s user 0.00s system 68% cpu 0.010 total

This is with the "test.db" created by my reproducer, using sqlite 3.44.2 build in a similar way as used by the CPython installer. /usr/bin/sqlite3 is fast for both variants (3.43.2 as shipped on macOS 14.4.1).

ronaldoussoren commented 2 weeks ago

A possibly relevant news entry for SQLite 3.44.0 (item 6 in their list:

Many errors associated with CREATE TABLE are now raised when the CREATE TABLE statement itself is run, rather than being deferred until the first time the table is actually used.

erlend-aasland commented 2 weeks ago

This is with the "test.db" created by my reproducer, using sqlite 3.44.2 build in a similar way as used by the CPython installer. /usr/bin/sqlite3 is fast for both variants (3.43.2 as shipped on macOS 14.4.1).

Yeah, but see https://github.com/python/cpython/issues/118448#issuecomment-2088188643:

[...] if you pre-create the database and only perform the "offending queries" you will not be able to observe the slowdown.

MLpranav commented 2 weeks ago

[...] if you pre-create the database and only perform the "offending queries" you will not be able to observe the slowdown.

I have noticed the same using the repro code given above.

But in my initial scenario, connecting to an existing database also has the same delay.

Maybe the test database has to be large enough for the slowdown to occur in a separate session.

erlend-aasland commented 2 weeks ago

A possibly relevant news entry for SQLite 3.44.0 (item 6 in their list:

Many errors associated with CREATE TABLE are now raised when the CREATE TABLE statement itself is run, rather than being deferred until the first time the table is actually used.

I tried with both 3.44.0 and 3.45.3, but it has no effect; I still observe the slowdown (if the database is created in the same session).

erlend-aasland commented 2 weeks ago

But in my initial scenario, connecting to an existing database also has the same delay.

Maybe the test database has to be large enough for the slowdown to occur in a separate session.

Please try to reproduce this so we're 100% sure about this observation.

erlend-aasland commented 2 weeks ago

Please try to reproduce this so we're 100% sure about this observation.

No need, I reproduced it myself.

MLpranav commented 2 weeks ago

sqlite3 -> v3.43.2 preinstalled on macOS 14.4.1. sqlite_new -> alias for v3.45.3 installed from homebrew.

time sqlite3 test.db 'CREATE TABLE "111" ("time" TEXT);'
time sqlite3 test.db 'CREATE TABLE "a111" ("time" TEXT);'

time sqlite_new test.db 'CREATE TABLE "211" ("time" TEXT);'
time sqlite_new test.db 'CREATE TABLE "a211" ("time" TEXT);'
sqlite3 test.db 'CREATE TABLE "111" ("time" TEXT);'  0.00s user 0.01s system 48% cpu 0.029 total
sqlite3 test.db 'CREATE TABLE "a111" ("time" TEXT);'  0.00s user 0.00s system 60% cpu 0.009 total

/opt/homebrew/opt/sqlite/bin/sqlite3 test.db   0.42s user 0.43s system 45% cpu 1.874 total
/opt/homebrew/opt/sqlite/bin/sqlite3 test.db   0.00s user 0.00s system 69% cpu 0.007 total

Seems like that it is confirmed that the problem lies with sqlite3 itself and not Python.

Should I close this issue and raise it with the sqlite3 team?

MLpranav commented 2 weeks ago
time sqlite3 testbig.db 'CREATE TABLE "211" ("time" TEXT);'
time sqlite_new testbig.db 'CREATE TABLE "212" ("time" TEXT);'

sqlite3 testbig.db 'CREATE TABLE "211" ("time" TEXT);'  0.02s user 0.01s system 54% cpu 0.049 total
/opt/homebrew/opt/sqlite/bin/sqlite3 testbig.db   15.41s user 13.68s system 44% cpu 1:05.64 total

The delay is a lot more pronounced on a large, ~15 GB database.

erlend-aasland commented 2 weeks ago

The slowdown can be pinpointed to the sqlite3_step C API call. This implies:

Seems like that it is confirmed that the problem lies with sqlite3 itself and not Python.

I assume you mean SQLite, and not sqlite3.

Should I close this issue and raise it with the sqlite3 team?

Sounds good to me. Note that they require accurate, deterministic reproducers for any kind of bug report. Good luck!

sgbeal commented 2 weeks ago

Just FYI: the cause of the problem has been found upstream and a fix is under investigation, In short: on 2023-10-13, internals were added which run pragma integrity_check($TABLENAME) on each new table. When $TABLENAME looks like a number, it is interpreted as such and causes integrity_check() to check the whole db rather than just the new table. Thus, the larger the db, the more prominent the delay is. The forum post:

https://sqlite.org/forum/forumpost/4cf69794d9dfff7c

erlend-aasland commented 2 weeks ago

Thanks a lot for the heads-up, Stephan!