simonw / sqlite-utils

Python CLI utility and library for manipulating SQLite databases
https://sqlite-utils.datasette.io
Apache License 2.0

SQLITE_MAX_VARS may be hard-coded too low #147

Open simonwiles opened 4 years ago

simonwiles commented 4 years ago

I came across this while preparing to open an issue and PR against the documentation for batch_size, which is a bit incomplete.

As mentioned in #145, while:

SQLITE_MAX_VARIABLE_NUMBER ... defaults to 999 for SQLite versions prior to 3.32.0 (2020-05-22) or 32766 for SQLite versions after 3.32.0.

it is commonly increased at compile time. Debian-based systems, for example, seem to ship with a version of SQLite compiled with SQLITE_MAX_VARIABLE_NUMBER set to 250,000, and I believe this is the case for Homebrew installations too.

In working to understand what batch_size was actually doing and why, I realized that by setting SQLITE_MAX_VARS in db.py to match the value my SQLite was compiled with (I'm on Debian), I was able to decrease the time it takes to insert_all() my test data set (~128k records across 7 tables) from ~26.5s to ~3.5s. Given that this is about 0.05% of my total dataset, this is time I am keen to save...

Unfortunately, it seems that sqlite3 in the Python standard library doesn't expose the get_limit() C API (even though pysqlite used to), so it's hard to know what value SQLite has been compiled with. Note that this could mean, I suppose, that the limit is less than 999, in which case even hard-coding SQLITE_MAX_VARS to the conservative default might not be adequate (the limit can also be lowered -- but not raised -- at runtime). The best I could come up with is echo "" | sqlite3 -cmd ".limits variable_number" (the .limits command is only available in SQLite >= 3.8.10, released 2015-05-07).
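(As an aside, here is a minimal sketch of reading the limit directly from Python, assuming a newer interpreter -- Python 3.11+ -- where sqlite3.Connection.getlimit() is available:)

import sqlite3

# Assumes Python 3.11+, which exposes the underlying sqlite3_limit() C API
# as Connection.getlimit() / Connection.setlimit()
db = sqlite3.connect(":memory:")
print(db.getlimit(sqlite3.SQLITE_LIMIT_VARIABLE_NUMBER))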

Obviously this couldn't be relied upon in sqlite_utils, but I wonder what your opinion would be about exposing SQLITE_MAX_VARS as a user-configurable parameter (with suitable "here be dragons" warnings)? I'm going to go ahead and monkey-patch it for my purposes in any event, but it seems like it might be worth considering.

simonw commented 4 years ago

+1 to making this something that users can customize. An optional argument to the Database constructor would be a neat way to do this.

I think there's a terrifying way that we could find this value... we could perform a binary search for it! Open up an in-memory connection, try running different bulk inserts against it and catch the exceptions - then adjust and try again.

My hunch is that we could perform just 2 or 3 probes (maybe against carefully selected values) to find the highest value that works. If this process took less than a few ms to run I'd be happy to do it automatically when the class is instantiated (and let users disable that automatic probing by setting a value using the constructor argument).
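A rough sketch of what that opt-in constructor argument could look like (the sqlite_max_vars parameter name and this cut-down Database class are hypothetical, not the current sqlite-utils API):

import sqlite3

SQLITE_MAX_VARS = 999  # the conservative hard-coded default

class Database:
    def __init__(self, filename_or_conn=":memory:", sqlite_max_vars=None):
        # sqlite_max_vars is a hypothetical opt-in override; leaving it as
        # None keeps the conservative default instead of probing automatically
        self.conn = (
            filename_or_conn
            if isinstance(filename_or_conn, sqlite3.Connection)
            else sqlite3.connect(filename_or_conn)
        )
        self.sqlite_max_vars = sqlite_max_vars or SQLITE_MAX_VARS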

simonw commented 3 years ago

On my Mac right now I'm seeing a limit of 500,000:

% sqlite3 -cmd ".limits variable_number"
     variable_number 500000
simonw commented 3 years ago

Tried this experiment (not proper binary search, it only searches downwards):

import sqlite3

db = sqlite3.connect(":memory:")

def tryit(n):
    # Build and execute a statement with n bound parameters
    sql = "select 1 where 1 in ({})".format(", ".join("?" for i in range(n)))
    db.execute(sql, [0 for i in range(n)])

def find_limit(min=0, max=5_000_000):
    # Start at max and halve on each failure until a statement succeeds
    # (min is currently unused)
    value = max
    while True:
        print('Trying', value)
        try:
            tryit(value)
            return value
        except sqlite3.Error:
            value = value // 2

Running find_limit() with those default parameters takes about 1.47s on my laptop:

In [9]: %timeit find_limit()
Trying 5000000
Trying 2500000...
1.47 s ± 28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Interestingly the value it suggested was 156250 - suggesting that the macOS sqlite3 binary with a 500,000 limit isn't the same as whatever my Python is using here.
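For comparison, a proper binary-search version of that probe (just a sketch, not something in the library; whether it is actually faster depends on how expensive the largest probe statements are to build) would pin down the exact limit in roughly log2 of the search range -- about 23 probes here:

import sqlite3

db = sqlite3.connect(":memory:")

def works(n):
    # True if a statement with n bound parameters is accepted
    sql = "select 1 where 1 in ({})".format(", ".join("?" for i in range(n)))
    try:
        db.execute(sql, [0 for i in range(n)])
        return True
    except sqlite3.Error:
        return False

def find_limit_bisect(lo=1, hi=5_000_000):
    # Find the largest n in [lo, hi] that works, assuming works() is
    # monotone: it succeeds up to the limit and fails above it
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if works(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo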

simonw commented 3 years ago

%timeit find_limit(max=1_000_000) took 378ms on my laptop

%timeit find_limit(max=500_000) took 197ms

%timeit find_limit(max=200_000) reported 53ms per loop

%timeit find_limit(max=100_000) reported 26.8ms per loop.

All of these are still slow enough that I'm not comfortable running this search every time the library is imported. Allowing users to opt in to this as a performance enhancement might be better.

simonw commented 3 years ago

I tried changing the hard-coded value from 999 to 156_250 and running sqlite-utils insert against a 500MB CSV file, with these results:

(sqlite-utils) sqlite-utils % time sqlite-utils insert slow-ethos.db ethos ../ethos-datasette/ethos.csv --no-headers
  [###################################-]   99%  00:00:00sqlite-utils insert slow-ethos.db ethos ../ethos-datasette/ethos.csv
44.74s user 7.61s system 92% cpu 56.601 total
# Increased the setting here
(sqlite-utils) sqlite-utils % time sqlite-utils insert fast-ethos.db ethos ../ethos-datasette/ethos.csv --no-headers
  [###################################-]   99%  00:00:00sqlite-utils insert fast-ethos.db ethos ../ethos-datasette/ethos.csv
39.40s user 5.15s system 96% cpu 46.320 total

Not as big a difference as I was expecting.

simonw commented 3 years ago

... but it looks like batch_size is hard-coded to 100, rather than None - which means it's not being calculated using that value:

https://github.com/simonw/sqlite-utils/blob/1f49f32814a942fa076cfe5f504d1621188097ed/sqlite_utils/db.py#L704

And

https://github.com/simonw/sqlite-utils/blob/1f49f32814a942fa076cfe5f504d1621188097ed/sqlite_utils/db.py#L1877

simonw commented 3 years ago

I fiddled around and replaced that line with batch_size = SQLITE_MAX_VARS // num_columns - which evaluated to 10416 for this particular file. That got me this:

   40.71s user 1.81s system 98% cpu 43.081 total

43s is definitely better than 56s, but it's still not as big an improvement as the ~26.5s to ~3.5s described by @simonwiles at the top of this issue. I wonder what I'm missing here.
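For reference, the replacement described above boils down to something like this (a sketch; the 15-column figure is inferred from the numbers above, since 156_250 // 15 == 10_416):

SQLITE_MAX_VARS = 156_250  # the value found by the probe earlier in this thread

def batch_size_for(num_columns):
    # Derive the batch size from the variable limit and the column count,
    # instead of the hard-coded batch_size = 100
    return max(1, SQLITE_MAX_VARS // num_columns)

print(batch_size_for(15))  # 10416 -- roughly 100x the old default of 100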