simonwiles opened this issue 4 years ago
+1 to making this something that users can customize. An optional argument to the Database
constructor would be a neat way to do this.
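Purely as a sketch of the idea (the `sqlite_max_vars` keyword and the wrapper class below are invented for illustration and are not part of the current API; it just pokes the existing module-level constant in `db.py`):

```python
import sqlite_utils
import sqlite_utils.db


class ConfigurableDatabase(sqlite_utils.Database):
    """Toy wrapper: accept the variable limit as a constructor argument."""

    def __init__(self, *args, sqlite_max_vars=None, **kwargs):
        super().__init__(*args, **kwargs)
        if sqlite_max_vars is not None:
            # Note: this overrides a module-level constant, so it affects
            # every Database instance in the process, not just this one.
            sqlite_utils.db.SQLITE_MAX_VARS = sqlite_max_vars


db = ConfigurableDatabase("data.db", sqlite_max_vars=250_000)
```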
I think there's a terrifying way that we could find this value... we could perform a binary search for it! Open up a memory connection and try running different bulk inserts against it and catch the exceptions - then adjust and try again.
My hunch is that we could perform just 2 or 3 probes (maybe against carefully selected values) to find the highest value that works. If this process took less than a few ms to run I'd be happy to do it automatically when the class is instantiated (and let users disable that automatic probing by setting a value using the constructor argument).
On my Mac right now I'm seeing a limit of 500,000:
```
% sqlite3 -cmd ".limits variable_number"
        variable_number 500000
```
Tried this experiment (not proper binary search, it only searches downwards):
```python
import sqlite3

db = sqlite3.connect(":memory:")

def tryit(n):
    sql = "select 1 where 1 in ({})".format(", ".join("?" for i in range(n)))
    db.execute(sql, [0 for i in range(n)])

def find_limit(min=0, max=5_000_000):
    value = max
    while True:
        print('Trying', value)
        try:
            tryit(value)
            return value
        except:
            value = value // 2
```
Running `find_limit()` with those default parameters takes about 1.47s on my laptop:
```
In [9]: %timeit find_limit()
Trying 5000000
Trying 2500000
...
1.47 s ± 28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Interestingly, the value it returned was 156250 - which suggests that the macOS sqlite3 binary with its 500,000 limit isn't the same build as whatever my Python is using here.
- `%timeit find_limit(max=1_000_000)` took 378ms on my laptop
- `%timeit find_limit(max=500_000)` took 197ms
- `%timeit find_limit(max=200_000)` reported 53ms per loop
- `%timeit find_limit(max=100_000)` reported 26.8ms per loop
All of these are still slow enough that I'm not comfortable running this search every time the library is imported. Allowing users to opt in to this as a performance enhancement might be better.
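If we did ship an opt-in probe, a true binary search would converge on the exact limit instead of halving downwards until something works (which is why the loop above landed on 156250 rather than the actual ceiling). A sketch - the `probe_max_vars` name is just for illustration, this isn't part of sqlite-utils:

```python
import sqlite3


def probe_max_vars(high=1_000_000):
    """Binary-search for the largest number of SQL variables one statement accepts."""
    db = sqlite3.connect(":memory:")

    def works(n):
        sql = "select 1 where 1 in ({})".format(", ".join("?" for _ in range(n)))
        try:
            db.execute(sql, [0] * n)
            return True
        except sqlite3.OperationalError:
            return False

    lo, hi = 1, high
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if works(mid):
            lo = mid  # mid works, so the limit is at least mid
        else:
            hi = mid - 1  # mid is too many variables
    return lo
```

Probing a handful of common compile-time values first (999, 250,000, 500,000) before falling back to the binary search would get most installs down to the "2 or 3 probes" mentioned above.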
I tried changing the hard-coded value from 999 to 156_250 and running sqlite-utils insert
against a 500MB CSV file, with these results:
```
(sqlite-utils) sqlite-utils % time sqlite-utils insert slow-ethos.db ethos ../ethos-datasette/ethos.csv --no-headers
  [###################################-]   99%  00:00:00
sqlite-utils insert slow-ethos.db ethos ../ethos-datasette/ethos.csv   44.74s user 7.61s system 92% cpu 56.601 total

# Increased the setting here

(sqlite-utils) sqlite-utils % time sqlite-utils insert fast-ethos.db ethos ../ethos-datasette/ethos.csv --no-headers
  [###################################-]   99%  00:00:00
sqlite-utils insert fast-ethos.db ethos ../ethos-datasette/ethos.csv   39.40s user 5.15s system 96% cpu 46.320 total
```
Not as big a difference as I was expecting.
... but it looks like `batch_size` is hard-coded to 100, rather than `None` - which means it's not being calculated using that value.

I fiddled around and replaced that line with `batch_size = SQLITE_MAX_VARS // num_columns` - which evaluated to 10416 for this particular file. That got me this:

```
40.71s user 1.81s system 98% cpu 43.081 total
```
43s is definitely better than 56s, but it's still nowhere near the ~26.5s to ~3.5s improvement described by @simonwiles when opening this issue. I wonder what I'm missing here.
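For context on why that one-line change matters, here's a standalone sketch (plain `sqlite3`, not sqlite-utils internals) of how a batch size derived from the variable limit turns into chunked multi-row inserts, with one `?` placeholder per value:

```python
import sqlite3

SQLITE_MAX_VARS = 999  # the conservative default; raise to match your build


def chunked_insert(db, table, rows, columns):
    # Each row contributes len(columns) placeholders, so the largest batch
    # that stays under the limit is SQLITE_MAX_VARS // len(columns).
    batch_size = max(1, SQLITE_MAX_VARS // len(columns))
    row_placeholder = "({})".format(", ".join("?" for _ in columns))
    for i in range(0, len(rows), batch_size):
        batch = rows[i : i + batch_size]
        sql = "insert into [{}] ({}) values {}".format(
            table,
            ", ".join("[{}]".format(c) for c in columns),
            ", ".join(row_placeholder for _ in batch),
        )
        db.execute(sql, [row[c] for row in batch for c in columns])


db = sqlite3.connect(":memory:")
db.execute("create table demo (a, b)")
chunked_insert(db, "demo", [{"a": i, "b": i * 2} for i in range(50_000)], ["a", "b"])
```

The higher the limit, the fewer round-trips through the SQL parser per insert, which is where the speed-up comes from.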
I came across this while about to open an issue and PR against the documentation for `batch_size`, which is a bit incomplete.

As mentioned in #145, while the default limit is 999, it is common for it to be increased at compile time. Debian-based systems, for example, seem to ship with a version of sqlite compiled with SQLITE_MAX_VARIABLE_NUMBER set to 250,000, and I believe this is the case for homebrew installations too.

In working to understand what `batch_size` was actually doing and why, I realized that by setting `SQLITE_MAX_VARS` in `db.py` to match the value my sqlite was compiled with (I'm on Debian), I was able to decrease the time to `insert_all()` my test data set (~128k records across 7 tables) from ~26.5s to ~3.5s. Given that this is about 0.05% of my total dataset, this is time I am keen to save...

Unfortunately, it seems that `sqlite3` in the Python standard library doesn't expose the `get_limit()` C API (even though `pysqlite` used to), so it's hard to know what value sqlite has been compiled with (note that this could mean, I suppose, that it's less than 999, and even hardcoding `SQLITE_MAX_VARS` to the conservative default might not be adequate. It can also be lowered -- but not raised -- at runtime). The best I could come up with is `echo "" | sqlite3 -cmd ".limits variable_number"` (only available in sqlite >= 2015-05-07 (3.8.10)).
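Wrapped from Python, that check looks roughly like this (a sketch; it assumes a `sqlite3` binary on `PATH`, and that binary may be linked against a different SQLite build than Python's `sqlite3` module, as the macOS numbers above show):

```python
import subprocess


def cli_variable_limit():
    # Equivalent of: echo "" | sqlite3 -cmd ".limits variable_number"
    result = subprocess.run(
        ["sqlite3", "-cmd", ".limits variable_number"],
        input="", capture_output=True, text=True, check=True,
    )
    # Output looks like: "      variable_number 500000"
    return int(result.stdout.split()[-1])
```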
Obviously this couldn't be relied upon in `sqlite_utils`, but I wonder what your opinion would be about exposing `SQLITE_MAX_VARS` as a user-configurable parameter (with suitable "here be dragons" warnings)? I'm going to go ahead and monkey-patch it for my purposes in any event, but it seems like it might be worth considering.
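The monkey-patch in question amounts to overriding the module-level constant before doing any inserts (a sketch, assuming `SQLITE_MAX_VARS` remains a plain constant in `db.py`):

```python
import sqlite_utils.db

# 250_000 matches the Debian-compiled SQLITE_MAX_VARIABLE_NUMBER mentioned above;
# use whatever your SQLite build was actually compiled with.
sqlite_utils.db.SQLITE_MAX_VARS = 250_000
```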