Make database updates blazing fast

graue commented 11 years ago

tl;dr: they're too slow, make 'em fast. (Original rant follows:)

I just added a commit which prints time taken for create/update using this method:

start_time = datetime.today()  # at top of create/update function
# ... function body (including final db commit at the end) ...
end_time = datetime.today()    # at bottom of create/update function
print (end_time - start_time).total_seconds()

I tested the time to process 664 tracks on the local SSD on a late-2012 MacBook Air 13". I compared the post-DB overhaul commit 30dff9d, "Print time taken to process a DB update", run with DEBUG=True ./manage.py update, and the pre-DB overhaul commit e5dad7a "Fix broken DB_URI option", run with DEBUG=True ./manage.py createdb. (To the old commit, I added a timer that works the same as above.) All figures are the average of 5 trials.

5.14 sec (avg) to create DB first time, post-overhaul, no existing DB
0.62 sec (avg) to update existing DB, post-overhaul, no files changed
0.85 sec (avg) to create DB, pre-overhaul

So modest speedup (27%) in the case where nothing changed, but 500% worse for initial create! This is IMHO obviously not a good tradeoff. When nothing's changed we should get a much better speedup than 27%.

I'd like updates with no changes to be at least 50% faster than pre-overhaul creates, and initial creates to be at most 25% slower. Future changes should be guided by speed testing and profiling.

graue commented 11 years ago

So far it seems that DB commits are very expensive. (By removing the db.session.commit in aggregate_metadata, I got initial creates down to 2.07 sec average, though this causes incorrect behavior — see #30. Updates were not significantly affected.) In general I suspect looking up the album for every track is holding us back, and maintaining a Python dict of known albums may help.

I pushed a branch called createdb-baseline which is the old pre-overhaul commit I referred to above, with a status line and timer added.

tippenein commented 11 years ago

I realize my overhaul is not ideal. Don't worry about calling it bad when you have the numbers to prove it is bad :)

My thought is we need to use sqlalchemy better than we do now. There shouldn't be a commit within aggregate_metadata, but without it we have to figure out some way of creating albums within the session. (When I was trying this initially it wouldn't find the albums unless I committed each one)

Perhaps if we pass the session? Like, db.session.add(album) and return db.session from aggregate_metadata.

I'm not sure.. Maybe aggregate_metadata was ill conceived to begin with.

graue commented 9 years ago

Eh. I just run updates overnight now. Not the end of the world. Will create a new issue if I have concrete ideas I feel like trying out.

graue / potsfyi

Make database updates blazing fast #31