Handle more than a few thousand recipes

Converted from SourceForge issue 1198293, submitted by SourceForge user calvin on 2005-05-09 13:36:01 UTC.

The current (0.8.3.4) implementation can handle no more than 5000-6000 recipes. With more than that, everything (startup, import, browsing) is getting very slow, up to the point where the program is unusable. If you want, I can upload a ~40MB recipe DB with several ten thousand recipes. On such a DB startup takes several minutes on my machine.

Looking through the source code I found several culprits

1) Primary keys have type char() This is clearly a showstopper for optimizations. Primary keys should be integers, esp. when they are referenced from other tables or when indexes are created upon them. Solution: replace "CHAR()" type with "INTEGER"

2) Unique ID generation is done in Python, not in the DB This is especially slow when thousands of recipes/ingredients are stored in the DB. Solution: use AUTOINCREMENT for primary keys (sqlite supports this). Then you do not need the new_id() method anymore.

3) Not all DB tables have primary keys When a table has no primary key (and no index created on them), all join operations are full table scans which is slow.

4) encoding detection is slow for large files The encoding detection when importing is slow since it calls a lot of text.encode() where text is potentially very large. Solution: make encoding detection only on demand, ie. when the user pushes a button. Otherwise the user can also manually select the encoding without detection.

Other optimization tips:

1) use driver-specifc pragmas, for example for sqlite I recommend PRAGMA cache_size = 20000; -- default is 2000 PRAGMA temp_store = MEMORY; -- default is DEFAULT This uses more memory, but not much more. Each cache page uses 1.5kB of memory, so 20000 cache pages use ~20MB which is not very much nowadays.

2) support profiling I'd like to be able to start gourmet with a "--profile" option to gather profiling data. The --profile option should automatically disable psyco, and gather profile data with: import profile profile.run("main()", "gourmet_profile.dat")

Finally, I can offer you to implement some of these suggestions, but since the DB stuff is pretty lowlevel this should probably handled by the main author :)

Submitted by SourceForge user thomas_hinkle on 2005-05-10 16:13:09 UTC.

Logged In: YES user_id=1030390

Finally, I can offer you to implement some of these suggestions,

Much appreciated!

but since the DB stuff is pretty lowlevel this should probably handled by the main author :)

I see no reason this should be true :)

I say this half in jest, but truth be told -- the DB stuff should be pretty pluggable -- in other words, if you can make the low level stuff act the same as what's there currently (but faster), it should just work.

I tried to start abstracting the DB backend stuff in the backends/ directory of the Gourmet tree -- rdatabase defines generic stuff which is then implemented by rmetakit (the only one currently working), rsqlite (did work at one time), and rmysql (did work at one time).

I temporarily abandoned the sql backends due to threading concerns, but Gourmet now works fine with the "GourmetFauxThreads"module, meaning we don't need threading support anyway, so there's no reason not to play with other backends.

I implemented basic code to make SQL (which I don't know) look like metakit (which Gourmet was built with) -- that code is in PythonicSQL, pythonic_sqlite, etc. All though this stuff is very low level, it would definitely be better if it were handled by someone other than the author.

So, e-mail me if you want to start tinkering with individual pieces and want some more guidance. I've been trying to improve the documentation in the code piece by piece to make this do-able, and I think the fastest way to improvements would be to get some other eyes on the code throughout (as this post suggests!).

Tom

thinkle / gourmet

Handle more than a few thousand recipes #201