timpalpant / LittleBoxes

A crossword solver
GNU General Public License v3.0

Loading databases from text files is slow #2

Closed timpalpant closed 8 years ago

timpalpant commented 8 years ago

During startup we load several databases into memory to use for solving, such as the clue DB and a dictionary. Loading these from text files is slow (~45s).

Pre-process these resources into a format that can be loaded more quickly (maybe just a Pickle file). It will save time in the long run.

jpalpant commented 8 years ago

I want to learn about pickling, so I'll write this if you don't have it done already!

timpalpant commented 8 years ago

Go for it! I haven't started anything and it would be super helpful.

jpalpant commented 8 years ago

My first attempt at pickling a ClueDB failed: pickle and cPickle apparently both struggle with lambda functions, and ClueDB._clue_to_answers is a defaultdict initialized with a lambda. Looking for solutions to this now.
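For anyone hitting the same error, here is a minimal sketch of the failure and one way around it. It assumes the default value is an empty list, which may not match what ClueDB actually uses, and it works on a bare defaultdict rather than the real class.

```python
import pickle
from collections import defaultdict

# Assumption: the default value is an empty list; the real factory may differ.
with_lambda = defaultdict(lambda: [])
with_lambda["Feline pet"].append("CAT")

try:
    pickle.dumps(with_lambda)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    # pickle stores functions by name, and a lambda has no name
    # that can be looked up again at load time.
    lambda_pickled = False

# A factory that pickle can find by name works; the built-in list is the simplest.
with_named = defaultdict(list)
with_named["Feline pet"].append("CAT")
restored = pickle.loads(pickle.dumps(with_named, protocol=pickle.HIGHEST_PROTOCOL))
```

A named module-level function behaves like `list` here, as long as the module that defines it can be imported when the file is loaded.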

jpalpant commented 8 years ago

A temporary fix, replacing the lambda with an importable function, works and has let me do some speed testing. On my machine, loading a database takes... well, see for yourself:

Test: Loading
Loading times: mean=67.9141685009, std=3.17116179008, median=68.6186280251

Test: Unpickling (highest priority)
Loading times: mean=41.9816338062, std=6.7354113375, median=42.4331450462

Each test was run with 10 trials; the results are consistent enough that that's more than enough. Unpickling is definitely an improvement over loading from text, but not a huge one. I'll look for something better, and if I don't find it, pickling it is, and we can think about how to fix the lambda problem.
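A timing harness along these lines would produce numbers in that shape. This is only a sketch: `time_trials` and its `load` argument are hypothetical names, and using the population standard deviation is an assumption about how the figures above were computed.

```python
import statistics
import time

def time_trials(load, trials=10):
    """Call load() `trials` times and summarise the wall-clock durations in seconds."""
    durations = []
    for _ in range(trials):
        start = time.perf_counter()
        load()
        durations.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(durations),
        # Assumption: population standard deviation over the trials.
        "std": statistics.pstdev(durations),
        "median": statistics.median(durations),
    }
```

Passing each loader as a zero-argument callable, for example `time_trials(lambda: load_from_text(path))` with a hypothetical `load_from_text`, keeps the comparison between formats like for like.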

Edit: I have not checked whether pickling and unpickling a ClueDB gives back the same ClueDB you started with, because __eq__ is not well-defined for ClueDBs. Not worrying about that yet; will later.
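One possible shape for that equality check, as a sketch only. The `ClueDB` below is a hypothetical stand-in holding nothing but the clue-to-answers mapping; the real class may carry more state that would also need comparing.

```python
from collections import defaultdict

class ClueDB:
    """Hypothetical stand-in for the real ClueDB."""

    def __init__(self):
        self._clue_to_answers = defaultdict(list)

    def add(self, clue, answer):
        self._clue_to_answers[clue].append(answer)

    def __eq__(self, other):
        if not isinstance(other, ClueDB):
            return NotImplemented
        # Dict comparison looks only at keys and values, not the default factory.
        return self._clue_to_answers == other._clue_to_answers
```

With this in place, a round-trip test reduces to asserting that the reloaded database equals the original.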

jpalpant commented 8 years ago

We have a new challenger (and a likely winner, once I write the tests that check to make sure the serialization works):

Test: Loading
Loading times: mean=65.3548296293, std=0.95613521012, median=66.0260429382

Test: MessagePack
Loading times: mean=17.1421430111, std=3.55922264153, median=15.5907959938

Test: Unpickle
Loading times: mean=27.8242882888, std=3.72844386529, median=27.577742815

And in terms of the size: [screenshot from 2016-02-28, image not available]

MessagePack is much more limited than Pickle in terms of what it can serialize, so you have to write some custom prep code to get the class ready to serialize (and likewise to deserialize it). But for ClueDB it's pretty simple code, at least for now. A few more tests to write and then it'll be ready to pull.
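To give an idea of what that prep code can look like, here is a rough sketch and not the code from the pull request. The `ClueDB` stand-in and the helpers `to_plain`, `from_plain`, `dumps`, and `loads` are all hypothetical names, and the sketch assumes the database is just a mapping from clue strings to lists of answer strings. The `msgpack.packb` and `msgpack.unpackb` calls come from the msgpack Python package.

```python
from collections import defaultdict

class ClueDB:
    """Hypothetical stand-in for the real ClueDB."""

    def __init__(self):
        self._clue_to_answers = defaultdict(list)

def to_plain(db):
    """Reduce the database to dicts, lists and strings, which MessagePack can encode."""
    return {
        "clue_to_answers": {
            clue: list(answers) for clue, answers in db._clue_to_answers.items()
        }
    }

def from_plain(data):
    """Rebuild a ClueDB from the structure produced by to_plain."""
    db = ClueDB()
    for clue, answers in data["clue_to_answers"].items():
        db._clue_to_answers[clue].extend(answers)
    return db

def dumps(db):
    # Imported here so the conversion helpers work without msgpack installed.
    import msgpack
    return msgpack.packb(to_plain(db), use_bin_type=True)

def loads(blob):
    import msgpack
    return from_plain(msgpack.unpackb(blob, raw=False))
```

Keeping the conversion separate from the encoding step means the defaultdict and its factory never reach the serializer, so the lambda problem does not come up.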

timpalpant commented 8 years ago

Merged #14 -- Much faster, nice job finding MessagePack!

jpalpant commented 8 years ago

Resolved in #14 - can we close? Look at that