vitorpio / pybible-cli

Bible reference CLI application
MIT License

Unpickling the bible file is taking too long and does not seem efficient #2

Closed vitorpio closed 4 years ago

vitorpio commented 4 years ago

Unpickling the bible structure is taking some time, and it does not seem efficient to load the whole pickle file just to quote a single verse in some cases.

tonybaloney commented 4 years ago

Pickle has lots of downsides.

First (and the biggest) is security. Pickle has a custom reduce API, which means that deserializing a pickle file can execute arbitrary code; a quick sketch of that is below.
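For illustration only, a minimal sketch (not from this repo) of how the reduce hook lets a crafted pickle run code at load time:

import os
import pickle


class Malicious:
    def __reduce__(self):
        # Tells pickle to call os.system("echo pwned") during deserialization.
        return (os.system, ("echo pwned",))


payload = pickle.dumps(Malicious())
pickle.loads(payload)  # executes the shell command with no warning

Because of this, pickle files should only ever be loaded from trusted sources.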

The second is performance. Pickle itself is quite fast, but you're loading a large file into a single object, which is going to use a lot of memory and put a lot of load on the disk.

Because the CLI is only typically loading a small part of the file, it's inefficient to deserialize the entire bible. Instead, I recommend indexing it and then splitting it into smaller parts. Use the file system as a database.

Option 1: File/Folder based structure

You can use any serialization format for this; it doesn't matter. JSON is more secure, and there are faster implementations such as orjson. You can stick with pickle if you want to.

  1. Create an index/lookup file; when something is requested, it returns the target file
  2. Deserialize that particular file. It could be a section (e.g. Gospels, Letters of Paul) or a specific book; see the sketch after the directory layout below

Your directory might look like:

-- books/
-----KJV/
--------index.json
--------books/
-----------genesis.json
-----------exodus.json
-----ELT/
--------index.json
--------books/
-----------genesis.json
-----------exodus.json
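
A minimal sketch of steps 1 and 2 against a layout like the one above, assuming a hypothetical index.json that maps book names to per-book files:

import json
from pathlib import Path

BASE = Path("books")


def load_book(translation, book):
    # Step 1: read the small index file to find which file holds the book.
    index = json.loads((BASE / translation / "index.json").read_text())
    book_file = BASE / translation / "books" / index[book]
    # Step 2: deserialize only that file instead of the whole bible.
    return json.loads(book_file.read_text())

Only the index plus one small book file gets read for a single quote, which keeps both memory use and disk I/O low.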

Option 2: Use SQLite
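
One possible shape for this option (hypothetical schema and file name), storing one row per verse so a single quote only touches a few pages of the database file:

import sqlite3

conn = sqlite3.connect("bible.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS verses (
           translation TEXT, book TEXT, chapter INTEGER, verse INTEGER, text TEXT,
           PRIMARY KEY (translation, book, chapter, verse)
       )"""
)


def get_verse(translation, book, chapter, verse):
    # Look up a single verse; the primary key doubles as the index.
    row = conn.execute(
        "SELECT text FROM verses WHERE translation=? AND book=? AND chapter=? AND verse=?",
        (translation, book, chapter, verse),
    ).fetchone()
    return row[0] if row else None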

tonybaloney commented 4 years ago

If you don't want the package to have tons of files, you can zip the books and then unzip them either on the fly or when the package is installed.
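For example, a sketch of reading a single book straight out of the archive without unpacking everything (archive and member names are hypothetical):

import json
import zipfile

with zipfile.ZipFile("books.zip") as archive:
    # Open just the one member; nothing else is extracted to disk.
    with archive.open("KJV/books/genesis.json") as member:
        genesis = json.load(member)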

tonybaloney commented 4 years ago

NLTK is a good example implementation of this type of model: https://github.com/nltk/nltk/blob/develop/nltk/downloader.py

vitorpio commented 4 years ago

I will run some time comparison tests to check whether the CLI is really slow (more than a second to load). But I will split the pickle files anyway and add serialized pickles to avoid tampering.
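One way such a comparison could be run (hypothetical file names and loaders), averaging over 10 runs as in the results below:

import json
import pickle
import timeit


def load_pickle():
    with open("bible.pickle", "rb") as fh:
        return pickle.load(fh)


def load_json():
    with open("bible.json") as fh:
        return json.load(fh)


# Mean load time per run, in seconds.
print("pickle:", timeit.timeit(load_pickle, number=10) / 10)
print("json:  ", timeit.timeit(load_json, number=10) / 10)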

vitorpio commented 4 years ago

@tonybaloney I have created a branch (replace_pickle_for_json) that now uses JSON, can you please check it? The next step is splitting it into smaller files for faster performance.

vitorpio commented 4 years ago

@tonybaloney Results after taking the mean of 10 runs:

pickle -> 1.4725032329559327 s
json   -> 0.8619775295257568 s

SO JSON IS THE WINNER!

tonybaloney commented 4 years ago

@vitorpio can you raise a PR for it?

Do "New Pull Request" and change the compare branch to replace_pickle_for_json. It'll make it easier to do code review