Closed vitorpio closed 4 years ago
Pickle has lots of downsides.
First (the biggest) is the security of it. Pickle has a custom reduce API which means when you deserialize pickle files it can run any method in the standard library see more
The second is performance. Pickle itself is quite fast, but you're loading a large file into a single object, this is going to use a lot of memory and put a lot of load onto the disk.
Because the CLI is only typically loading a small part of the file, it's inefficient to deserialize the entire bible. Instead, I recommend indexing it and then splitting it into smaller parts. Use the file system as a database.
You can use any serialization format for this, it doesn't matter. JSON is more secure, there are faster implementations of json like orjson. You can stick with pickle if you want to.
Your directory might look like:
-- books/
-----KJV/
--------index.json
--------books/
-----------genesis.json
-----------exodus.json
-----ELT/
--------index.json
--------books/
-----------genesis.json
-----------exodus.json
If you don't want the package to have tons of files, you can zip the books and then unzip either inline or when it's installed.
NTLK is a good example implementation of this type of model https://github.com/nltk/nltk/blob/develop/nltk/downloader.py
I will run some time comparison tests to check if the cli is really slow (more than a sec to load). But i will split the pickle files anyways and add serialized pickles to avoid tempering.
@tonybaloney i have created a branch(replace_pickle_for_json) now using json, can you please check it ? The next step is splitting into smaller files for faster performance.
@tonybaloney Results after taking the mean of 10 runs:
pickle -> 1.4725032329559327 json -> 0.8619775295257568
SO JSON IS THE WINNER !
@vitorpio can you raise a PR for it?
Do "New Pull Request" and change the compare branch to replace_pickle_for_json
. It'll make it easier to do code review
Unpickling the bible structure is taking some time and it does not seen efficient to load the whole pickle file just to quote a single verse in some cases.