shuzhao-li-lab / asari

asari, metabolomics data preprocessing

Pickle size #96

Closed jmmitc06 closed 3 weeks ago

jmmitc06 commented 3 weeks ago

This may be largely unavoidable; however, the pickles produced by processing certain types of data are very large, often 10x the size of the input mzML file. Ultimately, this limits the size of an analysis that can be performed on a regular machine.

I'm opening this mainly so I can associate it with the branch I will create to address it.

jmmitc06 commented 3 weeks ago

Surprisingly, compression works very well on these pickles. The typical improvement in disk space is ~30x (e.g., 877.7 MB to 25.2 MB for gzip), but the increase in processing time is noticeable during extraction. Dumping with no compression takes 0.28 seconds, while compression is considerably slower: 55 seconds for gzip and 45 seconds for lzma. bz2, though, requires only 10 seconds and offers the best compression of the three, with an output size of 17.3 MB! Will implement this for version 2.
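For reference, a comparison like the one above can be run with nothing but the standard library. This is a minimal sketch, not asari code; the function name and the test object are illustrative only:

```python
import bz2
import gzip
import lzma
import pickle
import time


def compare_pickle_compression(obj):
    """Pickle obj once, then report (compressed size, compression time)
    for each stdlib codec. Hypothetical helper for benchmarking only."""
    raw = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    results = {"raw": (len(raw), 0.0)}
    for name, compress in (("gzip", gzip.compress),
                           ("bz2", bz2.compress),
                           ("lzma", lzma.compress)):
        t0 = time.perf_counter()
        blob = compress(raw)
        results[name] = (len(blob), time.perf_counter() - t0)
    return results


# Highly repetitive data (like sparse chromatogram tracks) compresses dramatically.
stats = compare_pickle_compression([0.0] * 100_000)
```

Actual ratios and timings depend heavily on the data; the ~30x figure above comes from real track pickles, which are evidently very redundant on disk.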

jmmitc06 commented 3 weeks ago

Decided to use gzip with compresslevel=1. It seems like the best compromise between space and runtime for our data. This has been implemented in the compress_tracks branch.

shuzhao-li commented 3 weeks ago

The use of pickle is to shift the memory limit to I/O. Compression adds a performance cost. Asari currently goes back to the pickles to extract peak area per sample. We may find a solution to get rid of pickles altogether.