metagenomics / metagenomics-tk

GNU Affero General Public License v3.0

Improved checksum calculations #187

Open bosterholz opened 2 years ago

bosterholz commented 2 years ago

The md5sum process we currently use is single-threaded and takes ages to compute the checksum of the 240GB nr database. It would be nice to use a checksum algorithm or program that can be parallelized to speed this up.

pbelmann commented 2 years ago

Good catch! Maybe xargs is the easiest way to solve this.
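A minimal sketch of the xargs idea, assuming the database is available as several files rather than one archive (xargs can only run one md5sum per file, so a single tar would still be hashed on one core). The demo directory, the file names and the job count of 4 are placeholders for illustration, not taken from the toolkit.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical demo directory standing in for an extracted database.
db_dir="$(mktemp -d)"
for i in 1 2 3 4; do
  printf 'chunk %s\n' "$i" > "$db_dir/part_$i.bin"
done

# Hash every file with up to 4 md5sum processes running at once.
# -print0 / -0 keep unusual file names intact; -n 1 hands one file to each md5sum call.
# The output is sorted by file name so the manifest does not depend on finishing order.
manifest="$(mktemp)"
(cd "$db_dir" && find . -type f -print0 | xargs -0 -n 1 -P 4 md5sum | sort -k 2 > "$manifest")

# Verify the files against the manifest (GNU md5sum; --quiet prints only failures).
(cd "$db_dir" && md5sum -c --quiet "$manifest")

manifest_lines="$(wc -l < "$manifest")"
echo "hashed $manifest_lines files"
```

The manifest uses the standard md5sum format, so it can still be checked later with a plain `md5sum -c`.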

bosterholz commented 2 years ago

I tried two different algorithms (cksum and md5sum), but they finished within a few minutes of each other, and neither maxed out IO. We should take a look at parallel implementations, as it seems that the single core in use could be the bottleneck.

time cksum nr_2022-04-02_mmseqs_taxonomy.tar
2820021559 280000174080 nr_2022-04-02_mmseqs_taxonomy.tar

real    65m15.818s
user    21m25.004s
sys     2m6.584s

time md5sum nr_2022-04-02_mmseqs_taxonomy.tar
35b7bc1a96f0b337c12713d4d3d4b4d3  nr_2022-04-02_mmseqs_taxonomy.tar

real    60m19.030s
user    13m45.796s
sys     2m22.600s
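One possible way to use more than one core on a single large file is to hash fixed-size chunks in parallel and then hash the ordered list of chunk digests. The sketch below illustrates that idea only; `chunked_md5`, the chunk size and the job count are hypothetical names and values, and the resulting digest is not the same as the plain md5sum of the file, so existing checksums would have to be regenerated. It would also only help if the CPU, rather than IO, really is the limit.

```shell
#!/usr/bin/env bash
set -euo pipefail

# chunked_md5 FILE CHUNK_BYTES JOBS
# Hypothetical helper: prints the md5 of the ordered per-chunk md5 digests.
chunked_md5() {
  local file="$1" chunk="$2" jobs="$3"
  local size n
  size="$(wc -c < "$file")"
  n=$(( (size + chunk - 1) / chunk ))   # number of chunks, rounded up

  # seq emits the chunk indices; each worker cuts out one chunk with
  # tail/head and prints "<index> <digest>". Sorting by index restores
  # the chunk order before the digests are hashed once more.
  seq 0 $(( n - 1 )) \
    | xargs -P "$jobs" -I{} sh -c \
        'printf "%s %s\n" "$3" "$(tail -c +"$(( $3 * $2 + 1 ))" "$1" | head -c "$2" | md5sum | cut -d" " -f1)"' \
        sh "$file" "$chunk" {} \
    | sort -n \
    | cut -d" " -f2 \
    | md5sum | cut -d" " -f1
}

# Demo on a small random file; the real archive would use far larger chunks.
demo_file="$(mktemp)"
head -c 5000000 /dev/urandom > "$demo_file"

serial="$(chunked_md5 "$demo_file" 1048576 1)"
parallel="$(chunked_md5 "$demo_file" 1048576 4)"
echo "serial:   $serial"
echo "parallel: $parallel"
```

The digest depends on the chunk size but not on the number of jobs, so the chunk size would need to be recorded alongside the checksum.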