Open bosterholz opened 2 years ago
Good catch! Maybe xargs is the easiest way to solve this.
I tried two different algorithms (cksum and md5sum), but they finished within about five minutes of each other (roughly 65 vs. 60 minutes wall-clock) without maxing out IO. We should take a look at parallel implementations, as it seems that the single core in use could be the bottleneck.
time cksum nr_2022-04-02_mmseqs_taxonomy.tar
2820021559 280000174080 nr_2022-04-02_mmseqs_taxonomy.tar
real 65m15.818s
user 21m25.004s
sys 2m6.584s
time md5sum nr_2022-04-02_mmseqs_taxonomy.tar
35b7bc1a96f0b337c12713d4d3d4b4d3 nr_2022-04-02_mmseqs_taxonomy.tar
real 60m19.030s
user 13m45.796s
sys 2m22.600s
The md5sum process we currently use is single-threaded and takes ages to calculate the checksum of a 240GB nr database. It would be nice to use a checksum algorithm or program that can be parallelized to speed this up.
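
One possible direction, as a rough sketch only: split the file into fixed-size chunks, hash the chunks in parallel with xargs, and then hash the ordered list of chunk hashes. The function name `parallel_chunk_md5` and its parameters are made up for illustration, and the sketch assumes GNU coreutils (`stat -c`, `nproc`, `dd iflag=fullblock`) and GNU xargs (`-P`, `-r`). Note that the result is a hash of hashes, so it will not match the output of plain `md5sum` on the same file. Both sides of a verification would have to use the same procedure and chunk size.

```shell
# Hypothetical helper: parallel_chunk_md5 FILE [CHUNK_MIB] [JOBS]
# Prints one MD5 computed over the ordered MD5 sums of fixed-size chunks.
# The body runs in a subshell ( ... ) so its variables stay local.
parallel_chunk_md5() (
    file=$1
    chunk_mib=${2:-1024}      # chunk size in MiB (assumed default: 1 GiB)
    jobs=${3:-$(nproc)}       # number of parallel workers

    size=$(stat -c %s -- "$file") || exit 1
    chunk_bytes=$(( chunk_mib * 1024 * 1024 ))
    # Number of chunks, rounded up
    nchunks=$(( (size + chunk_bytes - 1) / chunk_bytes ))

    # Each worker reads one chunk with dd and prints "<index> <md5>".
    # sort -n restores chunk order before the final hash is computed.
    seq 0 $(( nchunks - 1 )) \
        | xargs -r -P "$jobs" -I{} sh -c '
            sum=$(dd if="$1" bs=1M skip=$(( $3 * $2 )) count="$2" \
                     iflag=fullblock 2>/dev/null | md5sum | cut -d" " -f1)
            printf "%s %s\n" "$3" "$sum"
        ' sh "$file" "$chunk_mib" {} \
        | sort -n \
        | cut -d" " -f2 \
        | md5sum \
        | cut -d" " -f1
)
```

It could be called like `parallel_chunk_md5 nr_2022-04-02_mmseqs_taxonomy.tar 1024 8`. Whether this actually helps depends on where the bottleneck is: if storage cannot deliver data faster than one core can hash it, more workers will not reduce the wall-clock time.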