vericast / conda-mirror

Mirror upstream conda channels
BSD 3-Clause "New" or "Revised" License
72 stars 60 forks source link

Improve validation performance #69

Closed diogocp closed 6 years ago

diogocp commented 6 years ago

Do not attempt to extract the tarball if the MD5 matches. This makes validation an order of magnitude faster (in one test, from around 12 hours to less than 12 minutes).

codecov-io commented 6 years ago

Codecov Report

Merging #69 into master will decrease coverage by 1.94%. The diff coverage is 20%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #69      +/-   ##
==========================================
- Coverage   96.88%   94.94%   -1.95%     
==========================================
  Files           2        2              
  Lines         257      257              
==========================================
- Hits          249      244       -5     
- Misses          8       13       +5
Impacted Files Coverage Δ
conda_mirror/conda_mirror.py 94.88% <20%> (-1.97%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 7f92926...b3648be. Read the comment docs.

parente commented 6 years ago

@diogocp Thanks for the PR. I've looked through it and the change of logic to validate md5 and size before doing an extract makes sense to me.

Test coverage dropped quite a bit. I'd like to look at the tests a bit to make sure validation failures are being shortcircuited for the right reason.

parente commented 6 years ago

Got myself re-acquainted with the tests and project. I'm comfy merging what's here now. Thanks for the PR @diogocp. 🍰

magnuhho commented 6 years ago

Wouldn't it be better to check the size before the md5? Size check is faster than hash calculation and if there is a size mismatch chances are pretty small the md5 is right. The other way around, chances are slim the size is NOT correct if the md5 validates.

diogocp commented 6 years ago

Yes, it would be reasonable to do it that way. However, I expect that you will almost never see a validation failure, so it won't make much of a difference either way.

What really slowed down the process a lot was decompressing every single package.