You are 100% correct that parallelization would dramatically reduce the runtime. I did play around with adding such a feature but was not pleased with the increased complexity. Since I run this as a nightly cron job, the extra time is not posing a real issue for me at the moment. That being said, if you are interested in parallelizing the validation, I'd be more than happy to review the pull request.
+1 for this feature.
It looks like https://github.com/maxpoint/conda-mirror/blob/master/conda_mirror/conda_mirror.py#L359 could be rewritten as a function:
def func_instead_of_inner_loop_validate(package):
    # ensure the packages in this directory are in the upstream
    # repodata.json
    try:
        package_metadata = package_repodata[package]
    except KeyError:
        logger.warning("%s is not in the upstream index. Removing...",
                       package)
        _remove_package(os.path.join(package_directory, package),
                        reason="Package is not in the repodata index")
    else:
        # validate the integrity of the package: its size and its hashes
        logger.info('Validating package {}'.format(package))
        _validate(os.path.join(package_directory, package),
                  md5=package_metadata.get('md5'),
                  size=package_metadata.get('size'))
(Note that I dropped idx for now.)
This allows using map(func_instead_of_inner_loop_validate, sorted(local_packages)) instead of the loop.
Then, multiprocessing.Pool.map is all we need:
[...]
import multiprocessing
[...]
def _validate_packages(package_repodata, package_directory):
    [...]
    p = multiprocessing.Pool()  # careful! This uses _all_ CPUs
    p.map(func_instead_of_inner_loop_validate, sorted(local_packages))
    # close() then join() is enough here; map() already blocks until
    # every package has been processed
    p.close()
    p.join()
[...]
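One caveat (an assumption on my part, not something I have tested against conda-mirror): multiprocessing.Pool.map has to pickle the callable it dispatches, so func_instead_of_inner_loop_validate probably cannot stay a function defined inside _validate_packages that closes over package_repodata and package_directory. A minimal sketch of one way around that, using a module-level helper plus functools.partial (the name _validate_or_remove is hypothetical):

[...]
import functools
import multiprocessing
[...]
def _validate_or_remove(package_repodata, package_directory, package):
    # module-level so Pool.map can pickle a reference to it;
    # same body as func_instead_of_inner_loop_validate above
    try:
        package_metadata = package_repodata[package]
    except KeyError:
        logger.warning("%s is not in the upstream index. Removing...",
                       package)
        _remove_package(os.path.join(package_directory, package),
                        reason="Package is not in the repodata index")
    else:
        logger.info('Validating package {}'.format(package))
        _validate(os.path.join(package_directory, package),
                  md5=package_metadata.get('md5'),
                  size=package_metadata.get('size'))
[...]
def _validate_packages(package_repodata, package_directory):
    [...]
    # bind the shared arguments so each worker only receives the package name
    validate_one = functools.partial(_validate_or_remove,
                                     package_repodata, package_directory)
    p = multiprocessing.Pool()  # still uses all CPUs by default
    p.map(validate_one, sorted(local_packages))
    p.close()
    p.join()
[...]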
@willirath Thanks for the interest. That seems like a sensible solution. Would you be able to submit a PR with this code change and we can discuss its details in that PR?
I am working on it: #45. Won't make it past a very rough sketch today. I'll be busy the next two weeks but will definitely get back to this at the end of April.
Awesome 🎉 ! Thanks for the effort. Ping me on the PR when you'd like some feedback.
Thanks for writing this in the first place. This tool has already helped a lot in getting rid of all my "maintain conda behind a tight firewall" problems.
Closed by #48. Thanks for the substantial effort here @willirath
It appears that the vast majority of the run-time is spent validating package digests. In my last couple of tests, it took ~1 hour 15 minutes for a single platform of pkgs/free on an EC2 c4.2xlarge instance. This is acceptable, but would likely see a near-linear speedup with some simple parallelization.
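For context, the digest check that dominates the run time is, at its core, a per-file MD5 comparison against the hash recorded in the upstream repodata. A rough sketch of that kind of check (the helper name and chunk size here are mine, not conda-mirror's actual _validate implementation):

import hashlib

def _md5_matches(path, expected_md5, chunk_size=1 << 20):
    # hash the package in chunks so large .tar.bz2 files are not
    # read into memory all at once
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5

Because each package is hashed independently, this is exactly the kind of per-file work that Pool.map can spread across cores, which is why a near-linear speedup seems plausible.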