vericast / conda-mirror

Mirror upstream conda channels
BSD 3-Clause "New" or "Revised" License
71 stars 59 forks source link

concurrent package validation #43

Closed jhoblitt closed 7 years ago

jhoblitt commented 7 years ago

It appears that the vast majority of the run-time is spent validating package digests. In my last couple of test, it took ~1 hour 15mins for a single platform of pkgs/free on an ec2 c4.2xlarge instance. This is acceptable but would likely see a near linear speed up with some simple parallelization.

ericdill commented 7 years ago

You are 100% correct that parallelization would dramatically reduce the runtime. I did play around with adding such a feature but was not pleased with the increased complexity. Since I run this as a nightly cron job, the extra time is not posing a real issue for me at the moment. That being said, if you are interested in parallelizing the validation, I'd be more than happy to review the pull request.

willirath commented 7 years ago

+1 for this feature.

It looks like https://github.com/maxpoint/conda-mirror/blob/master/conda_mirror/conda_mirror.py#L359 could be rewritten to a function:

def func_instead_of_inner_loop_validate(package):

    # ensure the packages in this directory are in the upstream
    # repodata.json
    try:
        package_metadata = package_repodata[package]
        except KeyError:
            logger.warning("%s is not in the upstream index. Removing...",
                           package)
            _remove_package(os.path.join(package_directory, package),
                            reason="Package is not in the repodata index")
        else:
            # validate the integrity of the package, the size of the package and
            # its hashes
            logger.info('Validating package {}'.format(package)
            _validate(os.path.join(package_directory, package),
                      md5=package_metadata.get('md5'),
                      size=package_metadata.get('size'))

(Note that I dropped idx for now.)

This allows for using map(func_instead_of_inner_loop_validate, sorted(local_packages)) instead of the loop.

Then, multiprocessing.map is all we need:

[...]
import multiprocessing
[...]
def _validate_packages(package_repodata, package_directory):
    [...]
    p = multiprocessing.Pool() # careful! This uses _all_ CPUs
    p.map(func_instead_of_inner_loop_validate, sorted(local_packages))
    p.close()
    p.terminate()
    p.join()

[...]
ericdill commented 7 years ago

@willirath Thanks for the interest. That seems like a sensible solution. Would you be able to submit a PR with this code change and we can discuss its details in that PR?

willirath commented 7 years ago

I am working on it: #45. Won't make it past a very rough sketch today. I'll be busy the next two weeks but definitely get back to this at the end of April.

ericdill commented 7 years ago

Awesome 🎉 ! Thanks for the effort. Ping me on the PR when you'd like some feedback.

willirath commented 7 years ago

Thanks for writing this in the first place. This tool already helped a log in getting rid of all my "maintain conda behind a tight firewall" problems.

ericdill commented 7 years ago

Closed by #48 . Thanks for the substantial effort here @willirath