lsst-uk / csd3-echo-somerville

Code to backup from CSD3 to Echo S3, curate at STFC cloud and expose to Somerville
Apache License 2.0

Parallel filesystem traversal #74

Open davedavemckay opened 1 month ago

davedavemckay commented 1 month ago

Looks surprisingly easy. Source: https://stackoverflow.com/questions/29614584/parallel-directory-walk-python
Rewrite the code below to use Dask:

import itertools
import multiprocessing
import os

def worker(filename):
    pass   # do something here!

def main():
    with multiprocessing.Pool(48) as pool:  # pool of 48 processes
        walk = os.walk("some/path")
        # Flatten os.walk's (root, dirs, files) tuples into one stream of full file paths
        fn_gen = itertools.chain.from_iterable((os.path.join(root, file)
                                                for file in files)
                                               for root, dirs, files in walk)

        results_of_work = pool.map(worker, fn_gen)  # this does the parallel processing

if __name__ == "__main__":
    main()
davedavemckay commented 1 month ago

The issue is that the code that would go under worker(filename) is complicated.
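
For reference, a minimal sketch of what the Dask rewrite requested above might look like, using dask.bag. Everything here is an assumption for illustration, not the repository's code: `iter_files` and `process_tree` are hypothetical helper names, and `worker` is a stand-in that only returns the file size, since the real per-file logic is the complicated part. As in the multiprocessing version, the `os.walk` traversal itself stays serial; only the per-file work is spread across workers.

```python
import os


def iter_files(top):
    """Yield the full path of every file under `top` (serial os.walk)."""
    for root, _dirs, files in os.walk(top):
        for name in files:
            yield os.path.join(root, name)


def worker(filename):
    # Hypothetical placeholder for the real per-file work:
    # return the path together with its size in bytes.
    return filename, os.path.getsize(filename)


def process_tree(top, npartitions=48, scheduler="processes"):
    """Apply `worker` to every file under `top` using a Dask bag."""
    # Imported here so iter_files/worker can be reused without Dask installed.
    import dask.bag as db

    paths = list(iter_files(top))
    if not paths:
        return []
    # Split the path list into partitions that Dask can process in parallel.
    bag = db.from_sequence(paths, npartitions=npartitions)
    return bag.map(worker).compute(scheduler=scheduler)


if __name__ == "__main__":
    for path, size in process_tree("some/path"):
        print(path, size)
```

The `scheduler` argument is passed straight to `compute()`; "processes" mirrors the multiprocessing pool above, while "threads" or "synchronous" can be easier to debug with. If the real worker is complicated, it only needs to be a function that takes one path and returns a picklable result.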