justincely / lightcurve_pipeline

Pipeline for high-level science products of HST 13902
BSD 3-Clause "New" or "Revised" License

Implement parallel processing #12

Closed: bourque closed this issue 9 years ago

bourque commented 9 years ago

Implementing some parallel processing in the ingest_hstlc pipeline using the multiprocessing library could speed things up considerably.
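
For context, the shape of that change is a worker function mapped over the files via a process pool; a minimal sketch, with `ingest_file` and the file list as hypothetical stand-ins for the pipeline's real ingest logic:

    from multiprocessing import Pool

    def ingest_file(filename):
        """Stand-in for the pipeline's per-file ingest logic."""
        print('Ingesting {}'.format(filename))

    if __name__ == '__main__':
        filenames = ['f1.fits', 'f2.fits']  # files queued for ingestion
        pool = Pool(processes=8)
        pool.map(ingest_file, filenames)  # blocks until every file is processed
        pool.close()
        pool.join()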

bourque commented 9 years ago

I created the multiprocessing branch to work on this. In commit 812de50a2351c983efd05ff3ef1ccb6bc5013655, I added parallel processing of the files to be ingested. However, this really messes up the logging: messages from individual files get interleaved and are hard to follow. I'm not sure what to do about this at the moment.
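
One common remedy, sketched here rather than taken from the actual commits: route every worker's log records through a shared queue so that a single listener in the parent process does all of the writing. This assumes Python 3.2+ (for logging.handlers.QueueHandler and QueueListener); `ingest_file` and the log filename are hypothetical.

    import logging
    import logging.handlers
    import multiprocessing

    def init_worker(queue):
        """Make this worker send its log records to the shared queue."""
        root = logging.getLogger()
        root.handlers = [logging.handlers.QueueHandler(queue)]
        root.setLevel(logging.INFO)

    def ingest_file(filename):
        logging.info('Ingesting %s', filename)

    if __name__ == '__main__':
        manager = multiprocessing.Manager()
        queue = manager.Queue(-1)
        # A single listener in the parent process writes all records
        listener = logging.handlers.QueueListener(
            queue, logging.FileHandler('ingest_hstlc.log'))
        listener.start()
        pool = multiprocessing.Pool(8, initializer=init_worker, initargs=(queue,))
        pool.map(ingest_file, ['f1.fits', 'f2.fits'])
        pool.close()
        pool.join()
        listener.stop()

This keeps individual records intact, though records from different files will still interleave in time; grouping each file's messages together would take something extra, such as buffering per file and emitting them in one block.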

bourque commented 9 years ago

Another issue that is cropping up from multiprocessing is the creation of directories, specifically:

    if not os.path.exists(directory):
        logging.info('\tCreating directory {}'.format(directory))
        os.mkdir(directory)
        set_permissions(directory)

When two separate processes try to make the same directory at the same time, one process can pass the if statement just as the other process creates the directory, so its os.mkdir call fails. I found a Stack Overflow solution that I'm going to try out.

EDIT: This seems to have fixed things:

    if not os.path.exists(directory):
        try:
            logging.info('\tCreating directory {}'.format(directory))
            os.mkdir(directory)
            set_permissions(directory)
        except OSError:
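            # another process created the directory first; nothing to do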
            pass
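
A stricter variant of this fix (a sketch, not the code that was merged) would inspect the error code so that only the "directory already exists" case is silenced and any other failure still propagates:

    import errno
    import logging
    import os

    def make_directory(directory):
        """Create directory, tolerating another process winning the race."""
        try:
            os.mkdir(directory)
            logging.info('\tCreating directory {}'.format(directory))
        except OSError as error:
            if error.errno != errno.EEXIST:
                raise  # a real failure (permissions, bad path, ...)

On Python 3.2+, os.makedirs(directory, exist_ok=True) handles the same race directly.
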
bourque commented 9 years ago

Commit ef8fe49a84bf1855d8387699243a308801559d81 introduces multiprocessing for composite lightcurve creation.

bourque commented 9 years ago

Pull request #13 implements multiprocessing in the pipeline. @justincely, perhaps you could take a look, make sure I'm not doing anything crazy, and merge if it's OK.

bourque commented 9 years ago

Pull request #14 fixed a problem in which database calls were breaking during multiprocessing with 20+ cores. The code was changed to initialize a local session for each database call instead of using a single global session.

This worked well for a re-ingest of the filesystem across 30 cores, so I am closing this issue for now.
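
For reference, the per-call session pattern looks roughly like the sketch below, assuming SQLAlchemy (which the session terminology suggests); the connection string and `insert_record` are placeholders:

    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    # Build the engine and session factory once per process
    engine = create_engine('sqlite:///hstlc.db')  # placeholder connection string
    Session = sessionmaker(bind=engine)

    def insert_record(record):
        """Open a short-lived session for this one call, rather than
        sharing a module-level global session across processes."""
        session = Session()
        try:
            session.add(record)
            session.commit()
        except Exception:
            session.rollback()
            raise
        finally:
            session.close()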