Closed — bourque closed this issue 9 years ago
I created the branch `multiprocessing` to work on this. Per commit 812de50a2351c983efd05ff3ef1ccb6bc5013655, I added parallel processing of files to ingest. However, this really messes up the logging, as the log messages from individual files get all interleaved. I'm not sure what to do about this at the moment.
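One common way to untangle interleaved logs from multiple processes is to funnel every worker's records through a single queue to one listener in the main process. This is only a sketch using the standard library's `QueueHandler`/`QueueListener`; the `process_file` function and file names are placeholders, not the pipeline's real code:

```python
import logging
import multiprocessing as mp
from logging.handlers import QueueHandler, QueueListener

def init_worker(log_queue):
    """Route each worker's log records to the shared queue."""
    root = logging.getLogger()
    root.handlers = [QueueHandler(log_queue)]
    root.setLevel(logging.INFO)

def process_file(filename):
    # Placeholder for the real per-file ingest step.
    logging.info('Ingesting %s', filename)
    return filename

def main(filenames):
    log_queue = mp.Queue()
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter('%(processName)s %(message)s'))
    # One listener drains the queue, so records are written one at a time.
    listener = QueueListener(log_queue, handler)
    listener.start()
    try:
        with mp.Pool(2, initializer=init_worker, initargs=(log_queue,)) as pool:
            results = pool.map(process_file, filenames)
    finally:
        listener.stop()
    return results

if __name__ == '__main__':
    main(['a.fits', 'b.fits'])
```

Because a single `QueueListener` does all the writing, records from different workers can no longer interleave mid-line.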
Another issue that is cropping up from multiprocessing is the creation of directories, specifically:
```python
if not os.path.exists(directory):
    logging.info('\tCreating directory {}'.format(directory))
    os.mkdir(directory)
    set_permissions(directory)
```
When two separate processes try to make the same directory at the same time, one process can pass the `if` check after the other has already created the directory, so `os.mkdir` raises an `OSError`. I found a Stack Overflow solution that I'm going to try out.
EDIT: This seems to have fixed things:
```python
if not os.path.exists(directory):
    try:
        logging.info('\tCreating directory {}'.format(directory))
        os.mkdir(directory)
        set_permissions(directory)
    except OSError:
        pass
```
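One caveat with the bare `except OSError: pass` is that it also hides real failures such as permission errors. A slightly stricter sketch only tolerates the "already exists" case; `make_directory` is a hypothetical helper here, and the project's `set_permissions` call is omitted since it's project-specific:

```python
import errno
import logging
import os

def make_directory(directory):
    """Create a directory, tolerating a concurrent process winning the race."""
    try:
        logging.info('\tCreating directory {}'.format(directory))
        os.mkdir(directory)
    except OSError as error:
        # Only swallow "directory already exists"; re-raise anything
        # else (e.g. permission denied) instead of passing silently.
        if error.errno != errno.EEXIST:
            raise
```

On Python 3.2+, `os.makedirs(directory, exist_ok=True)` achieves the same thing in one call.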
Commit ef8fe49a84bf1855d8387699243a308801559d81 introduces multiprocessing of composite lightcurve creation.
Merge request #13 implements multiprocessing in the pipeline. @justincely, perhaps you could take a look, make sure I'm not doing anything crazy, and merge if it's OK.
Pull request #14 fixed a problem in which calls to the database were breaking during multiprocessing with 20+ cores. The appropriate code was changed to initialize local sessions for each database call instead of using a global session.
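The per-call local-session pattern might look like the following sketch, assuming SQLAlchemy is the database layer (the in-memory engine URL, `Output` table, and `insert_record` helper are all hypothetical stand-ins, not the pipeline's actual schema):

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Output(Base):
    __tablename__ = 'outputs'  # hypothetical table
    id = Column(Integer, primary_key=True)
    filename = Column(String)

engine = create_engine('sqlite://')  # placeholder URL
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

def insert_record(filename):
    """Open a short-lived session for this one call, then close it.

    Each call (and therefore each worker process) gets its own
    session instead of sharing one global session across processes.
    """
    session = Session()
    try:
        session.add(Output(filename=filename))
        session.commit()
    finally:
        session.close()
```

Sharing a single global session (and its underlying connection) across forked workers is what tends to break once the process count climbs, so creating and closing a session per call sidesteps that.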
This worked well for a re-ingest of the filesystem across 30 cores, so I am closing this issue for now.
Implementing some parallel processing in the `ingest_hstlc` pipeline using the `multiprocessing` library could speed things up considerably.
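The basic fan-out described above can be sketched with `multiprocessing.Pool`; here `process_file` and `ingest` are placeholder names for the real per-file ingest step, not the pipeline's actual functions:

```python
import multiprocessing as mp

def process_file(filename):
    # Placeholder for the real ingestion of one file.
    return filename.upper()

def ingest(filenames, processes=4):
    """Fan the file list out over a pool of worker processes."""
    with mp.Pool(processes) as pool:
        # map() preserves input order in its results.
        return pool.map(process_file, filenames)
```

Since each file is ingested independently, the work is embarrassingly parallel and `Pool.map` is usually enough.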