iqbal-lab-org / make_prg

Code to create a PRG from a Multiple Sequence Alignment file
Other
21 stars 7 forks source link

Global variables, multiprocessing and pickling #45

Closed leoisl closed 1 year ago

leoisl commented 1 year ago

Hey @mbhall88 , I finally remembered the reason for me to use global variables for the CLI options and for the update_shared_data variable, which holds a DenovoVariantsDB object, that represents all denovo variants for a run, a super heavy object. The reason was all about performance. When we do multiprocessing in python, one of the steps is pickling the input args on the parent process and unpickling them later on the child process to run the specified function. The issue is that pickling/unpickling involves disk operations and is slow.

However, I think removing the CLI options from global is the right thing to do. We are already pickling/unpickling a variable when running the multiprocessing part. This options variable is very light, and is probably fast to pickle/unpickle, although we have to do it thousands of times.

On the other hand, I don't think the update_shared_data variable should pickled/unpickled everytime we run an update. This data structure is huge and will definitely kill performance. We will spend most of our time pickling/unpickling this object. The trick here was to set this variable as global, so it would exist in the parent process, and would be copied to the child processes when the fork happened when multiprocessing. Although this does not seem we gain efficiency, because we would still copy this heavy data structure to the child process, linux (and I hope MacOS) implement the Copy-on-write optimisation: it just copies the pages from the parent to the child process if the child makes a write to the page. In our case, the update_shared_data variable is read-only from the child, i.e. the child does not change anything, so in our case it is never going to copy the page from the parent to the child process, allowing for effective memory sharing between the parent and the child. I actually had this issue ~1.5y ago: https://github.com/iqbal-lab/pandora1_paper/issues/252#issuecomment-813302128 , which is when I chose to move to the global variable approach. My mistake was not to document this.

So, in summary, the idea of multiprocessing in python is to gain performance, but if we have input/output args that are huge, we have to pickle/unpickle them, and we lose all the performance we wanted to gain. If the child processes never change the input args, then we can efficiently share the args in memory using global variables. Global variables are bad though, so we should use a singleton approach. I think to be sure we should measure the performance on the 20-way using the current code, which will pickle/unpickle the heavy data structure, and then try the singleton approach...

mbhall88 commented 1 year ago

Ahhhh I actually had a thought that this might be the case. Okay, will have to add the the global back in.

leoisl commented 1 year ago

Don't want to benchmark first to be sure this is actually the case?

mbhall88 commented 1 year ago

Nah, all good. Unless you really want to. Everything works now, so I'm happy

leoisl commented 1 year ago

Reopening to remember me to move update_shared_data to a singleton object in https://github.com/iqbal-lab-org/make_prg/pull/42

leoisl commented 1 year ago

Closed via https://github.com/iqbal-lab-org/make_prg/pull/46