gtonkinhill / panaroo

An updated pipeline for pangenome investigation
MIT License

Parallelise collapse_paralogs? #126

Open johnlees opened 3 years ago

johnlees commented 3 years ago

I'm running panaroo and finding the Processing paralogs... stage quite slow:

Processing paralogs...
  2%|██▉                                                    | 92/3815 [1:37:47<153:39:25, 148.58s/it]

It looks single-threaded in the code. Is it possible to run this step in a multiprocessing loop, or are there interactions that make it difficult?

gtonkinhill commented 3 years ago

This is definitely a bottleneck in the pipeline. It is not trivial to parallelise as the processing of one paralogous family can depend on the result of another.

The algorithm initially attempts to collapse paralogous genes by identifying the nearest neighbour of a paralogous gene within the graph and collapsing the two if they come from separate samples. This is tricky to parallelise because you need to be sure that a previous step has not already pulled a gene from the same sample into that cluster.
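
Roughly, the dependency looks like this (a toy sketch with invented node attributes, not the actual panaroo code):

```python
import networkx as nx

def collapse_paralogs_sequential(G):
    """Toy sequential collapse. Each node carries a 'members' set of sample
    IDs and a 'paralog' flag (made-up names for illustration). A paralog is
    merged into its nearest neighbour only if their sample sets are disjoint;
    the merge mutates 'members', so whether the *next* paralog can be merged
    depends on this result."""
    merges = []
    for node in [n for n, d in G.nodes(data=True) if d.get("paralog")]:
        # nearest neighbour by shortest-path distance, excluding the node itself
        lengths = nx.single_source_shortest_path_length(G, node)
        lengths.pop(node, None)
        if not lengths:
            continue
        neighbour = min(lengths, key=lengths.get)
        if G.nodes[node]["members"].isdisjoint(G.nodes[neighbour]["members"]):
            # updating the cluster's sample set is the cross-iteration
            # dependency that makes a naive parallel split unsafe
            G.nodes[neighbour]["members"] |= G.nodes[node]["members"]
            merges.append((node, neighbour))
    return merges
```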

When this approach fails, the algorithm falls back to using gene context. That stage would be much easier to parallelise and is already faster, as it does not rely on calculating shortest paths. I was hoping to experiment with how much the results would change if we only used the context-based approach, but I might not get a chance for a couple of months.
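
If the context-only approach turns out to be good enough, each paralogous family could in principle be handled independently, e.g. with a plain multiprocessing pool (a very rough sketch; split_family_by_context and the data layout are stand-ins, not the real gene-context logic):

```python
from collections import defaultdict
from multiprocessing import Pool

def split_family_by_context(family):
    """Stand-in for the gene-context step: split one paralogous family by the
    identity of its flanking genes, using only information local to the
    family (invented data layout, purely for illustration)."""
    by_context = defaultdict(list)
    for gene in family["genes"]:
        by_context[(gene["left"], gene["right"])].append(gene["id"])
    return family["id"], list(by_context.values())

def process_families(families, n_cpu=4):
    # each family is independent, so this is an embarrassingly parallel map
    with Pool(n_cpu) as pool:
        return dict(pool.map(split_family_by_context, families))
```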

johnlees commented 3 years ago

To keep using the first approach, could you use some shared memory to mark genes that have already been included? The shared memory manager in python3.8 has made this kind of thing much easier to get working in poppunk.
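
Something along these lines, maybe (a rough, hypothetical sketch; the worker and flag layout are made up, and the check-then-set on the flag array isn't atomic, so you'd still want a lock or a per-sample partition of the work):

```python
import numpy as np
from multiprocessing import Pool
from multiprocessing.managers import SharedMemoryManager
from multiprocessing.shared_memory import SharedMemory

def process_paralog(args):
    """Worker: skip genes another process has already pulled into a cluster,
    otherwise mark them as used in the shared flag array."""
    shm_name, n_genes, gene_idx = args
    shm = SharedMemory(name=shm_name)
    used = np.ndarray((n_genes,), dtype=np.uint8, buffer=shm.buf)
    already_used = bool(used[gene_idx])
    if not already_used:
        used[gene_idx] = 1  # NB: check-then-set is not atomic without a lock
    shm.close()
    return gene_idx, already_used

if __name__ == "__main__":
    n_genes = 3815
    with SharedMemoryManager() as smm:
        shm = smm.SharedMemory(size=n_genes)  # one byte flag per gene
        np.ndarray((n_genes,), dtype=np.uint8, buffer=shm.buf)[:] = 0
        with Pool(4) as pool:
            results = pool.map(process_paralog,
                               [(shm.name, n_genes, i) for i in range(n_genes)])
```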

Just a suggestion though, sounds like this was already on your radar!