Open johnlees opened 3 years ago
This is definitely a bottleneck in the pipeline. It is not trivial to parallelise as the processing of one paralogous family can depend on the result of another.
The algorithm initially attempts to collapse paralogous genes by identifying the nearest neighbour of a paralogous gene within the graph and collapsing them if they are from separate samples. This is tricky to parallelise as you need to be sure that a previous step has not already included a gene from the same sample in that cluster.
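To make the ordering problem concrete, here is a toy sketch of that first stage. It is not the code from the pipeline: it assumes an unweighted graph stored as an adjacency dict, uses a plain breadth-first search in place of the real shortest-path calculation, and the names `nearest_compatible` and `collapse_family` are made up for illustration.

```python
from collections import deque


def nearest_compatible(graph, start, sample_of, cluster_of, members):
    """Return the closest family gene whose cluster shares no sample with start's cluster."""
    start_cluster = cluster_of[start]
    start_samples = {sample_of[g] for g in members[start_cluster]}
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour in seen:
                continue
            seen.add(neighbour)
            queue.append(neighbour)
            if neighbour not in cluster_of:
                continue  # not a member of this paralogous family
            cluster = cluster_of[neighbour]
            if cluster == start_cluster:
                continue  # already collapsed together
            if start_samples.isdisjoint(sample_of[g] for g in members[cluster]):
                return neighbour
    return None


def collapse_family(graph, family, sample_of):
    """Greedily collapse one family; the result depends on the order of `family`."""
    cluster_of = {g: g for g in family}
    members = {g: {g} for g in family}
    for gene in family:
        target = nearest_compatible(graph, gene, sample_of, cluster_of, members)
        if target is None:
            continue  # the real algorithm would fall back to gene context here
        src, dst = cluster_of[gene], cluster_of[target]
        for g in members.pop(src):
            cluster_of[g] = dst
            members[dst].add(g)
    return {frozenset(m) for m in members.values()}


# Toy graph: a1 - c1 - b1 - c2 - a2, where a1 and a2 come from the same sample.
graph = {
    "a1": ["c1"],
    "c1": ["a1", "b1"],
    "b1": ["c1", "c2"],
    "c2": ["b1", "a2"],
    "a2": ["c2"],
}
sample_of = {"a1": "A", "b1": "B", "a2": "A"}

forward = collapse_family(graph, ["a1", "b1", "a2"], sample_of)
backward = collapse_family(graph, ["a2", "b1", "a1"], sample_of)
```

In this toy case `forward` pairs `b1` with `a1` while `backward` pairs it with `a2`, so whichever gene is processed first decides the cluster. That dependence on earlier steps is what a naive parallel loop would break.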
When this approach fails, the algorithm defaults to using gene context. This stage would be much easier to parallelise and is already faster, as it does not rely on calculating shortest paths. I was hoping to experiment with how much the results would change if we only used this approach, but might not get a chance for a couple of months.
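As a rough illustration of why a context-based stage is easier to parallelise, the sketch below groups the copies in a family by their immediate neighbours in the graph. The definition of "context" used here (the exact set of adjacent nodes) and the function names are assumptions made for the example, not a description of what the pipeline actually does.

```python
from collections import defaultdict


def split_by_context(graph, family):
    """Group the copies in one paralogous family by their immediate neighbours."""
    groups = defaultdict(set)
    for gene in family:
        # Hypothetical definition of "context": the exact set of adjacent nodes.
        context = frozenset(graph[gene])
        groups[context].add(gene)
    return {frozenset(g) for g in groups.values()}


def split_all(graph, families):
    """Process every family independently of the others."""
    # Each call only reads the graph and returns its own result, so this
    # loop could be handed to a process pool's map without any coordination.
    return [split_by_context(graph, family) for family in families]


# Toy graph: a1 and b1 sit between the same flanking genes, a2 sits elsewhere.
context_graph = {
    "a1": ["u", "v"],
    "b1": ["u", "v"],
    "a2": ["w"],
    "u": ["a1", "b1"],
    "v": ["a1", "b1"],
    "w": ["a2"],
}
context_result = split_all(context_graph, [["a1", "b1", "a2"]])
```

No shortest paths are computed and no family needs to see another family's result, which matches the two properties mentioned above.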
To keep using the first approach, could you use some shared memory to mark genes which have been included? The shared memory manager in Python 3.8 has made this kind of thing easier to get to work in PopPUNK.
Just a suggestion though, sounds like this was already on your radar!
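For what it's worth, the shared-memory idea could look something like the sketch below. It uses `SharedMemoryManager` and `SharedMemory` from the Python 3.8+ standard library. Everything else is an assumption for illustration: one byte per gene as an "already included" flag, integer gene indices, a single global lock, and the helper names `try_claim` and `worker`. A real version would also need to handle the per-sample check, not just per-gene claims.

```python
from multiprocessing import Lock, Process, Queue
from multiprocessing.managers import SharedMemoryManager
from multiprocessing.shared_memory import SharedMemory


def try_claim(flags, lock, gene_index):
    """Mark a gene as included; return True only for the first caller."""
    with lock:
        if flags[gene_index]:
            return False
        flags[gene_index] = 1
        return True


def worker(shm_name, lock, candidate_genes, worker_id, results):
    """Attach to the shared flags and try to claim each candidate gene."""
    shm = SharedMemory(name=shm_name)
    try:
        won = [g for g in candidate_genes if try_claim(shm.buf, lock, g)]
        results.put((worker_id, won))
    finally:
        shm.close()


if __name__ == "__main__":
    n_genes = 8
    # Two hypothetical workers whose candidate lists overlap on genes 3 and 4.
    jobs = [[0, 1, 2, 3, 4], [3, 4, 5, 6, 7]]
    with SharedMemoryManager() as smm:
        shm = smm.SharedMemory(size=n_genes)
        shm.buf[:n_genes] = bytes(n_genes)  # all flags start as "not included"
        lock = Lock()
        results = Queue()
        procs = [
            Process(target=worker, args=(shm.name, lock, genes, i, results))
            for i, genes in enumerate(jobs)
        ]
        for p in procs:
            p.start()
        collected = [results.get() for _ in procs]
        for p in procs:
            p.join()
        shm.close()
    for worker_id, won in sorted(collected):
        print(worker_id, won)
```

Each gene ends up claimed by exactly one worker, but which worker wins a contested gene depends on scheduling, so results may differ between runs unless the order is constrained some other way.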
Running and finding the `Processing paralogs...` stage quite slow. Looks single threaded in the code – is it possible to do this in a multiprocessing loop, or are there interactions which make this difficult?