openvax / pyensembl

Python interface to access reference genome features (such as genes, transcripts, and exons) from Ensembl
Apache License 2.0

How to run pyensembl using multiple threads? #251

Open damianosmel opened 3 years ago

damianosmel commented 3 years ago

Dear pyensembl team,

First, thank you again for developing pyensembl :)

In my application, I have a class that uses pyensembl extensively. I initialize this class as follows:

from pyensembl import EnsemblRelease

class AnnotateVariants:
    def __init__(self, ...):  # other constructor arguments elided
        self.ensembl_data = EnsemblRelease(75)
        ...

I need to allow multiple threads to use the pyensembl object, so that I can call its functions to annotate variants in parallel. For internal reasons, I use the multiprocessing.dummy library, so I work with threads and not processes.

In my current implementation, I assign each thread its own new instance of the AnnotateVariants class. However, the log file shows that the threads do not all run in parallel. For example, if I start with a pool of 16 threads, 5 of them run in parallel while the others wait. Then the next subgroup of threads runs, and so on.
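For reference, the setup described above can be sketched roughly as follows. This is a minimal, hypothetical reconstruction and not my actual code: `StubAnnotator` stands in for `AnnotateVariants` so that the sketch runs without pyensembl or its data, and `annotate`, `get_annotator`, `annotate_variant`, `run`, and the variant strings are made-up names.

```python
import threading
from multiprocessing.dummy import Pool  # thread-based Pool with the multiprocessing API


class StubAnnotator:
    """Hypothetical stand-in for AnnotateVariants.

    The real class would hold EnsemblRelease(75) in self.ensembl_data.
    """

    def __init__(self):
        # Remember which thread created this instance.
        self.owner = threading.current_thread().name

    def annotate(self, variant):
        # Placeholder for the real pyensembl lookups.
        return (variant, self.owner, threading.current_thread().name)


_local = threading.local()


def get_annotator():
    # Create one annotator per thread, the first time that thread needs it.
    if not hasattr(_local, "annotator"):
        _local.annotator = StubAnnotator()
    return _local.annotator


def annotate_variant(variant):
    return get_annotator().annotate(variant)


def run(variants, n_threads=16):
    # Pool.map preserves the order of the input list.
    with Pool(n_threads) as pool:
        return pool.map(annotate_variant, variants)


if __name__ == "__main__":
    for variant, owner, worker in run([f"variant_{i}" for i in range(32)]):
        # owner and worker should match: each thread only uses its own instance.
        print(variant, owner, worker)
```

Printing the creating thread next to the executing thread is one way to confirm from the log that no annotator instance is shared between threads.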

Is this related to the pyensembl constructor (EnsemblRelease)? As far as I can see, the constructor returns the same Ensembl release instance if it is already cached (docs).

If this is true, then the same connection to the SQLite 3 database is given to all the threads in the pool, so the threads are in a race condition. That is my interpretation of what I observe. Please let me know what you think.
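I have not confirmed whether pyensembl actually shares one connection across threads. As a general illustration only, using the standard-library `sqlite3` module and not pyensembl's internals, the sketch below lets each worker thread open its own connection to the same database file. The table layout, gene IDs, and function names (`make_db`, `lookup_chunk`, `parallel_lookup`) are all hypothetical.

```python
import os
import sqlite3
import tempfile
from contextlib import closing
from multiprocessing.dummy import Pool  # thread-based Pool


def make_db(path):
    # Build a tiny example database (hypothetical schema, not pyensembl's).
    with closing(sqlite3.connect(path)) as conn:
        conn.execute("CREATE TABLE gene (gene_id TEXT PRIMARY KEY, name TEXT)")
        conn.executemany(
            "INSERT INTO gene VALUES (?, ?)",
            [("G1", "alpha"), ("G2", "beta")],
        )
        conn.commit()


def lookup_chunk(args):
    path, gene_ids = args
    # Each task opens and closes its own connection, so no connection
    # object is ever shared between threads.
    with closing(sqlite3.connect(path)) as conn:
        found = []
        for gene_id in gene_ids:
            row = conn.execute(
                "SELECT name FROM gene WHERE gene_id = ?", (gene_id,)
            ).fetchone()
            found.append((gene_id, row[0] if row else None))
        return found


def parallel_lookup(path, gene_ids, n_threads=4):
    # Split the IDs into one chunk per thread.
    chunks = [(path, gene_ids[i::n_threads]) for i in range(n_threads)]
    with Pool(n_threads) as pool:
        parts = pool.map(lookup_chunk, chunks)
    return {gene_id: name for part in parts for gene_id, name in part}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "example.db")
        make_db(db_path)
        print(parallel_lookup(db_path, ["G1", "G2", "G3"]))
```

By default, a `sqlite3` connection refuses to be used from a thread other than the one that created it (`check_same_thread=True`), which is why the sketch opens a connection inside each worker.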

Second, please advise me on the recommended way to run pyensembl fully in parallel.

My pyensembl version is 1.9.0.

Thanks a lot!

damianosmel commented 3 years ago

Dear developer team,

Is there any update on this question?

Thank you, Damianos