michaelgruenstaeudl / PlastomeBurstAndAlign

Extracting and aligning genes, introns, and intergenic spacers across plastid genomes using associative arrays
BSD 3-Clause "New" or "Revised" License
0 stars, 5 forks

Addition of parallelization to multiple procedures & user-defined concatenation order #21

Closed alephnull7 closed 6 hours ago

alephnull7 commented 4 days ago

Changes that specifically address listed issues

Other changes

michaelgruenstaeudl commented 6 hours ago

The mafft wrapper provided by Biopython has been replaced by a subprocess call to mafft. This update makes no difference to performance; it was made to prevent the deprecation of Application from causing issues in the future.

Yes, that is very sensible! Biopython is maintained by a small team of enthusiastic volunteers and does not have a large developer base, so it is a good idea to wrap mafft via a regular subprocess call.
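
For readers following along, a subprocess-based call to mafft might look roughly like the sketch below. This is not the code from the PR: the function names build_mafft_command and run_mafft are hypothetical, and the choice of the --auto and --thread options is an assumption made for illustration. mafft writes the alignment to standard output, so the sketch redirects stdout to the output file.

```python
import subprocess


def build_mafft_command(input_fasta, num_threads=1):
    """Assemble the mafft argument list (hypothetical helper, not from the PR)."""
    # --auto lets mafft pick a strategy; --thread sets the number of threads.
    return ["mafft", "--auto", "--thread", str(num_threads), str(input_fasta)]


def run_mafft(input_fasta, output_fasta, num_threads=1):
    """Run mafft via subprocess and write the alignment to output_fasta.

    Assumes the mafft executable is available on PATH.
    """
    cmd = build_mafft_command(input_fasta, num_threads)
    with open(output_fasta, "w") as out_handle:
        # check=True raises CalledProcessError if mafft exits with an error.
        subprocess.run(
            cmd,
            stdout=out_handle,
            stderr=subprocess.PIPE,
            check=True,
            text=True,
        )
```

Passing the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.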

michaelgruenstaeudl commented 6 hours ago

Extraction of GenBank files has been parallelized. Due to the nature of the operations involved, multiprocessing was determined to be the best approach. This required some refactoring, which can be seen in the changes to the PlastidData class, the creation of the PlastidDict class, and the creation of helper methods for extracting a sublist of files as well as individual files. From these new helpers, you will also see that multiprocessing operates on sublists of the overall file list, because the overhead of extracting each file in a separate process makes that approach slower than single-process extraction.

I am not sufficiently familiar with the different types of process parallelization to judge which one is the best option. I fully trust your choice here.
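
As a rough illustration of the sublist approach described in the quoted text, the sketch below splits a file list into one chunk per worker process and hands each chunk to a multiprocessing.Pool. It is a minimal sketch under stated assumptions, not the PR's implementation: split_into_sublists, extract_sublist, and extract_parallel are hypothetical names, and the worker only returns file stems as a stand-in for the actual GenBank parsing.

```python
from multiprocessing import Pool
from pathlib import Path


def split_into_sublists(items, num_sublists):
    """Split items into at most num_sublists contiguous, near-equal chunks."""
    num_sublists = max(1, min(num_sublists, len(items)))
    size, remainder = divmod(len(items), num_sublists)
    sublists, start = [], 0
    for i in range(num_sublists):
        # The first `remainder` chunks receive one extra item.
        end = start + size + (1 if i < remainder else 0)
        sublists.append(items[start:end])
        start = end
    return sublists


def extract_sublist(files):
    """Stand-in worker: the real code would parse each GenBank file here."""
    return [Path(f).stem for f in files]


def extract_parallel(files, num_processes=2):
    """Process one sublist per worker instead of one file per task."""
    sublists = split_into_sublists(files, num_processes)
    with Pool(processes=len(sublists)) as pool:
        results = pool.map(extract_sublist, sublists)
    # Flatten the per-sublist results back into a single list.
    return [record for sublist in results for record in sublist]
```

Handing each worker a whole sublist means the cost of starting a process and passing data between processes is paid once per chunk rather than once per file. On platforms that start workers with the spawn method, extract_parallel should be called from under an if __name__ == "__main__": guard.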