Closed alephnull7 closed 6 hours ago
The use of the mafft wrapper provided by Biopython has been replaced by a subprocess call to mafft. This update makes no performance difference; it was made so that the deprecation of Application does not cause issues in the future.
Yes, that is very sensible! Although Biopython is maintained by a small team of enthusiastic volunteers, it does not have a large developer base. It is a good idea to wrap mafft via a regular subprocess call.
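For readers following along, a subprocess-based call of the kind discussed here could look roughly like the sketch below. It is not the code from this pull request: the helper names build_mafft_cmd and run_mafft, the file arguments, and the choice of the --auto and --thread flags are assumptions made for illustration.

```python
import subprocess


def build_mafft_cmd(in_fasta, num_threads=1, mafft_exe="mafft"):
    """Assemble the argument list for a mafft run (hypothetical helper)."""
    # Options come first, the input FASTA file last.
    return [mafft_exe, "--thread", str(num_threads), "--auto", str(in_fasta)]


def run_mafft(in_fasta, out_fasta, num_threads=1):
    """Align in_fasta with mafft and write the alignment to out_fasta."""
    cmd = build_mafft_cmd(in_fasta, num_threads)
    # mafft prints the alignment to stdout, so stdout is redirected to the file.
    with open(out_fasta, "w") as handle:
        # check=True raises CalledProcessError if mafft exits with an error.
        subprocess.run(cmd, stdout=handle, stderr=subprocess.PIPE,
                       check=True, text=True)
```

Passing the command as a list avoids shell quoting issues with file names, and the only dependency is a mafft executable on the PATH.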
Extraction of GenBank files has been parallelized. Due to the nature of the operations done, multiprocessing was determined to be the best approach. This did require some refactoring, which can be seen in changes to the PlastidData class, the creation of the PlastidDict class, and the creation of helper methods for extracting a sublist of files as well as individual files. From these new helpers, you will also see that multiprocessing occurs on sublists of the overall file list, because the overhead of extracting each file in a separate process makes that approach slower than single-process extraction.
I am not sufficiently familiar with the different types of process parallelization to judge which one is the best option. I fully trust your choice here.
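As a rough illustration of the sublist approach described in the quoted paragraph, the sketch below splits a file list into one sublist per worker and lets each process handle a whole sublist. It is a sketch under stated assumptions, not the code from this pull request: the function names are hypothetical, and extract_file is a placeholder that only returns the file stem instead of parsing a GenBank record.

```python
from multiprocessing import Pool
from pathlib import Path


def split_into_sublists(items, n_chunks):
    """Split items into at most n_chunks contiguous, near-equal sublists."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, remainder = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `remainder` chunks receive one extra item.
        end = start + size + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def extract_file(path):
    """Placeholder for per-file extraction (assumption: returns the stem)."""
    return Path(path).stem


def extract_sublist(paths):
    """Extract every file of one sublist inside a single worker process."""
    return [extract_file(p) for p in paths]


def extract_all(paths, n_workers=2):
    """Process sublists in parallel and flatten the results in input order."""
    chunks = split_into_sublists(paths, n_workers)
    with Pool(n_workers) as pool:
        results = pool.map(extract_sublist, chunks)
    return [item for chunk in results for item in chunk]
```

Each worker is started once and handles many files, so the process start-up and result-transfer costs are paid per sublist rather than per file. When run as a script, extract_all should be called under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.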
Changes that specifically address listed issues
Extraction of GenBank files has been parallelized. Due to the nature of the operations done, multiprocessing was determined to be the best approach. This did require some refactoring, which can be seen in changes to the PlastidData class, the creation of the PlastidDict class, and the creation of helper methods for extracting a sublist of files as well as individual files. From these new helpers, you will also see that multiprocessing occurs on sublists of the overall file list, because the overhead of extracting each file in a separate process makes that approach slower than single-process extraction.
The --order option is available, which sets the concatenation order of the alignments. By default, the ordering is seq (sequence), corresponding to the sequence ordering of the features in the first input genome. Because extraction is now parallelized, for the seq option the first file is extracted before the rest are parallelized, to guarantee the specified order. The other option is alpha (alphabetic), which concatenates the features in alphabetic order.
Other changes
The use of the mafft wrapper provided by Biopython has been replaced by a subprocess call to mafft. This update makes no performance difference; it was made so that the deprecation of Application does not cause issues in the future. The parallelization of mafft was investigated for improvements, but no obvious gains could be found. There may be changes to the parallelization of mafft and other processes that would decrease execution time, but that would probably involve targeting the software for a specific machine configuration.
The export_fasta method is now available to alignm_concat, so that the Nexus file does not have to exist before the FASTA one.
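To make the two --order settings described above more concrete, here is a minimal sketch of how a concatenation order could be derived from the feature names of each genome. It is not taken from this pull request: the function name concatenation_order is hypothetical, and appending features that are missing from the first genome in order of first appearance is an assumption of this sketch.

```python
def concatenation_order(features_per_genome, order="seq"):
    """Return feature names in the order they would be concatenated.

    features_per_genome: list of feature-name lists, one per input genome,
    with the first input genome at index 0.
    """
    if order == "alpha":
        # Alphabetic order over the union of all feature names.
        return sorted({f for feats in features_per_genome for f in feats})
    if order == "seq":
        # Follow the first genome; features absent from it are appended
        # in order of first appearance (an assumption of this sketch).
        ordered = []
        for feats in features_per_genome:
            for f in feats:
                if f not in ordered:
                    ordered.append(f)
        return ordered
    raise ValueError(f"unknown order: {order!r}")


genomes = [["psbA", "matK", "rbcL"], ["rbcL", "atpB", "psbA"]]
seq_order = concatenation_order(genomes, "seq")
alpha_order = concatenation_order(genomes, "alpha")
```

With seq, the result depends on which genome is processed first, which is why the first file has to be extracted before the remaining files are handed to parallel workers.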