gtonkinhill / panaroo

An updated pipeline for pangenome investigation
MIT License

Using PRANK causes thousands of tmp fastas to be written in working directory #204

Closed (luke-dt closed 1 year ago)

luke-dt commented 1 year ago

I ran into some odd behavior when running panaroo with PRANK as the aligner. Annoyingly, it dumps a lot of tmp files in the working directory during the "generating core genome MSAs..." step of the pipeline.

Specifically, it dumps a bunch of *.best.fas files in the working directory and not in the aligned_gene_sequences directory:

$ ll panaroo_bakta/
total 148M
-rw-rw-r--. 1 ldiorio-toth 2.4K Oct 19 16:17 accB.best.fas
-rw-rw-r--. 1 ldiorio-toth 6.8K Oct 19 16:17 accC.best.fas
-rw-rw-r--. 1 ldiorio-toth 8.3K Oct 19 16:37 aceF.best.fas
-rw-rw-r--. 1 ldiorio-toth 1.3K Oct 19 17:20 acpP.best.fas
-rw-rw-r--. 1 ldiorio-toth 9.7K Oct 19 16:04 acs.best.fas
-rw-rw-r--. 1 ldiorio-toth 1.5K Oct 19 18:44 acyP.best.fas
-rw-rw-r--. 1 ldiorio-toth 3.0K Oct 19 16:12 adk.best.fas
-rw-rw-r--. 1 ldiorio-toth 7.7K Oct 19 16:30 ahpA.best.fas
-rw-rw-r--. 1 ldiorio-toth 6.2K Oct 19 16:00 alaC.best.fas
drwxrwsr-x. 2 ldiorio-toth    0 Oct 19 14:55 aligned_gene_sequences/
-rw-rw-r--. 1 ldiorio-toth 4.3K Oct 19 18:28 allE.best.fas
-rw-rw-r--. 1 ldiorio-toth 7.2K Oct 19 16:28 amiB.best.fas
-rw-rw-r--. 1 ldiorio-toth 6.1K Oct 19 17:11 amtB.best.fas
-rw-rw-r--. 1 ldiorio-toth 7.1K Oct 19 17:16 ansP.best.fas

...

-rw-rw-r--. 1 ldiorio-toth 7.0K Oct 19 18:05 ylaK.best.fas
-rw-rw-r--. 1 ldiorio-toth 5.8K Oct 19 18:30 yliI.best.fas
-rw-rw-r--. 1 ldiorio-toth 4.9K Oct 19 18:53 yobV.best.fas
-rw-rw-r--. 1 ldiorio-toth 2.1K Oct 19 18:17 yohJ.best.fas
-rw-rw-r--. 1 ldiorio-toth 1.2K Oct 19 17:39 yozG.best.fas
-rw-rw-r--. 1 ldiorio-toth 3.3K Oct 19 17:58 ypfH.best.fas
-rw-rw-r--. 1 ldiorio-toth 4.5K Oct 19 16:01 ypfJ.best.fas
-rw-rw-r--. 1 ldiorio-toth 4.1K Oct 19 17:45 ypjD.best.fas
-rw-rw-r--. 1 ldiorio-toth 3.2K Oct 19 16:14 yqfA.best.fas
-rw-rw-r--. 1 ldiorio-toth 1.7K Oct 19 17:47 yqfO.best.fas
-rw-rw-r--. 1 ldiorio-toth 2.4K Oct 19 16:31 yqiB.best.fas
-rw-rw-r--. 1 ldiorio-toth 2.0K Oct 19 17:11 yqjE.best.fas
-rw-rw-r--. 1 ldiorio-toth 4.1K Oct 19 16:14 yqjQ.best.fas
-rw-rw-r--. 1 ldiorio-toth 5.0K Oct 19 18:38 yrpB.best.fas
-rw-rw-r--. 1 ldiorio-toth 7.4K Oct 19 16:55 zwf.best.fas
$ ll panaroo_bakta/*.best.fas | wc -l
1151

However, when using MAFFT as the aligner this behavior goes away:

$ ll panaroo_bakta_mafft/
total 142M
drwxrwsr-x. 2 ldiorio-toth 2.0K Oct 19 18:53 aligned_gene_sequences/
-rw-rw-r--. 1 ldiorio-toth  33M Oct 19 16:48 combined_DNA_CDS.fasta
-rw-rw-r--. 1 ldiorio-toth 4.3M Oct 19 16:33 combined_protein_cdhit_out.txt
-rw-rw-r--. 1 ldiorio-toth 1.1M Oct 19 16:32 combined_protein_cdhit_out.txt.clstr
-rw-rw-r--. 1 ldiorio-toth  11M Oct 19 16:48 combined_protein_CDS.fasta
-rw-rw-r--. 1 ldiorio-toth  16M Oct 19 16:49 final_graph.gml
-rw-rw-r--. 1 ldiorio-toth  45M Oct 19 16:48 gene_data.csv
-rw-rw-r--. 1 ldiorio-toth 668K Oct 19 16:48 gene_presence_absence.csv
-rw-rw-r--. 1 ldiorio-toth 879K Oct 19 16:48 gene_presence_absence_roary.csv
-rw-rw-r--. 1 ldiorio-toth 121K Oct 19 16:48 gene_presence_absence.Rtab
-rw-rw-r--. 1 ldiorio-toth 6.5M Oct 19 16:49 pan_genome_reference.fa
-rw-rw-r--. 1 ldiorio-toth  26M Oct 19 16:33 pre_filt_graph.gml
-rw-rw-r--. 1 ldiorio-toth  19K Oct 19 16:48 struct_presence_absence.Rtab
-rw-rw-r--. 1 ldiorio-toth  198 Oct 19 16:48 summary_statistics.txt
drwx--S---. 2 ldiorio-toth 3.2K Oct 19 18:53 tmp6vy528w3/

Is this a bug, or is there some way to set a temporary directory?
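
For anyone stuck on an affected version, one stopgap is to tidy up after the run. The sketch below is not part of Panaroo: the function name `collect_stray_alignments` and the destination folder `prank_tmp` are made up for illustration, and it assumes the run has already finished and the top-level `*.best.fas` files are no longer needed in place. It only moves files sitting directly in the output directory, so `aligned_gene_sequences/` is left alone.

```python
import shutil
from pathlib import Path


def collect_stray_alignments(out_dir, pattern="*.best.fas", dest_name="prank_tmp"):
    """Move top-level files matching `pattern` in `out_dir` into a subdirectory.

    `dest_name` is a hypothetical folder name chosen for this sketch.
    Returns the sorted list of file names that were moved.
    """
    out_dir = Path(out_dir)
    dest = out_dir / dest_name
    # glob (not rglob) only looks at the top level of out_dir,
    # so files inside aligned_gene_sequences/ are not touched.
    stray = sorted(p for p in out_dir.glob(pattern) if p.is_file())
    if not stray:
        return []
    dest.mkdir(exist_ok=True)
    for path in stray:
        shutil.move(str(path), str(dest / path.name))
    return [p.name for p in stray]
```

Calling `collect_stray_alignments("panaroo_bakta")` would, under those assumptions, gather the stray files into `panaroo_bakta/prank_tmp/`, where they can be inspected or deleted in one go.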

gtonkinhill commented 1 year ago

Hi Luke,

Thanks for flagging this. I am working on improving the alignment functionality of Panaroo at the moment and will try and take a look at this early next week.

gtonkinhill commented 1 year ago

Hi Luke,

This should hopefully be fixed in the latest commit to the development branch.

The update also includes new options to perform codon alignment and generates a filtered core genome alignment that we have found produces a more reliable phylogeny.

Once we have done some further testing we will create a new release. In the meantime you can install the updated version with:

pip install git+https://github.com/gtonkinhill/panaroo@devel

gtonkinhill commented 1 year ago

This has now been included in v1.3.2
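
To check whether a given environment already has the release mentioned above, a small version comparison can help. This is a rough sketch rather than anything shipped with Panaroo: the helper names are invented here, the parser only understands plain dotted numbers (so a pre-release such as `1.3.2.dev0` is treated like `1.3.2`), and the lookup assumes the package is installed under the distribution name `panaroo`.

```python
from importlib import metadata


def parse_version(text):
    """Turn 'v1.3.2' or '1.3.2' into (1, 3, 2); stops at the first non-numeric part."""
    parts = []
    for piece in text.lstrip("vV").split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def has_fix(installed, required="1.3.2"):
    """True if `installed` is at least `required` (v1.3.2 is the release named in this thread)."""
    return parse_version(installed) >= parse_version(required)


def installed_panaroo_version():
    """Return the installed version string, or None if not found.

    Assumes the distribution is registered as 'panaroo'.
    """
    try:
        return metadata.version("panaroo")
    except metadata.PackageNotFoundError:
        return None
```

A typical use would be `v = installed_panaroo_version()` followed by `has_fix(v)` when `v` is not `None`.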