Parsnp tree file does not have distances

Brahmandam-Gayatri commented 2 years ago

Hello, thank you for parsnp and Harvest, it's been a joy running the tool using ~900 genomes.

But when it comes to >1000 genomes, parsnip runs with the following warning but does not generate distances for the tree. Request to look into this issue and suggest a solution. ###################################################################### (parsnp) bash-4.2$ parsnp -r! -d /home/pbl/Desktop/Gayatri_B/genomes/fna_files --out parsnp_results --vcf -p 8 -c 12:42:31 - INFO - |--Parsnp 1.7.4--|

Ref ! 13:02:19 - INFO -

SETTINGS: |-refgenome: autopick |-genomes:
/home/pbl/Desktop/Gayatri_B/genomes/fna_files/GAM263BFi.fna /home/pbl/Desktop/Gayatri_B/genomes/fna_files/GAM260Bi.fna ...1924 more file(s)... /home/pbl/Desktop/Gayatri_B/genomes/fna_files/unplaced_1.fna /home/pbl/Desktop/Gayatri_B/genomes/fna_files/unplaced_2.fna |-aligner: muscle |-outdir: parsnp_results |-OS: Linux |-threads: 8

13:02:19 - INFO - <> 13:02:19 - INFO - No genbank file provided for reference annotations, skipping.. 13:02:41 - INFO - Running Parsnp multi-MUM search and libMUSCLE aligner... 13:54:00 - INFO - Reconstructing core genome phylogeny... 13:54:00 - WARNING - Not enough SNPs to use RaxML. Attempting to use FastTree instead... 13:54:00 - INFO - Aligned 1928 genomes in 1.19 hours 13:54:00 - INFO - Parsnp finished! All output available in parsnp_results ############################################################ This is how the parsnp.tree file looks: ('GCF_022925675.1_ASM2292567v1_genomic.fna':0.0,'GAM263BFi.fna':0.0,'GAM260Bi.fna':0.0,'GAM260ASi.fna':0.0,'GAM254Ai.fna':0.0,'GAM249T.fna':0.0,'GAM246Ai.fna':0.0,'GAM245Ai.fna':0.0,'GAM239Bi.fna':0.0,'GAM231Ai.fna':0.0,'GAM210Bi.fna':0.0,'GAM121Aii.fna':0.0,'GAM120Ai.fna':0.0,'GAM119Bi.fna':0.0,'GAM118Bi.fna':0.0,'GAM117Ai.fna':0.0,'GAM115Ai.fna':0.0,'GAM112Ai.fna':0.0,'GAM105Ai.fna':0.0,'GAM103Bi.fna':0.0,'GAM101Biv.fna':0.0,'GAM100Ai.fna':0.0,'GCF_000274085.2_ASM27408v2_genomic.fna':0.0,'GCF_000476855.1_SA221C_genomic.fna':0.0,'GCF_000476435.1_SA222C_genomic.fna':0.0,'GCF_000600045.1_ASM60004v1_genomic.fna':0.0,'GCF_000273945.1_ASM27394v1_genomic.fna':0.0,'GCF_000273965.1_ASM27396v1_genomic.fna':0.0,'GCF_000273985.1_ASM27398v1_genomic.fna':0.0,'GCF_000274005.2_ASM27400v2_genomic.fna':0.0,'GCF_000274025.1_ASM27402v1_genomic.fna':0.0,'GCF_000274045.2_ASM27404v2_genomic.fna':0.0,'GCF_000274065.1_ASM27406v1_genomic.fna':

################################################################

Thanks, Gayatri Brahmandam.

bkille commented 1 year ago

Hi Gayatri,

My apologies for the late reply. The presence or absence of distances would be due to the FastTree/RaxML tools. If you would like, you can run parsnp with the --skip-phylogeny flag, and then use the resulting parsnp.snps.mblocks file in the output folder to construct the phylogeny.

You are also welcome to share the parsnp.snps.mblocks with me if you would like help debugging the issue!

Best, Bryce

Brahmandam-Gayatri commented 1 year ago

Hello Bryce,

Thank you for taking the time to reply to my query. I have run the above command I posted initially, now including the --skip-phylogeny flag, and did obtain the parsnp.snps.mblocks file. I am attaching the g-drive link as the file format is not supported for uploading here as an attachment, for your reference and consideration. The file does not seem to have anything other than the filenames as the header. Any suggestion on how to proceed with the file for constructing the phylogeny would be highly appreciated.

g-drive link:

https://drive.google.com/file/d/1qEmLuqu4PhCeLrnhal51Hl9HruOFXqcF/view?usp=sharing

Thanks, Gayatri Brahmandam.

bkille commented 1 year ago

@Brahmandam-Gayatri

Thanks for sending that over. The structure of the mblocks file is essentially a fasta file, where the sequence names are the filenames, and the sequence entries are the core SNPs. As you mentioned, there don't seem to be any core SNPs for the sequences.

I'd suggest taking a look at the parsnpAligner.log file in the output directory to see how much of each genome is being covered. You are welcome to share that log file with me as well.

-Bryce

Brahmandam-Gayatri commented 1 year ago

Thank you for taking a look at the parsnp.snps.mblocks file. I am attaching the g-drive link for the entire folder of the analysis for your reference.

g-drive link:

https://drive.google.com/drive/folders/1yNpKwvYtCrZzOHnMhqzKL4EJs4OdBPys

Thanks again, Gayatri Brahmandam.

bkille commented 1 year ago

@Brahmandam-Gayatri

Thanks for sharing. Looking through the output, it looks like there is barely any cluster coverage, i.e. Parsnp is not able to locate a core genome. This can be due to a couple of factors, namely (1) the quality of input sequences and (2) the divergence between the input sequences.

The second case is typically caught by Parsnp's divergence filter, which leads me to believe the issue is (1). A common culprit is the number of ambiguous nucleotides in each genome (Ns). Looking at the GC/AT content of the input sequences compared to their total length, it does look like some sequences have a significant number of ambiguous nucleotides.

As it stands right now, Parsnp does not handle ambiguous nucleotides very well for large input sets, which is something I'm currently working to improve. What I would suggest for the time-being is filtering out input sequences with a significant portion of ambiguous nucleotides.

Best, Bryce

Brahmandam-Gayatri commented 1 year ago

I will try and implement your suggestion and run parsnp on the genomes again, while also waiting for the new release with the add-in you mentioned. The only thing that baffled me at that time was, parsnp worked on less than 1000 genomes and I obtained the tree file, but once it crossed the threshold of 1000 genomes, I started getting the issue I raised here. If the ambiguous bases are a probable cause, how is parsnp able to handle approximately 900 genomes and not more than 1000 genomes? Could it be related to Muscle algorithm too? I am asking this because Parsnp worked like a charm on my trial genomes, hence I want to be able to make my complete dataset get analysed using parsnp if possible. Thanks a lot for your time and consideration of my query.

Thanks, Gayatri Brahmandam.

bkille commented 10 months ago

Hi @Brahmandam-Gayatri,

What is likely the case is that there are genomes in the 1000+ dataset which are not in the 900 dataset that are causing issues with the alignment. While it is normal to see a decay in the size of the reported core genome as the number of genomes increases, the exact behavior of that decay varies from dataset to dataset.

That being said, the new version of Parsnp (2.0) has a --partition flag which breaks the input up into independent runs of parsnp then combines the output. This seems to alleviate the issue you are experiencing, so feel free to give it a try!

Brahmandam-Gayatri commented 8 months ago

Hello Bryce,

Thank you so much for looking into my query. The project I was working on was completed. But I will definitely update Parsnp to the latest version and try out the --partition flag with any new dataset that requires Phylogenomic analyses. I am closing this issue and thanks again for your time and consideration.

Regards,

Gayatri Brahmandam.

marbl / parsnp

Parsnp tree file does not have distances #121