snayfach / MIDAS

An integrated pipeline for estimating strain-level genomic variation from metagenomic data
http://dx.doi.org/10.1101/gr.201863.115
GNU General Public License v3.0
124 stars 52 forks source link

Issue with strain phylogeny #53

Closed kylecampbell3 closed 6 years ago

kylecampbell3 commented 7 years ago

Hi Stephen,

Thank you for your program!

I am trying to use MIDAS to construct a detailed strain phylogeny for a single species, and am running into a bit of trouble in the final steps. I created a custom database using a pan-genome reference that had previously been assembled and ran run_midas.py species and run_midas.py snps individually for three metagenomic samples (each around 2GB). I merged the snp results for the three samples and then ran call_consensus.py. When I used the consensus.fa file output to construct a phylogenetic tree on FastTree, it produced a tree with only three leaves, treating each metagenomic output as a strain, rather than identifying many different strains within each metagenome. Is this the way that the program is supposed to work, or could settings be adjusted to produce a more detailed phylogeny?

Thanks for your help,

Kyle Campbell

snayfach commented 7 years ago

Hi Kyle,

Unfortunately at the current time, MIDAS only produces strain phylogenies based on consensus sequences. As you observe, that means that you will get only 1 leaf per metagenome. These phylogenies will be accurate only if the intra-metagenome diversity is low.

I would suggest performing some followup analysis to assess the intra-metagenome and inter-metagenome diversity levels. The script snp_diversity.py should help in that regard. If the intra-diversity is very low, you have clonal populations and the phylogenetic tree based on consensus sequences is likely accurate. Please let me know if you have any questions.

In the future I may add code to estimate strain phylogenies that take into account multiple strains/species/sample.

Best, Stephen

On Wed, May 3, 2017 at 3:43 PM, kylecampbell3 notifications@github.com wrote:

Hi Stephen,

Thank you for your program!

I am trying to use MIDAS to construct a detailed strain phylogeny for a single species, and am running into a bit of trouble in the final steps. I created a custom database using a pan-genome reference that had previously been assembled and ran run_midas.py species and run_midas.py snps individually for three metagenomic samples (each around 2GB). I merged the snp results for the three samples and then ran call_consensus.py. When I used the consensus.fa file output to construct a phylogenetic tree on FastTree, it produced a tree with only three leaves, treating each metagenomic output as a strain, rather than identifying many different strains within each metagenome. Is this the way that the program is supposed to work, or could settings be adjusted to produce a more detailed phylogeny?

Thanks for your help,

Kyle Campbell

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/snayfach/MIDAS/issues/53, or mute the thread https://github.com/notifications/unsubscribe-auth/ACAbrWWnWlQsYw_esx3imKT-M-HRCsOmks5r2QMLgaJpZM4NQCtz .

marianne-LotterJones commented 7 years ago

Hi Stephen, I am working with Kyle on this project. Thank you for your suggestion! However, when we enter the code, we keep coming across this error:

ubuntu@ip-172-31-61-218:~/MIDAS$ python /home/ubuntu/MIDAS/scripts/snp_diversity.py --indir /home/ubuntu/MIDAS/psb1_snpmerge_out/psb1/ --out /home/ubuntu/MIDAS/snp_diversity

MIDAS: Metagenomic Intra-species Diversity Analysis System version 1.2.1; github.com/snayfach/MIDAS Copyright (C) 2015-2016 Stephen Nayfach Freely distributed under the GNU General Public License (GPLv3)

=============================== ===========Parameters=========== Command: /home/ubuntu/MIDAS/scripts/snp_diversity.py --indir /home/ubuntu/MIDAS/psb1_snpmerge_out/psb1/ --out /home/ubuntu/MIDAS/snp_diversity Script: snp_diversity.py Input directory: /home/ubuntu/MIDAS/psb1_snpmerge_out/psb1/ Output file: /home/ubuntu/MIDAS/snp_diversity Diversity options: genomic_type: genome-wide sample_type: per-sample site_type: ['NC', '1D', '2D', '3D', '4D'] weight_by_depth: False rand_reads: None replace_reads: False rand_samples: None rand_sites: None Sample filters: sample_depth: 0.0 fract_cov: 0.0 max_samples: inf keep_samples: None exclude_samples: None Site filters: site_depth: 2 site_prev: 0.0 site_maf: 0.0 site_ratio: inf site_freq: 0.0 max_sites: inf

Selecting subset of samples... Estimating diversity metrics... Traceback (most recent call last): File "/home/ubuntu/MIDAS/scripts/snp_diversity.py", line 339, in pi = compute_pi(args, samples) File "/home/ubuntu/MIDAS/scripts/snp_diversity.py", line 215, in compute_pi for site in diversity.parse_sites(args['indir'], samples): # loop over sites File "/home/ubuntu/MIDAS/midas/analyze/diversity.py", line 144, in parse_sites site = GenomicSite(files, samples, info) File "/home/ubuntu/MIDAS/midas/analyze/diversity.py", line 19, in init self.ref_id, self.ref_pos, self.ref_allele = self.parse_id() File "/home/ubuntu/MIDAS/midas/analyze/diversity.py", line 34, in parse_id ref_pos = int(self.id.split('|')[1]) ValueError: invalid literal for int() with base 10: 'quiver'

Having not had much coding experience, we do not know what this means or how to fix it. Do you have any suggestions?

snayfach commented 7 years ago

Thanks for reporting the error. I will look into it and get back to you.

marianne-LotterJones commented 7 years ago

We think the ValueError: invalid literal for int() with base 10: 'quiver' is based on our input system where the program is expecting an integer and instead there's string. We removed 'quiver' from our input file and the program ran.

snayfach commented 7 years ago

Hi Kyle - I'm posting your questions from email here in case it is useful to others. You wrote:

As you suggested, I ran the followup analysis to assess the intra-metagenome and inter-metagenome diversity of three metagenomic samples using the 'snp_diversity.py' script.   
It gave me the following output:   

sample_id     depth               count_sites     diversity
psb1out_3    1494503086     7159108         19335.1858363
lizzy2out       144701011       7159108         16136.8075072
lizzy9out       247182331       7159108         27027.809093

Could you explain the meaning of the diversity statistic?  Is this the number of unique strains found in each sample, or is there some other method you used to calculate this? 

First off, it looks like these are intra-sample diversity estimates. Could you run the script for inter-sample diversity also?

The diversity statistic is the sum of the number of nucleotide substitutions over the total number of sites analyzed. So if you divide by count_sites, you will get the number of substitutions per site. You can read more about this statistic here: https://en.wikipedia.org/wiki/Nucleotide_diversity

After dividing by the number of sites, per-bp diversity ranges from 2.2e-3 to 3.8e-3. In my experience this indicates a relatively low-diversity community that is probably composed of one dominant strain. If you look at figure 4 from my biorxiv paper you can compare your results to other human gut bacteria: http://biorxiv.org/content/early/2015/11/14/031757

I would recommend using the same script to compute between-sample (i.e. inter-sample) diversity. This script will pool the data across your 3 samples (assuming they are the same species) and use this pooled data to compute diversity again. If this number is much higher than the per-sample diversity, then you can conclude that the nucleotide differences between populations are much greater than those within the individual populations. Does this make sense?