Closed kylecampbell3 closed 6 years ago
Hi Kyle,
Unfortunately at the current time, MIDAS only produces strain phylogenies based on consensus sequences. As you observe, that means that you will get only 1 leaf per metagenome. These phylogenies will be accurate only if the intra-metagenome diversity is low.
I would suggest performing some followup analysis to assess the
intra-metagenome and inter-metagenome diversity levels. The script
snp_diversity.py
should help in that regard. If the intra-diversity is
very low, you have clonal populations and the phylogenetic tree based on
consensus sequences is likely accurate. Please let me know if you have any
questions.
In the future I may add code to estimate strain phylogenies that take into account multiple strains/species/sample.
Best, Stephen
On Wed, May 3, 2017 at 3:43 PM, kylecampbell3 notifications@github.com wrote:
Hi Stephen,
Thank you for your program!
I am trying to use MIDAS to construct a detailed strain phylogeny for a single species, and am running into a bit of trouble in the final steps. I created a custom database using a pan-genome reference that had previously been assembled and ran run_midas.py species and run_midas.py snps individually for three metagenomic samples (each around 2GB). I merged the snp results for the three samples and then ran call_consensus.py. When I used the consensus.fa file output to construct a phylogenetic tree on FastTree, it produced a tree with only three leaves, treating each metagenomic output as a strain, rather than identifying many different strains within each metagenome. Is this the way that the program is supposed to work, or could settings be adjusted to produce a more detailed phylogeny?
Thanks for your help,
Kyle Campbell
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/snayfach/MIDAS/issues/53, or mute the thread https://github.com/notifications/unsubscribe-auth/ACAbrWWnWlQsYw_esx3imKT-M-HRCsOmks5r2QMLgaJpZM4NQCtz .
Hi Stephen, I am working with Kyle on this project. Thank you for your suggestion! However, when we enter the code, we keep coming across this error:
ubuntu@ip-172-31-61-218:~/MIDAS$ python /home/ubuntu/MIDAS/scripts/snp_diversity.py --indir /home/ubuntu/MIDAS/psb1_snpmerge_out/psb1/ --out /home/ubuntu/MIDAS/snp_diversity
MIDAS: Metagenomic Intra-species Diversity Analysis System version 1.2.1; github.com/snayfach/MIDAS Copyright (C) 2015-2016 Stephen Nayfach Freely distributed under the GNU General Public License (GPLv3)
=============================== ===========Parameters=========== Command: /home/ubuntu/MIDAS/scripts/snp_diversity.py --indir /home/ubuntu/MIDAS/psb1_snpmerge_out/psb1/ --out /home/ubuntu/MIDAS/snp_diversity Script: snp_diversity.py Input directory: /home/ubuntu/MIDAS/psb1_snpmerge_out/psb1/ Output file: /home/ubuntu/MIDAS/snp_diversity Diversity options: genomic_type: genome-wide sample_type: per-sample site_type: ['NC', '1D', '2D', '3D', '4D'] weight_by_depth: False rand_reads: None replace_reads: False rand_samples: None rand_sites: None Sample filters: sample_depth: 0.0 fract_cov: 0.0 max_samples: inf keep_samples: None exclude_samples: None Site filters: site_depth: 2 site_prev: 0.0 site_maf: 0.0 site_ratio: inf site_freq: 0.0 max_sites: inf
Selecting subset of samples...
Estimating diversity metrics...
Traceback (most recent call last):
File "/home/ubuntu/MIDAS/scripts/snp_diversity.py", line 339, in
Having not had much coding experience, we do not know what this means or how to fix it. Do you have any suggestions?
Thanks for reporting the error. I will look into it and get back to you.
We think the ValueError: invalid literal for int() with base 10: 'quiver' is based on our input system where the program is expecting an integer and instead there's string. We removed 'quiver' from our input file and the program ran.
Hi Kyle - I'm posting your questions from email here in case it is useful to others. You wrote:
As you suggested, I ran the followup analysis to assess the intra-metagenome and inter-metagenome diversity of three metagenomic samples using the 'snp_diversity.py' script.
It gave me the following output:
sample_id depth count_sites diversity
psb1out_3 1494503086 7159108 19335.1858363
lizzy2out 144701011 7159108 16136.8075072
lizzy9out 247182331 7159108 27027.809093
Could you explain the meaning of the diversity statistic? Is this the number of unique strains found in each sample, or is there some other method you used to calculate this?
First off, it looks like these are intra-sample diversity estimates. Could you run the script for inter-sample diversity also?
The diversity statistic is the sum of the number of nucleotide substitutions over the total number of sites analyzed. So if you divide by count_sites, you will get the number of substitutions per site. You can read more about this statistic here: https://en.wikipedia.org/wiki/Nucleotide_diversity
After dividing by the number of sites, per-bp diversity ranges from 2.2e-3 to 3.8e-3. In my experience this indicates a relatively low-diversity community that is probably composed of one dominant strain. If you look at figure 4 from my biorxiv paper you can compare your results to other human gut bacteria: http://biorxiv.org/content/early/2015/11/14/031757
I would recommend using the same script to compute between-sample (i.e. inter-sample) diversity. This script will pool the data across your 3 samples (assuming they are the same species) and use this pooled data to compute diversity again. If this number is much higher than the per-sample diversity, then you can conclude that the nucleotide differences between populations are much greater than those within the individual populations. Does this make sense?
Hi Stephen,
Thank you for your program!
I am trying to use MIDAS to construct a detailed strain phylogeny for a single species, and am running into a bit of trouble in the final steps. I created a custom database using a pan-genome reference that had previously been assembled and ran run_midas.py species and run_midas.py snps individually for three metagenomic samples (each around 2GB). I merged the snp results for the three samples and then ran call_consensus.py. When I used the consensus.fa file output to construct a phylogenetic tree on FastTree, it produced a tree with only three leaves, treating each metagenomic output as a strain, rather than identifying many different strains within each metagenome. Is this the way that the program is supposed to work, or could settings be adjusted to produce a more detailed phylogeny?
Thanks for your help,
Kyle Campbell