Open rx32940 opened 5 years ago
PCA was also done with the relative abundance results from Bracken. However, instead of phylum-level abundance, genus-level abundance was used, because the virus classifications do not include a phylum level.
The PCA is not very different from the one done by the company in terms of sample clustering relationships. However, the company's analysis only dealt with bacterial abundance, while the analysis below deals with both bacteria and viruses.
Genus level analysis (bacteria + viruses): code for pca analysis: https://github.com/rx32940/Lepto-Metagenomics/blob/5ae384e92b1ec4681cf34e17ebf02ead66f9cc8a/PCA_genus_relative_abundance.R
phylum level analysis (bacteria only):
Because of the differences in relative abundance between the company's analysis and the Kraken2/Bracken results, I will do the metagenomic profiling again with CLARK/CLARK(s).
CLARK should have sensitivity and accuracy similar to Kraken2.
- Both use k-mer based algorithms.
- CLARK(s) uses discriminative spaced k-mers, which are supposed to offer higher sensitivity without sacrificing accuracy and precision. See: Higher Classification Accuracy of Short Metagenomic Reads by Discriminative Spaced k-mers
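To make the spaced k-mer idea concrete, here is a minimal sketch in Python. It only illustrates the general concept (a mask decides which positions of a window are compared), not CLARK(s)'s actual implementation. The mask `10101` and the toy reads are made up for illustration and are not the seeds CLARK(s) uses.

```python
def spaced_kmers(read, mask):
    """Yield one spaced k-mer per window of the read.

    mask is a string of '1' (position is kept) and '0' (position is ignored).
    """
    span = len(mask)
    keep = [i for i, m in enumerate(mask) if m == "1"]
    for start in range(len(read) - span + 1):
        window = read[start:start + span]
        yield "".join(window[i] for i in keep)


# Hypothetical toy mask (span 5, 3 kept positions); NOT a real CLARK(s) seed.
mask = "10101"
kmers = list(spaced_kmers("ACGTACG", mask))

# A mismatch at an ignored position ('C' -> 'T' at index 1) still yields
# the same first spaced k-mer, which is where the extra sensitivity comes from.
first_with_mismatch = next(spaced_kmers("ATGTACG", mask))
```

With a contiguous k-mer the mismatch above would break the match, while the spaced version still matches on the kept positions.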
work to do:
- [ ] test out regular CLARK and relative abundance
  - [x] illustrate the CLARK results at both genus and domain level
  - [x] compare with Kraken2/Bracken results
  - [ ] do PCA with CLARK abundance and compare with Kraken2/Bracken relative abundance
- [ ] test out CLARK(s) to increase sensitivity
  - [ ] if CLARK(s) works, do visualization in R and also PCA
CLARK
Kraken2 w/ BRACKEN relative abundance re-estimated code
company result (phylum level: only bacteria)
A slightly higher percentage of sequences was classified by CLARK than by Kraken2. (Bracken ignores unclassified sequences and only estimates abundance from the sequences Kraken2 classified.) However, the pattern of classified-sequence percentages is very similar.
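As a small illustration of why this matters when comparing tools, the sketch below computes relative abundance once over classified reads only (the way Bracken reports it, per the note above) and once over all reads. The counts are made up and the function name is hypothetical.

```python
def relative_abundance(counts, unclassified=0):
    """Relative abundance per taxon.

    With unclassified=0 the denominator is the classified reads only;
    pass the unclassified read count to use all reads instead.
    """
    total = sum(counts.values()) + unclassified
    return {taxon: n / total for taxon, n in counts.items()}


# Made-up read counts for one sample (illustration only).
counts = {"Bacteria": 600, "Viruses": 300, "Archaea": 100}

classified_only = relative_abundance(counts)
over_all_reads = relative_abundance(counts, unclassified=1000)
```

In this toy sample, Bacteria is 60% of the classified reads but only 30% of all reads, so the two kinds of percentages should not be compared directly.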
Percentage of unclassified sequence
**CLARK**
**Kraken2 w/ BRACKEN relative abundance re-estimated**
company result (phylum level: only bacteria)
Viruses do not have a phylum classification level, so I can't make a graph with viruses included for comparison.
Classified sequence comparison
Notice that the classified bacterial sequences greatly increased with the CLARK analysis.
Archaea abundance is also higher in the CLARK analysis.
CLARK
Kraken2
For genus level analysis:
The genera detected at high abundance also differ from those found by Kraken2.
Compared with the Kraken2/Bracken results, the most abundant bacteria are not very similar. See Here for the Kraken2/Bracken result.
Mycoplasma seems to be the only bacterial genus that shows high abundance in all samples consistently across both analyses.
Virus composition, however, is consistent between the two analyses.
Code | Relative abundance |
---|---|
Y | >0.05 & <0.1 |
R | >=0.1 |
Genus | R22.K | R22.L | R22.S | R26.K | R26.L | R26.S | R27.K | R27.L | R27.S | R28.K | R28.L | R28.S |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Bacteroides | Y | |||||||||||
Bordetella | Y | |||||||||||
Leptospira | R | |||||||||||
Mycoplasma | Y | R | Y | Y | R | R | Y | R | R | Y | R | R |
Negativicoccus | R | R | ||||||||||
Plantactinospora | R | R | Y | R | R | R | R | R | R | R | Y | |
Ichnovirus | Y | Y | Y | |||||||||
Pandoravirus | R | Y | Y | Y | Y | Y | Y | Y | R | Y | R | R |
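The Y/R coding used in the table can be written as a small function. This is a sketch of the legend only (Y for abundance above 0.05 and below 0.1, R for 0.1 and above); the function name is hypothetical and the test values are made up, except 0.136032, which is the R22.K maximum from the CLARK summary.

```python
def abundance_code(rel_abundance):
    """Map a relative abundance to the legend code used in the table."""
    if rel_abundance >= 0.1:
        return "R"
    if 0.05 < rel_abundance < 0.1:
        return "Y"
    return ""  # at or below 0.05: cell left blank


codes = [abundance_code(x) for x in (0.136032, 0.07, 0.05, 0.1)]
```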
Directory for complete genus composition table:
/Users/rx32940/Dropbox/5. Rachel's projects/Metagenomic_Analysis/CLARK:CLARK(s)/all_samples_CLARK.xlsx
work to do:
- [x] test out regular CLARK and relative abundance
  - [x] illustrate the CLARK results at both genus and domain level
  - [x] compare with Kraken2/Bracken results
  - [x] do PCA with CLARK abundance and compare with Kraken2/Bracken relative abundance
The PCA clustering is not only inconsistent with the Kraken2/Bracken and company results; the clustering pattern itself is also unclear.
Refer to this issue to compare with the PCA from the Kraken2/Bracken analysis and the company's analysis: issue
Compared with the Kraken2/Bracken result, CLARK identified substantially more genera. Across all samples, more than 1000 genera were taken into consideration in the PCA, while the Kraken2 result has only ~500 genera.
Maybe only the bacterial genera that make up a substantial share of the sample composition should be included in the PCA?
The abundance distribution for the CLARK samples is heavily skewed toward 0, with an extremely long and narrow tail. Most of the density is at 0.
Sample | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|---|
R22.K | 0 | 0 | 1.73E-05 | 0.000896057889784946 | 0.0001490735 | 0.136032 |
R22.L | 0 | 0 | 1.93E-05 | 0.000896051628136201 | 0.000164474 | 0.201693 |
R22.S | 0 | 9.76E-06 | 2.93E-05 | 0.000896056622759857 | 0.000175762 | 0.0936325 |
R26.K | 0 | 0 | 1.44E-05 | 0.000896057411290323 | 0.000129781 | 0.2176 |
R26.L | 0 | 0 | 1.69E-05 | 0.000896055443548387 | 0.00017325875 | 0.151673 |
R26.S | 0 | 0 | 1.54E-05 | 0.000896062451612903 | 0.000153931 | 0.173358 |
R27.K | 0 | 0 | 1.13E-05 | 0.000896057904121864 | 0.000101596 | 0.306932 |
R27.L | 0 | 0 | 1.13E-05 | 0.000896054120967742 | 0.000158685 | 0.200023 |
R27.S | 0 | 0 | 1.27E-05 | 0.000896057788530466 | 0.000177415 | 0.125559 |
R28.K | 0 | 0 | 1.32E-05 | 0.000896054403225806 | 0.000105624 | 0.359359 |
R28.L | 0 | 0 | 1.53E-05 | 0.000896057774193548 | 0.000142594 | 0.132216 |
R28.S | 0 | 0 | 1.87E-05 | 0.00089605263530466 | 0.000163914 | 0.138385 |
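One way to act on the idea of keeping only the more abundant genera before PCA is sketched below. The threshold of 0.01, the function names, and the toy table are all assumptions for illustration, not values chosen in this analysis. As a side check, the mean of about 0.000896 in every row is what one would expect if relative abundances sum to about 1 over roughly 1116 genera, which fits the "> 1000 genera" noted above.

```python
def filter_genera(table, min_max_abundance=0.01):
    """Keep genera whose abundance reaches the threshold in at least one sample."""
    return {g: v for g, v in table.items() if max(v) >= min_max_abundance}


def top_n_genera(table, n):
    """Names of the n genera with the highest mean abundance across samples."""
    by_mean = sorted(table, key=lambda g: sum(table[g]) / len(table[g]), reverse=True)
    return by_mean[:n]


# Toy genus -> per-sample relative abundance table (made-up numbers).
table = {
    "Mycoplasma": [0.07, 0.12, 0.06],
    "Pandoravirus": [0.11, 0.06, 0.07],
    "RareGenusA": [0.0, 1.7e-05, 0.0],
    "RareGenusB": [0.0001, 0.0, 0.0002],
}

kept = filter_genera(table)
top1 = top_n_genera(table, 1)

# Mean for R22.K taken from the summary table; 1 / mean estimates the genus count.
implied_genera = round(1 / 0.000896057889784946)
```

Filtering this way drops the long tail of near-zero genera, which otherwise contributes many nearly constant columns to the PCA input.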
try:
TO DO:
19747304060d4d8e1437379b245880e886176d9c compare to bacteria + genus PCA with Bracken estimation & company's analysis
Clark
Bacteria Phylum only PCA analysis
Bacteria and Archaea Phylum only PCA analysis
Bacteria and Archaea phylum + Viruses genus PCA analysis
acb0c57fb8cb677622d7c5d07305b7b27dbf5ce2
For the genus code, check the composition of the "other" category for each individual sample in the results_phylum and results folders.
Try PCA with the top 10 phyla and genera.
please read: Hawinkel, S., Mattiello, F., Bijnens, L., & Thas, O. (2019). A broken promise: microbiome differential abundance methods do not control the false discovery rate. Briefings in Bioinformatics, 20(1), 210–221. https://doi.org/10.1093/bib/bbx104
Thomas, T., Gilbert, J., & Meyer, F. (2012). Metagenomics - a guide from sampling to data analysis. Microbial Informatics and Experimentation, 2(1), 3. https://doi.org/10.1186/2042-5783-2-3
- [ ] Metastats
- [ ] Anosim
The company actually did differential abundance analysis with Metastats at all taxonomic levels. Here is the result directory:
/project/lslab/lab_shared/leptoData/Metagenomics_Analyzed_Results/04.Taxonomy/MicroNR_stat/MetaStats
For example, at the phylum level, between the kidney and lung samples, these bacterial phyla show differences in abundance:
Taxa | mean(group1) | variance(group1) | standard error(group1) | mean(group2) | variance(group2) | standard error(group2) | p value | q value |
---|---|---|---|---|---|---|---|---|
`k__Bacteria;p__candidate division WOR_3` | 0 | 0 | 0 | 0.0011118 | 6.00E-07 | 0.00038728 | 0.0184 | 0.04169362 |
`k__Bacteria;p__Tenericutes` | 3.00E-05 | 3.05E-09 | 2.76E-05 | 0.135513 | 0.00894899 | 0.04729955 | 0.02343333 | 0.04169362 |
`k__Bacteria;p__Fusobacteria` | 0 | 0 | 0 | 0.00177724 | 1.55E-06 | 0.00062314 | 0.02443333 | 0.04169362 |
`k__Bacteria;p__Candidatus Falkowbacteria` | 0 | 0 | 0 | 0.00057181 | 1.66E-07 | 0.00020397 | 0.0346 | 0.04169362 |
`k__Bacteria;p__Thermotogae` | 0 | 0 | 0 | 0.00149152 | 1.13E-06 | 0.00053217 | 0.03963333 | 0.04169362 |
`k__Bacteria;p__Firmicutes` | 0.0029014 | 3.52E-06 | 0.00093768 | 0.02034824 | 0.0001814 | 0.00673431 | 0.0477 | 0.04169362 |
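As a sanity check on how to read the table, the standard error columns appear to be sqrt(variance / n) with n = 4. The n = 4 is my assumption (four subjects per tissue: R22, R26, R27, R28), not something stated in the company's output. The variances and standard errors below are copied from the table above.

```python
import math


def standard_error(variance, n):
    """Standard error of the mean from a sample variance and group size."""
    return math.sqrt(variance / n)


N_PER_GROUP = 4  # assumption: four subjects per tissue group

# (variance, reported standard error) for group2, copied from the table.
rows = {
    "candidate division WOR_3": (6.00e-07, 0.00038728),
    "Tenericutes": (0.00894899, 0.04729955),
    "Firmicutes": (0.0001814, 0.00673431),
}

differences = {
    taxon: abs(standard_error(var, N_PER_GROUP) - reported)
    for taxon, (var, reported) in rows.items()
}
```

The recomputed values agree with the reported ones to within rounding, which supports reading each group as four samples.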
Other illustrations are also available in the directory.
Is the PCA made with only these phyla? (Note: PCA was only done at the phylum and class levels.)
ANOSIM analysis also seems very useful for our analysis.
a sample anosim box plot:
In addition to the software described above: https://github.com/rx32940/Lepto-Metagenomics/issues/1#issuecomment-530540958
Differential abundance analysis has even been done with DESeq, see below: https://bioconductor.org/packages/devel/bioc/vignettes/phyloseq/inst/doc/phyloseq-mixture-models.html
To do:
- [ ] Metastats (now in metagenomeSeq in R or through Mothur)
- [ ] Anosim (Here)
- [ ] DESeq
PCA with the differentially abundant phyla and classes found by Metastats:
Phylum
Classes
Metagenomics analysis book (chapter 3& 4) available at sci lib:
why did I do direct taxonomic classification?
"Direct taxonomic classification is useful for quantitative community profiling and identification of organisms with close relatives in the database. ... more qualitative understanding of the physiology of the uncultivated microbes. By identifying single-copy and conserved genes in the contig bins, taxonomy, genome completeness, as well as contamination, can be assessed. "
Breitwieser, F. P., Lu, J., & Salzberg, S. L. (n.d.). A review of methods and databases for metagenomic classification and assembly. https://doi.org/10.1093/bib/bbx120
Is relative abundance the right way to do metagenomic analysis?
We shouldn't use relative abundance for downstream metagenomic analyses after taxonomic profiling, for example differential abundance analysis. This could not only result in high false-positive rates when identifying differentially abundant taxa, but also create irrelevant annotations for the data.
Please read these two articles:
What is wrong with correlating relative abundance? Everything!
McMurdie, P. J., & Holmes, S. (2014). Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible. PLoS Computational Biology, 10(4), e1003531. https://doi.org/10.1371/journal.pcbi.1003531
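A toy example of the problem, with made-up counts: taxon A has the same absolute count in both samples, but because taxon B grows, A's relative abundance halves. A method that compares relative abundances would flag A as changed even though it did not.

```python
def to_relative(counts):
    """Convert absolute counts to relative abundances that sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


# Made-up absolute counts: only taxon B differs between the two samples.
sample1 = {"A": 100, "B": 100}
sample2 = {"A": 100, "B": 300}

rel1 = to_relative(sample1)
rel2 = to_relative(sample2)
```

Here `rel1["A"]` is 0.5 and `rel2["A"]` is 0.25, although the count of A is 100 in both samples.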
Try to use MG-RAST pipeline for analysis. link to new issue: https://github.com/rx32940/Lepto-Metagenomics/issues/2
Alpha Diversity statistics
These plots are based on absolute abundance results from Clark with genus level taxa
From the pattern of alpha diversity within each sample, we can tell that neither the richness nor the evenness shows a clear, consistent pattern across the tissue types.
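For reference, here is a minimal sketch of the quantities being discussed, using the standard definitions (richness as the number of observed genera, Shannon index, and Pielou's evenness). These are not necessarily the exact measures or the tool used for the plots, and the counts are made up.

```python
import math


def alpha_diversity(counts):
    """Return (richness, Shannon index, Pielou evenness) for one sample."""
    present = [c for c in counts if c > 0]
    total = sum(present)
    richness = len(present)
    shannon = -sum((c / total) * math.log(c / total) for c in present)
    # Pielou's evenness: Shannon index divided by its maximum, ln(richness).
    evenness = shannon / math.log(richness) if richness > 1 else 0.0
    return richness, shannon, evenness


# Made-up genus counts: one perfectly even sample, one dominated by a single genus.
even_sample = alpha_diversity([10, 10, 10, 10])
skewed_sample = alpha_diversity([97, 1, 1, 1, 0])
```

Both toy samples have a richness of 4, but the skewed one has a much lower evenness, which is why the two measures are reported separately.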
Only four genera were found to be differentially abundant across the three tissues.
Differentially abundant genera across the 4 subjects sampled:
These plots are based on absolute abundance results from Bracken adjusted Kraken2 results with genus-level taxa
Differentially abundant genera across the three tissues:
Differentially abundant genera across the 4 subjects sampled:
This paper evaluates metagenomics methods: https://www.nature.com/articles/s41598-018-30515-5
This paper also covers the performance of Bracken: https://www.cell.com/cell/pdf/S0092-8674(19)30775-5.pdf
Below is the summary of Kraken2 result:
Bracken is supposed to provide better accuracy by eliminating false positives. However, the unclassified portion is not re-evaluated.
relative abundance on domain level:
code to run Bracken on cluster
code (to rerun the code, the non-data sample files need to be deleted first)
/Users/rx32940/Dropbox/5. Rachel's projects/Metagenomic_Analysis/KRAKEN2:BRACKEN/genus/genus_classfication.xlsx
summary table: