metagenomic profiling - Githubissues

rx32940 commented 5 years ago

Kraken2 + Bracken are already done.
Results can be found in the Dropbox folder.

Below is a summary of the Kraken2 results:

Sample ID	Classified	Unclassified
R22.K	14.72%	85.28%
R22.L	6.03%	93.97%
R22.S	13.46%	86.54%
R26.K	14.45%	85.55%
R26.L	7.55%	92.45%
R26.S	10.83%	89.17%
R27.K	13.85%	86.15%
R27.L	6.62%	93.38%
R27.S	10.89%	89.11%
R28.K	8.58%	91.42%
R28.L	7.45%	92.55%
R28.S	6.52%	93.48%

Bracken is supposed to provide better accuracy by eliminating false positives. However, the unclassified portion is not re-evaluated.

Relative abundance at the domain level: domain_classification

code to run Bracken on cluster

code (to rerun the code, first delete the non-data sample files)

The classified results are in: /Users/rx32940/Dropbox/5. Rachel's projects/Metagenomic_Analysis/KRAKEN2:BRACKEN/genus/genus_classfication.xlsx

summary table:

Y	>0.05 & <0.1
R	>0.1

	R22.K	R22.L	R22.S	R26.K	R26.L	R26.S	R27.K	R27.L	R27.S	R28.K	R28.L	R28.S
Bacillus					Y	Y	Y	R	Y		Y	Y
Bacteroides			Y
Bordetella								R
Burkholderia		Y			Y		Y
Escherichia	R		R	Y	Y		Y	Y	R	Y	Y	R
Leptospira										R
Mannheimia	R	Y	R	R	R	R	Y		R	Y	R	R
Microcystis			Y
Mycoplasma	Y	R		Y	Y	Y	Y	R	Y		Y	Y
Plantactinospora							Y
Pseudomonas				Y			Y
Streptomyces							Y
Yersinia			Y
Ichnovirus	Y
Pandoravirus	R	R	R	R	R	R	Y	R	R	R	R	R
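
A minimal sketch of how the Y/R flags in the table above could be assigned from relative abundances. The legend here gives R as >0.1, which leaves exactly 0.1 unassigned, so this sketch assumes R means >=0.1 (as in the legend of the later CLARK table). The function name and the example values are hypothetical, not taken from the spreadsheet.

```python
def abundance_flag(rel_abundance, low=0.05, high=0.1):
    """Return 'R', 'Y', or '' for one relative abundance (a fraction of 1).

    Assumption: 'R' is >= high and 'Y' is strictly between low and high.
    """
    if rel_abundance >= high:
        return "R"
    if rel_abundance > low:
        return "Y"
    return ""


# Hypothetical values for illustration only.
example = {"Mannheimia": 0.12, "Bacillus": 0.07, "Yersinia": 0.01}
flags = {genus: abundance_flag(value) for genus, value in example.items()}
```

Genera with an empty flag in every sample would simply be left out of the summary table.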

rx32940 commented 5 years ago

PCA was also run on the relative abundance results from Bracken. However, instead of phylum-level abundance, it uses genus-level abundance, because the virus classifications do not include a phylum level.

In terms of how the samples cluster, the PCA is not very different from the one done by the company. However, the company's analysis only dealt with bacterial abundance, while the analysis below covers both bacteria and viruses.
- Genus level analysis (bacteria + viruses): code for pca analysis: https://github.com/rx32940/Lepto-Metagenomics/blob/5ae384e92b1ec4681cf34e17ebf02ead66f9cc8a/PCA_genus_relative_abundance.R
- phylum level analysis (bacteria only):
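
For reference, a minimal sketch of PCA on a samples x genera table of relative abundances. This is not the linked R script; it is a Python/NumPy illustration of the same idea (center each genus, take the SVD, read off the sample scores and the share of variance per component). The table values and the function name are hypothetical.

```python
import numpy as np


def pca(abundance, n_components=2):
    """PCA via SVD on a samples x genera matrix of relative abundances."""
    X = np.asarray(abundance, dtype=float)
    centered = X - X.mean(axis=0)  # center each genus (column)
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]  # sample coordinates
    explained = S**2 / np.sum(S**2)  # share of variance per component
    return scores, explained


# Hypothetical 4 samples x 3 genera table; each row sums to 1.
table = [
    [0.70, 0.20, 0.10],
    [0.65, 0.25, 0.10],
    [0.20, 0.30, 0.50],
    [0.15, 0.35, 0.50],
]
scores, explained = pca(table)
```

`explained[0] + explained[1]` is the share of variance shown in a PC1 vs PC2 plot, which is the number to check before reading much into the clustering.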

rx32940 commented 5 years ago

Because of the differences in relative abundance between the company's analysis and the Kraken2/Bracken results, I will redo the metagenomic profiling with CLARK/CLARK(s).

CLARK should have sensitivity and accuracy similar to Kraken2.
- both use k-mer-based algorithms.
CLARK(s) uses a discriminative spaced k-mer algorithm, which is supposed to offer higher sensitivity without sacrificing accuracy or precision. See: Higher Classification Accuracy of Short Metagenomic Reads by Discriminative Spaced k-mers

work to do:

- [ ] test out regular CLARK and relative abundance
  - [x] make illustrations of the CLARK results at both genus and domain level
  - [x] compare with Kraken2/Bracken results
  - [ ] do PCA with CLARK abundance and compare with Kraken2/Bracken relative abundance
- [ ] test out CLARK(s) to increase sensitivity
  - [ ] if CLARK(s) works, do the visualization in R and also PCA

rx32940 commented 5 years ago

work to do:

- [ ] test out regular CLARK and relative abundance
  - [x] make illustrations of the CLARK results at both genus and domain level
  - [x] compare with Kraken2/Bracken results
  - [ ] do PCA with CLARK abundance and compare with Kraken2/Bracken relative abundance
- [ ] test out CLARK(s) to increase sensitivity
  - [ ] if CLARK(s) works, do the visualization in R and also PCA

A slightly higher percentage of sequences was classified by CLARK than by Kraken2. (Bracken ignores unclassified sequences and only estimates abundance from the sequences Kraken2 classified.) However, the pattern of classified-sequence percentages is very similar.
Note that the number of classified bacterial sequences increased greatly with the CLARK analysis.
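
To make the point about unclassified reads concrete, here is a small sketch of the two ways a domain-level relative abundance can be computed: over all reads (unclassified kept as its own category) or over classified reads only, which is effectively what a Bracken-style re-estimate reports. The read counts and the `"unclassified"` key are hypothetical.

```python
def relative_abundance(read_counts, include_unclassified=True):
    """Convert read counts per taxon into fractions that sum to 1."""
    counts = dict(read_counts)
    if not include_unclassified:
        counts.pop("unclassified", None)  # renormalize over classified reads
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical read counts for one sample.
reads = {"Bacteria": 10, "Viruses": 5, "unclassified": 85}
with_unknown = relative_abundance(reads)
classified_only = relative_abundance(reads, include_unclassified=False)
```

With these made-up counts, Bacteria is 10% of all reads but about 67% of the classified reads, so the two kinds of plot are not directly comparable.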

Percentage of unclassified sequence

 **CLARK**

include_UNKNOWN Code:CLARK_domain_visual.R

 **Kraken2 w/ BRACKEN relative abundance re-estimated**

domain_classification_unkown Code: Bracken_extraction.R

company result (phylum level: only bacteria)

Viruses do not have a phylum classification level, so I can't make a graph with viruses included for comparison.

Classified sequence comparison

Note that the number of classified bacterial sequences increased greatly with the CLARK analysis.
Archaea abundance is also higher in the CLARK analysis.

CLARK

Kraken2

For the genus-level analysis:

The genera detected at high abundance also differ from those found by Kraken2.

Compared with the Kraken2/Bracken results, the most abundant bacteria are not very similar. See Here for Kraken2/bracken result
Mycoplasma seems to be the only bacterium that shows high abundance in all samples consistently in both analyses.
Virus composition, however, is consistent across the two analyses.

Y	>0.05 & <0.1
R	>=0.1

	R22.K	R22.L	R22.S	R26.K	R26.L	R26.S	R27.K	R27.L	R27.S	R28.K	R28.L	R28.S
Bacteroides			Y
Bordetella								Y
Leptospira										R
Mycoplasma	Y	R	Y	Y	R	R	Y	R	R	Y	R	R
Negativicoccus				R			R
Plantactinospora	R	R	Y	R	R	R	R	R	R		R	Y
Ichnovirus	Y										Y	Y
Pandoravirus	R	Y	Y	Y	Y	Y	Y	Y	R	Y	R	R

Code for genus result

Directory for complete genus composition table:

/Users/rx32940/Dropbox/5. Rachel's projects/Metagenomic_Analysis/CLARK:CLARK(s)/all_samples_CLARK.xlsx

rx32940 commented 5 years ago

work to do:

[x] test out regular CLARK and relative abundance

[x] do illustration with the CLARK results for both genus and domain level

[x] compare with KRAKEN2/BRACKEN results.

[x] do pca analysis with Clark abundance and compare with Kraken2/Bracken relative abundance

PCA with CLARK's genus relative abundance result

pca_genus_plot

The PCA clustering is not only inconsistent with the Kraken2/Bracken and company results; the clustering pattern itself is also unclear.
Refer to this issue to compare with the PCA from the Kraken2/Bracken analysis and the company's analysis: issue
Compared to the Kraken2/Bracken results, CLARK identified significantly more genera. Across all samples, more than 1000 genera went into the PCA, while the Kraken2 results have only ~500 genera.
Maybe the PCA should only include bacterial genera that make up a significant share of the sample composition?
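
One way to try that idea, sketched under assumptions: keep a genus for the PCA only if it reaches some minimum relative abundance in at least one sample. The 1% cutoff, the function name and the example table are hypothetical; the right threshold would need to be chosen from the data.

```python
def filter_genera(table, min_abundance=0.01):
    """Keep genera whose relative abundance reaches min_abundance in any sample.

    table maps genus name -> list of relative abundances, one per sample.
    """
    return {
        genus: values
        for genus, values in table.items()
        if max(values) >= min_abundance
    }


# Hypothetical abundances across three samples.
genus_table = {
    "Mycoplasma": [0.08, 0.15, 0.02],
    "Pandoravirus": [0.12, 0.06, 0.07],
    "RareGenus": [0.0001, 0.0, 0.0003],
}
kept = filter_genera(genus_table)
```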

rx32940 commented 5 years ago

However, the abundance distribution for the CLARK samples is heavily skewed toward 0, with an extremely long and narrow tail. Most of the density sits at 0.

Below is the summary table:

	1st Qu.	Median	Mean	3rd Qu.	Max.
R22.K	0	1.73E-05	0.000896057889784946	0.0001490735	0.136032
R22.L	0	1.93E-05	0.000896051628136201	0.000164474	0.201693
R22.S	9.76E-06	2.93E-05	0.000896056622759857	0.000175762	0.0936325
R26.K	0	1.44E-05	0.000896057411290323	0.000129781	0.2176
R26.L	0	1.69E-05	0.000896055443548387	0.00017325875	0.151673
R26.S	0	1.54E-05	0.000896062451612903	0.000153931	0.173358
R27.K	0	1.13E-05	0.000896057904121864	0.000101596	0.306932
R27.L	0	1.13E-05	0.000896054120967742	0.000158685	0.200023
R27.S	0	1.27E-05	0.000896057788530466	0.000177415	0.125559
R28.K	0	1.32E-05	0.000896054403225806	0.000105624	0.359359
R28.L	0	1.53E-05	0.000896057774193548	0.000142594	0.132216
R28.S	0	1.87E-05	0.00089605263530466	0.000163914	0.138385

sample density for R22.K

code: Code for abundance distribution
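
A sketch of how a per-sample summary with the same columns as the table above could be computed. This is not the linked script; it uses Python's `statistics` module, and the example values are hypothetical. One thing the table itself suggests: the mean is the same (~0.000896) for every sample, which is what you would expect if each sample's abundances sum to 1 over the same set of genera, since the mean is then 1 divided by the number of genera (roughly 1116 here, in line with the ">1000 genera" noted earlier). That reading is an inference from the numbers, not something stated in the source.

```python
import statistics


def abundance_summary(values):
    """Quartiles, mean and max of one sample's relative abundances."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "1st Qu.": q1,
        "Median": median,
        "Mean": statistics.fmean(values),
        "3rd Qu.": q3,
        "Max.": max(values),
    }


# Hypothetical sample: five genera whose abundances sum to 1.
summary = abundance_summary([0.0, 0.0, 0.0, 0.25, 0.75])

# If abundances sum to 1, mean = 1 / number_of_genera.
implied_genera = round(1 / 0.000896057)
```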

rx32940 commented 5 years ago

To reduce the number of variables that go into the PCA, we need to either drop genera below a certain abundance level or combine some of the genera (that is, instead of measuring abundance at the genus level, use the composition at the phylum level).
- With genus-level composition, PC1 and PC2 combined account for only ~40% of the variance in the dataset with the CLARK analysis and ~42% with the Kraken2 analysis. That is not descriptive enough.
  - The range of genera detected by CLARK and Kraken2 is also very different, especially for bacteria. Many genera detected by CLARK were not found in the Kraken2 results, and vice versa.

try:

Because viruses do not have phylum as a classification level, we will keep the genus classifications for viruses. However, CLARK and Kraken2 (only Bracken needs to be rerun to re-estimate the abundance) will be rerun at the phylum level.
Combine these two sets of statistics for the PCA.

TO DO:

[x] redo the CLARK analysis at the phylum level
[x] redo Bracken to estimate abundance at the phylum level
[x] combine the virus genus stats with the phylum stats for both analyses and do PCA for both tools
[x] do PCA for bacterial phyla alone with the results from both tools

rx32940 commented 5 years ago

Bacteria and Archaea Phylum only PCA analysis (Bracken estimated abundance)
Bacteria phylum + Viruses genus PCA analysis (Bracken estimated abundance)

19747304060d4d8e1437379b245880e886176d9c compare to bacteria + genus PCA with Bracken estimation & company's analysis

rx32940 commented 5 years ago

Clark

Bacteria Phylum only PCA analysis
Bacteria and Archaea Phylum only PCA analysis
Bacteria and Archaea phylum + Viruses genus PCA analysis

acb0c57fb8cb677622d7c5d07305b7b27dbf5ce2

rx32940 commented 5 years ago

Relative abundance plot of bacterial phyla (for comparison with the company's result). The legend only shows the top 10 most abundant phyla for each sample; the rest are combined into "others". (This plot was made with the CLARK results.)

code

Relative abundance plot of genera. The legend only shows the top 10 most abundant genera for each sample; the rest are combined into "others".

Top10_genus
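
The "top 10 plus others" collapsing used for these plots can be sketched as follows. This is an illustration in Python rather than the plotting code linked above; the function name, the "Others" label and the example values are hypothetical.

```python
def top_n_with_others(abundance, n=10):
    """Keep the n most abundant taxa and sum the remainder into 'Others'."""
    ranked = sorted(abundance.items(), key=lambda item: item[1], reverse=True)
    result = dict(ranked[:n])
    rest = sum(value for _, value in ranked[n:])
    if rest > 0:
        result["Others"] = rest
    return result


# Hypothetical sample, collapsed to the top 2 taxa for brevity.
sample = {"A": 0.5, "B": 0.3, "C": 0.15, "D": 0.05}
collapsed = top_n_with_others(sample, n=2)
```

Because the top 10 is chosen per sample, the set of named taxa can differ between samples, which is worth keeping in mind when comparing bars side by side.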

for genus code check the composition for the "other" category for each individual sample in the results_phylum and results folders

Note: in the phylum composition, bacterial genera couldn't be classified.

lsalvador commented 4 years ago

Try PCA with the top 10 phyla and genera.

PCA on all genera and phyla does not make sense, because of the large similarities between samples in taxa that contribute little to the composition of each sample.

rx32940 commented 4 years ago

Differential abundance:
- Instead of comparing the composition of the samples as a whole, we want to find biomarker(s) responsible for the disease or symptoms in a specific tissue.
- Thus our goal is to find the differentially abundant taxa (at either phylum or genus level).
  - What we have:
  - relative abundance, because the absolute abundance of reads is a result of technical procedures.

please read: Hawinkel, S., Mattiello, F., Bijnens, L., & Thas, O. (2019). A broken promise: microbiome differential abundance methods do not control the false discovery rate. Briefings in Bioinformatics, 20(1), 210–221. https://doi.org/10.1093/bib/bbx104

Thomas, T., Gilbert, J., & Meyer, F. (2012). Metagenomics - a guide from sampling to data analysis. Microbial Informatics and Experimentation, 2(1), 3. https://doi.org/10.1186/2042-5783-2-3

[ ] Metastats
[ ] Anosim
The company actually ran a differential abundance analysis with Metastats at every taxonomic level. Metastats does pairwise non-parametric t-tests to determine whether any taxa are differentially represented between the samples. Here is the result directory:
```
/project/lslab/lab_shared/leptoData/Metagenomics_Analyzed_Results/04.Taxonomy/MicroNR_stat/MetaStats
```

For example, at the phylum level, these bacterial phyla showed differences in abundance between the kidney and lung samples:

Taxa	mean(group1)	variance(group1)	standard error(group1)	mean(group2)	variance(group2)	standard error(group2)	p value	q value
k__Bacteria;p__candidate division WOR_3	0	0	0	0.0011118	6.00E-07	0.00038728	0.0184	0.04169362
k__Bacteria;p__Tenericutes	3.00E-05	3.05E-09	2.76E-05	0.135513	0.00894899	0.04729955	0.02343333	0.04169362
k__Bacteria;p__Fusobacteria	0	0	0	0.00177724	1.55E-06	0.00062314	0.02443333	0.04169362
k__Bacteria;p__Candidatus Falkowbacteria	0	0	0	0.00057181	1.66E-07	0.00020397	0.0346	0.04169362
k__Bacteria;p__Thermotogae	0	0	0	0.00149152	1.13E-06	0.00053217	0.03963333	0.04169362
k__Bacteria;p__Firmicutes	0.0029014	3.52E-06	0.00093768	0.02034824	0.0001814	0.00673431	0.0477	0.04169362
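
To illustrate what a non-parametric (permutation) t-test between two tissue groups looks like, here is a minimal sketch. It is not Metastats itself and will not reproduce the p-values in the table; it enumerates every relabelling of the pooled samples and compares a Welch-style t statistic against the observed one. The function names and example values are hypothetical.

```python
from itertools import combinations
from math import copysign, inf, sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch-style t statistic; each group needs at least two values."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    diff = mean(a) - mean(b)
    if se == 0:
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / se


def permutation_p_value(group1, group2):
    """Two-sided exact permutation p-value for a difference between groups."""
    pooled = list(group1) + list(group2)
    observed = abs(welch_t(group1, group2))
    hits = total = 0
    for chosen in combinations(range(len(pooled)), len(group1)):
        picked = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in picked]
        total += 1
        if abs(welch_t(a, b)) >= observed - 1e-12:
            hits += 1
    return hits / total


# Hypothetical relative abundances of one phylum in two tissues (4 samples each).
p_value = permutation_p_value([0.0, 0.0, 0.0, 0.0], [1.0, 2.0, 3.0, 4.0])
```

With 4 samples per group there are only 70 possible relabellings, so the smallest two-sided p-value this exact test can return is 2/70 (about 0.029), even for perfectly separated groups. That is one concrete way small sample sizes limit what these tests can show.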

Other illustrations are also available in the directory.
Is the PCA made with only these phyla? (Note: PCA was only done at the phylum and class levels.)
ANOSIM also seems very useful for our analysis.
- Analysis of similarities (ANOSIM) is a non-parametric statistical test. It tests whether we can reject the null hypothesis that the similarity between groups is greater than or equal to the similarity within the groups.
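
A minimal sketch of the ANOSIM R statistic, to show what the test measures: rank all pairwise dissimilarities, then compare the mean rank between groups with the mean rank within groups, scaled by n(n-1)/4. Bray-Curtis is assumed as the dissimilarity here (ANOSIM works with others too), the significance step (permuting group labels) is left out, and the sample values are hypothetical.

```python
from itertools import combinations


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity; assumes the two samples are not both empty."""
    return sum(abs(a - b) for a, b in zip(u, v)) / sum(a + b for a, b in zip(u, v))


def average_ranks(values):
    """Rank values from 1 (smallest), giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def anosim_r(samples, groups):
    """R = (mean between-group rank - mean within-group rank) / (n(n-1)/4)."""
    pairs = list(combinations(range(len(samples)), 2))
    ranks = average_ranks([bray_curtis(samples[i], samples[j]) for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    n = len(samples)
    return (sum(between) / len(between) - sum(within) / len(within)) / (n * (n - 1) / 4)


# Hypothetical counts for two taxa in four samples from two tissues.
samples = [[10, 0], [9, 1], [0, 10], [1, 9]]
r_stat = anosim_r(samples, ["kidney", "kidney", "lung", "lung"])
```

R close to 1 means samples within a group are more similar to each other than to samples from other groups; R near 0 means grouping explains little.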
A sample ANOSIM box plot:

rx32940 commented 4 years ago

In addition to the software described above: https://github.com/rx32940/Lepto-Metagenomics/issues/1#issuecomment-530540958

Differential abundance analysis has also been done with DESeq, see: https://bioconductor.org/packages/devel/bioc/vignettes/phyloseq/inst/doc/phyloseq-mixture-models.html

Note: a small sample size can lead to a high false discovery rate (FDR).

rx32940 commented 4 years ago

To do:
[ ] Metastats (now in metagenomeSeq in R, or through Mothur)
- [ ] PCA at the phylum level
- [ ] PCA at the class level?
- The PCAs done by the company for these two levels are very different, with the class-level PCA having PC1 account for 100% of the variance. See below.
- PCA is probably not a good way to do this analysis.
[ ] Anosim (Here)
[ ] DESeq

PCA with the differentially abundant phyla and classes found by Metastats:

Phylum
Classes

rx32940 commented 4 years ago

Metagenomics analysis book (chapters 3 & 4), available at the science library:

Nagarajan, M. (2018). Metagenomics : perspectives, methods, and applications. Elsevier/Academic Press. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&db=cat06564a&AN=uga.9949158838002959&site=eds-live

Screen Shot 2019-09-12 at 10 24 51 AM

The workflow for metagenomic taxonomic profiling, producing taxonomic abundance output.
We chose direct metagenomic taxonomic profiling for the following reasons: https://github.com/rx32940/Lepto-Metagenomics/issues/1#issuecomment-531870444
Relative abundance shouldn't be used for differential abundance, for the reasons given in this comment: https://github.com/rx32940/Lepto-Metagenomics/issues/1#issuecomment-531872224.

Screen Shot 2019-09-16 at 1 21 18 PM

rx32940 commented 4 years ago

why did I do direct taxonomic classification?

"Direct taxonomic classification is useful for quantitative community profiling and identification of organisms with close relatives in the database. ... more qualitative understanding of the physiology of the uncultivated microbes. By identifying single-copy and conserved genes in the contig bins, taxonomy, genome completeness, as well as contamination, can be assessed. "

Breitwieser, F. P., Lu, J., & Salzberg, S. L. (n.d.). A review of methods and databases for metagenomic classification and assembly. https://doi.org/10.1093/bib/bbx120

rx32940 commented 4 years ago

Is relative abundance the right way to do metagenomic analysis?

We shouldn't use relative abundance for downstream metagenomic analyses after taxonomic profiling, for example differential abundance. Doing so can not only result in high false positive rates when identifying differentially abundant taxa, but also create irrelevant annotations for the data.
Please read these two articles:

What is wrong with correlating relative abundance? Everything!

McMurdie, P. J., & Holmes, S. (2014). Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible. PLoS Computational Biology, 10(4), e1003531. https://doi.org/10.1371/journal.pcbi.1003531
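
A toy illustration of why relative abundance can mislead, under made-up numbers: if only one taxon changes in absolute terms, every other taxon's relative abundance shifts too, simply because the fractions must sum to 1. The taxa names and counts are hypothetical.

```python
def to_relative(counts):
    """Convert absolute counts per taxon into fractions that sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical absolute abundances: only taxon C differs between conditions.
condition_1 = {"A": 100, "B": 100, "C": 100}
condition_2 = {"A": 100, "B": 100, "C": 1000}

rel_1 = to_relative(condition_1)
rel_2 = to_relative(condition_2)
# A and B are unchanged in absolute terms, yet their relative abundance
# falls from 1/3 to 1/12 because C grew.
```

A test run on `rel_1` vs `rel_2` could flag A and B as "decreased" even though nothing happened to them, which is the kind of false positive the references above warn about.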

rx32940 commented 4 years ago

Try to use MG-RAST pipeline for analysis. link to new issue: https://github.com/rx32940/Lepto-Metagenomics/issues/2

rx32940 commented 4 years ago

Alpha Diversity statistics

Alpha diversity: species diversity in sites or habitats at a local scale.
Explanation of the different indices:
http://www.evolution.unibas.ch/walser/bacteria_community_analysis/2015-02-10_MBM_tutorial_combined.pdf
https://entnemdept.ifas.ufl.edu/hodges/ProtectUs/lp_webfolder/9_12_grade/Student_Handout_1A.pdf
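
For orientation, a small sketch of three common alpha diversity measures computed from per-genus counts: richness (number of genera present), the Shannon index, and Pielou's evenness (Shannon divided by the log of richness). These are assumed as typical choices; the plots in this thread may have used other indices. The counts are hypothetical.

```python
from math import log


def alpha_diversity(counts):
    """Richness, Shannon index (natural log) and Pielou evenness for one sample."""
    present = [c for c in counts if c > 0]
    total = sum(present)
    richness = len(present)
    shannon = -sum((c / total) * log(c / total) for c in present)
    evenness = shannon / log(richness) if richness > 1 else 0.0
    return {"richness": richness, "shannon": shannon, "evenness": evenness}


# Hypothetical samples with the same richness but different evenness.
even_sample = alpha_diversity([10, 10, 10, 10])
skewed_sample = alpha_diversity([97, 1, 1, 1, 0])
```

Because the Shannon index is bounded by the log of richness, a tool that reports more taxa can push these values up regardless of the underlying community.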

These plots are based on absolute abundance results from Clark with genus level taxa

From the pattern of alpha diversity within each sample, we can tell that neither the richness nor the evenness shows a clear, consistent pattern across the tissue types.
Only four genera were found to be differentially abundant across the three tissues.
Differentially abundant genera across the 4 subjects that were sampled

rx32940 commented 4 years ago

These plots are based on absolute abundance results from the Bracken-adjusted Kraken2 results with genus-level taxa.

Alpha_bracken

Differentially abundant genera across the three tissues
Differentially abundant genera across the 4 subjects that were sampled

rx32940 commented 4 years ago

https://github.com/rx32940/Lepto-Metagenomics/issues/1#issuecomment-533662980
https://github.com/rx32940/Lepto-Metagenomics/issues/1#issuecomment-535519502
code to generate last two trends in this issue

rx32940 commented 4 years ago

The differences in alpha diversity between the two tools are caused by the number of taxa identified.
For the Bracken-estimated Kraken2 results, the number of taxa identified was lower than the number identified by CLARK.

This paper discusses the evaluation of metagenomics methods: https://www.nature.com/articles/s41598-018-30515-5

This paper also covers the performance of Bracken: https://www.cell.com/cell/pdf/S0092-8674(19)30775-5.pdf
