rx32940 / Lepto-Metagenomics

metagenomic profiling #1

Open rx32940 opened 5 years ago

rx32940 commented 5 years ago

Below is a summary of the Kraken2 results:

| Sample ID | Classified | Unclassified |
|---|---|---|
| R22.K | 14.72% | 85.28% |
| R22.L | 6.03% | 93.97% |
| R22.S | 13.46% | 86.54% |
| R26.K | 14.45% | 85.55% |
| R26.L | 7.55% | 92.45% |
| R26.S | 10.83% | 89.17% |
| R27.K | 13.85% | 86.15% |
| R27.L | 6.62% | 93.38% |
| R27.S | 10.89% | 89.11% |
| R28.K | 8.58% | 91.42% |
| R28.L | 7.45% | 92.55% |
| R28.S | 6.52% | 93.48% |

Bracken is supposed to provide better accuracy by eliminating false positives. However, it does not re-evaluate the unclassified portion.

relative abundance on domain level: domain_classification

code to run Bracken on cluster

code (to rerun it, first delete the non-data sample files)

summary table:

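Y: abundance > 0.05 and < 0.1
R: abundance > 0.1

| Genus | R22.K | R22.L | R22.S | R26.K | R26.L | R26.S | R27.K | R27.L | R27.S | R28.K | R28.L | R28.S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bacillus | | | | | Y | Y | Y | R | Y | | Y | Y |
| Bacteroides | | | Y | | | | | | | | | |
| Bordetella | | | | | | | | R | | | | |
| Burkholderia | | Y | | | Y | | Y | | | | | |
| Escherichia | R | | R | Y | Y | | Y | Y | R | Y | Y | R |
| Leptospira | | | | | | | | | | R | | |
| Mannheimia | R | Y | R | R | R | R | Y | | R | Y | R | R |
| Microcystis | | | Y | | | | | | | | | |
| Mycoplasma | Y | R | | Y | Y | Y | Y | R | Y | | Y | Y |
| Plantactinospora | | | | | | | Y | | | | | |
| Pseudomonas | | | | Y | | | Y | | | | | |
| Streptomyces | | | | | | | Y | | | | | |
| Yersinia | | | Y | | | | | | | | | |
| Ichnovirus | Y | | | | | | | | | | | |
| Pandoravirus | R | R | R | R | R | R | Y | R | R | R | R | R |
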
rx32940 commented 5 years ago

PCA was also run on the relative abundance results from Bracken. It uses genus-level rather than phylum-level abundance, because the virus classifications have no phylum-level assignment.
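
As a rough illustration of the PCA step described above, here is a minimal sketch in Python. It is not the analysis code used for this issue: it assumes NumPy is available, the sample names and abundance values are hypothetical, and `pca` is an illustrative helper. Rows are samples, columns are genera, and the principal components come from an SVD of the column-centered matrix.

```python
import numpy as np

def pca(abundance, n_components=2):
    """PCA via SVD. Rows are samples, columns are genera."""
    X = np.asarray(abundance, dtype=float)
    X_centered = X - X.mean(axis=0)                      # center each genus column
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]      # sample coordinates on the PCs
    explained = (S ** 2) / np.sum(S ** 2)                # variance ratio per component
    return scores, explained[:n_components]

# Hypothetical relative abundances (NOT values from this issue):
# 4 samples x 3 genera, each row sums to 1.
samples = ["A.K", "A.L", "B.K", "B.L"]
abundance = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.60, 0.30],
    [0.65, 0.25, 0.10],
    [0.15, 0.55, 0.30],
])

scores, explained = pca(abundance)
for name, (pc1, pc2) in zip(samples, scores):
    print(f"{name}: PC1={pc1:+.3f} PC2={pc2:+.3f}")
print("explained variance ratio:", np.round(explained, 3))
```

In this toy input the two `.K` samples fall on one side of PC1 and the two `.L` samples on the other, which is the kind of separation one would look for in the real plot.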

rx32940 commented 5 years ago

Because the relative abundances in the company's analysis differ from the Kraken2/Bracken results, I will redo the metagenomic profiling with CLARK / CLARK(s).

work to do:

  • [ ] test out regular CLARK and relative abundance

    • [x] illustrate the CLARK results at both genus and domain level
    • [x] compare with Kraken2/Bracken results
    • [ ] do PCA with CLARK abundance and compare with Kraken2/Bracken relative abundance
  • [ ] test out CLARK(s) to increase sensitivity

    • [ ] if CLARK(s) works, do visualization in R and also PCA

CLARK include_UNKNOWN

Kraken2 w/ BRACKEN relative abundance re-estimated domain_classification_unkown code

company result (phylum level: only bacteria)

rx32940 commented 5 years ago

Percentage of unclassified sequence

 **CLARK**

include_UNKNOWN Code: CLARK_domain_visual.R

 **Kraken2 w/ BRACKEN relative abundance re-estimated**

domain_classification_unkown Code: Bracken_extraction.R

company result (phylum level: only bacteria)

For genus level analysis:

The genera detected at high abundance also differ from those found by Kraken2.

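Y: abundance > 0.05 and < 0.1
R: abundance >= 0.1

| Genus | R22.K | R22.L | R22.S | R26.K | R26.L | R26.S | R27.K | R27.L | R27.S | R28.K | R28.L | R28.S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bacteroides | | | Y | | | | | | | | | |
| Bordetella | | | | | | | | Y | | | | |
| Leptospira | | | | | | | | | | R | | |
| Mycoplasma | Y | R | Y | Y | R | R | Y | R | R | Y | R | R |
| Negativicoccus | | | | R | | | R | | | | | |
| Plantactinospora | R | R | Y | R | R | R | R | R | R | | R | Y |
| Ichnovirus | Y | | | | | | | | | | Y | Y |
| Pandoravirus | R | Y | Y | Y | Y | Y | Y | Y | R | Y | R | R |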
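
The Y/R flags in the table above follow a simple threshold rule. A minimal Python sketch of that rule is shown here, assuming the cut-offs from the legend (Y for values above 0.05 and below 0.1, R for values of 0.1 or more). The function name `flag` and the example abundances are hypothetical and are not taken from the CLARK output.

```python
def flag(abundance):
    """Return 'R', 'Y' or '' for one abundance value, using the legend's cut-offs."""
    if abundance >= 0.1:
        return "R"
    if abundance > 0.05:
        return "Y"
    return ""  # at or below 0.05: cell left blank

# Hypothetical abundances for one genus across three samples.
row = {"S1": 0.02, "S2": 0.07, "S3": 0.31}
flags = {sample: flag(value) for sample, value in row.items()}
print(flags)
```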

Code for genus result

Directory for complete genus composition table:

/Users/rx32940/Dropbox/5. Rachel's projects/Metagenomic_Analysis/CLARK:CLARK(s)/all_samples_CLARK.xlsx
rx32940 commented 5 years ago

work to do:

  • [x] test out regular CLARK and relative abundance

    • [x] do illustration with the CLARK results for both genus and domain level
    • [x] compare with KRAKEN2/BRACKEN results.
    • [x] do pca analysis with Clark abundance and compare with Kraken2/Bracken relative abundance

pca_genus_plot

rx32940 commented 5 years ago

However, the abundance distribution for the CLARK samples is strongly skewed toward 0, with an extremely long and narrow tail. Most of the density under the distribution is at 0.

| Sample | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| R22.K | 0 | 0 | 1.73E-05 | 0.000896057889784946 | 0.0001490735 | 0.136032 |
| R22.L | 0 | 0 | 1.93E-05 | 0.000896051628136201 | 0.000164474 | 0.201693 |
| R22.S | 0 | 9.76E-06 | 2.93E-05 | 0.000896056622759857 | 0.000175762 | 0.0936325 |
| R26.K | 0 | 0 | 1.44E-05 | 0.000896057411290323 | 0.000129781 | 0.2176 |
| R26.L | 0 | 0 | 1.69E-05 | 0.000896055443548387 | 0.00017325875 | 0.151673 |
| R26.S | 0 | 0 | 1.54E-05 | 0.000896062451612903 | 0.000153931 | 0.173358 |
| R27.K | 0 | 0 | 1.13E-05 | 0.000896057904121864 | 0.000101596 | 0.306932 |
| R27.L | 0 | 0 | 1.13E-05 | 0.000896054120967742 | 0.000158685 | 0.200023 |
| R27.S | 0 | 0 | 1.27E-05 | 0.000896057788530466 | 0.000177415 | 0.125559 |
| R28.K | 0 | 0 | 1.32E-05 | 0.000896054403225806 | 0.000105624 | 0.359359 |
| R28.L | 0 | 0 | 1.53E-05 | 0.000896057774193548 | 0.000142594 | 0.132216 |
| R28.S | 0 | 0 | 1.87E-05 | 0.00089605263530466 | 0.000163914 | 0.138385 |

image

code: Code for abundance distribution
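
For readers who want to reproduce this kind of per-sample summary outside R, the sketch below computes the same six statistics in Python. It is only an illustration: the abundance values are hypothetical, `six_number_summary` is an illustrative name, and it assumes the table was produced with R's default quantile rule (type 7), which `statistics.quantiles(..., method="inclusive")` matches. When a sample's abundances sum to about 1, the mean is simply 1 divided by the number of taxa, which would explain why the means in the table are nearly identical across samples.

```python
import statistics

def six_number_summary(values):
    """Min, quartiles, mean and max, using linear-interpolation (inclusive) quartiles."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "Min.": data[0],
        "1st Qu.": q1,
        "Median": median,
        "Mean": statistics.fmean(data),
        "3rd Qu.": q3,
        "Max.": data[-1],
    }

# Hypothetical genus-level relative abundances for one sample (they sum to 1).
abundances = [0.0, 0.0, 0.0, 0.0, 0.01, 0.02, 0.07, 0.90]

summary = six_number_summary(abundances)
for name, value in summary.items():
    print(f"{name:>8}: {value:.4f}")

# Share of taxa sitting exactly at zero, a quick indicator of the skew.
zero_fraction = sum(1 for v in abundances if v == 0) / len(abundances)
print(f"fraction of taxa at exactly 0: {zero_fraction:.2f}")
```

A mean far above the median, as in this toy input, is the same pattern the table shows: a few dominant genera and a large mass of taxa at or near zero.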

rx32940 commented 5 years ago

try:

TO DO:

rx32940 commented 5 years ago

19747304060d4d8e1437379b245880e886176d9c compare to bacteria + genus PCA with Bracken estimation & company's analysis

rx32940 commented 5 years ago

Clark

acb0c57fb8cb677622d7c5d07305b7b27dbf5ce2

rx32940 commented 5 years ago

code

Top10_genus

For the genus code, check the composition of the "other" category for each individual sample in the results_phylum and results folders.
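
The "top 10 plus other" grouping can be sketched as follows. This is a hedged Python illustration, not the linked code: `top_n_with_other` is a hypothetical helper, the abundance values are made up, and it ranks taxa within a single sample (the original plots may rank across all samples instead). It also returns the taxa that were folded into "Other" so their composition can be inspected.

```python
def top_n_with_other(abundance, n=10):
    """Keep the n most abundant taxa; sum the remainder into an 'Other' entry.

    Returns (collapsed, other_members) so the content of 'Other' stays inspectable.
    """
    ranked = sorted(abundance.items(), key=lambda item: item[1], reverse=True)
    collapsed = dict(ranked[:n])
    other_members = dict(ranked[n:])
    if other_members:
        collapsed["Other"] = sum(other_members.values())
    return collapsed, other_members

# Hypothetical relative abundances for one sample (values are made up).
sample = {
    "Mannheimia": 0.40,
    "Escherichia": 0.25,
    "Mycoplasma": 0.15,
    "Bacillus": 0.08,
    "Pseudomonas": 0.07,
    "Yersinia": 0.05,
}

collapsed, other_members = top_n_with_other(sample, n=3)
print("plotted categories:", list(collapsed))
print("taxa inside 'Other':", sorted(other_members))
```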

lsalvador commented 4 years ago

Try PCA with the top 10 phyla and genera.

rx32940 commented 4 years ago

please read: Hawinkel, S., Mattiello, F., Bijnens, L., & Thas, O. (2019). A broken promise: microbiome differential abundance methods do not control the false discovery rate. Briefings in Bioinformatics, 20(1), 210–221. https://doi.org/10.1093/bib/bbx104

Thomas, T., Gilbert, J., & Meyer, F. (2012). Metagenomics - a guide from sampling to data analysis. Microbial Informatics and Experimentation, 2(1), 3. https://doi.org/10.1186/2042-5783-2-3

For example, at the phylum level, these bacterial phyla showed differences in abundance between kidney and lung samples:

| Taxa | mean(group1) | variance(group1) | standard error(group1) | mean(group2) | variance(group2) | standard error(group2) | p value | q value |
|---|---|---|---|---|---|---|---|---|
| k__Bacteria;p__candidate division WOR_3 | 0 | 0 | 0 | 0.0011118 | 6.00E-07 | 0.00038728 | 0.0184 | 0.04169362 |
| k__Bacteria;p__Tenericutes | 3.00E-05 | 3.05E-09 | 2.76E-05 | 0.135513 | 0.00894899 | 0.04729955 | 0.02343333 | 0.04169362 |
| k__Bacteria;p__Fusobacteria | 0 | 0 | 0 | 0.00177724 | 1.55E-06 | 0.00062314 | 0.02443333 | 0.04169362 |
| k__Bacteria;p__Candidatus Falkowbacteria | 0 | 0 | 0 | 0.00057181 | 1.66E-07 | 0.00020397 | 0.0346 | 0.04169362 |
| k__Bacteria;p__Thermotogae | 0 | 0 | 0 | 0.00149152 | 1.13E-06 | 0.00053217 | 0.03963333 | 0.04169362 |
| k__Bacteria;p__Firmicutes | 0.0029014 | 3.52E-06 | 0.00093768 | 0.02034824 | 0.0001814 | 0.00673431 | 0.0477 | 0.04169362 |

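
The per-group columns in the table above can be related to each other with a short sketch. The reported standard errors are consistent with SE = sqrt(variance / n) for n = 4 samples per group. The Python code below computes those three statistics and a two-sided permutation p-value on a Welch-style t statistic. This is a simplified stand-in, not the Metastats implementation, and it does not compute q values. All function names and input values are hypothetical.

```python
import math
import random
import statistics

def group_stats(values):
    """Mean, sample variance (n - 1 denominator) and standard error of the mean."""
    mean = statistics.fmean(values)
    variance = statistics.variance(values)
    std_error = math.sqrt(variance / len(values))
    return mean, variance, std_error

def welch_t(a, b):
    """Difference in means divided by the combined standard error."""
    mean_a, var_a, _ = group_stats(a)
    mean_b, var_b, _ = group_stats(b)
    denom = math.sqrt(var_a / len(a) + var_b / len(b))
    return 0.0 if denom == 0 else (mean_a - mean_b) / denom

def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided p-value from randomly reshuffling the group labels."""
    rng = random.Random(seed)
    observed = abs(welch_t(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if abs(welch_t(pooled[:len(a)], pooled[len(a):])) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)

# Hypothetical relative abundances of one phylum in two groups of 4 samples.
group1 = [0.0, 0.0, 0.0001, 0.0]
group2 = [0.05, 0.12, 0.20, 0.17]

print("group1 mean/variance/SE:", group_stats(group1))
print("group2 mean/variance/SE:", group_stats(group2))
p_value = permutation_p_value(group1, group2)
print("permutation p-value:", round(p_value, 4))
```

With only 4 samples per group there are just 70 distinct ways to split the 8 values, so permutation p-values of this kind are coarse and cannot get much smaller than about 0.03.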
rx32940 commented 4 years ago

In addition to the software described above: https://github.com/rx32940/Lepto-Metagenomics/issues/1#issuecomment-530540958

Differential abundance analysis has also been done with DESeq; see: https://bioconductor.org/packages/devel/bioc/vignettes/phyloseq/inst/doc/phyloseq-mixture-models.html

rx32940 commented 4 years ago

PCA using the differentially abundant phyla and classes found by Metastats:

rx32940 commented 4 years ago

Metagenomics analysis book (chapters 3 & 4), available at the science library:

Screen Shot 2019-09-12 at 10 24 51 AM

Screen Shot 2019-09-16 at 1 21 18 PM

rx32940 commented 4 years ago

why did I do direct taxonomic classification?

"Direct taxonomic classification is useful for quantitative community profiling and identification of organisms with close relatives in the database. ... more qualitative understanding of the physiology of the uncultivated microbes. By identifying single-copy and conserved genes in the contig bins, taxonomy, genome completeness, as well as contamination, can be assessed. "

Breitwieser, F. P., Lu, J., & Salzberg, S. L. (n.d.). A review of methods and databases for metagenomic classification and assembly. https://doi.org/10.1093/bib/bbx120

rx32940 commented 4 years ago

Is relative abundance the right way to do metagenomic analysis?

What is wrong with correlating relative abundance? Everything!

McMurdie, P. J., & Holmes, S. (2014). Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible. PLoS Computational Biology, 10(4), e1003531. https://doi.org/10.1371/journal.pcbi.1003531

rx32940 commented 4 years ago

Try to use MG-RAST pipeline for analysis. link to new issue: https://github.com/rx32940/Lepto-Metagenomics/issues/2

rx32940 commented 4 years ago

Alpha Diversity statistics

These plots are based on absolute abundance results from Clark with genus level taxa

rx32940 commented 4 years ago

These plots are based on absolute abundance results from Bracken-adjusted Kraken2 results with genus-level taxa.

Alpha_bracken
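
The comments above do not say which alpha diversity statistics were plotted. As a hedged illustration, the Python sketch below computes three common ones (observed richness, Shannon index, Gini-Simpson index) from genus-level counts. The counts are hypothetical, and the choice of indices is an assumption rather than a description of the original plots.

```python
import math

def observed_richness(counts):
    """Number of taxa with at least one read."""
    return sum(1 for c in counts if c > 0)

def shannon(counts):
    """Shannon index H = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def simpson(counts):
    """Gini-Simpson index 1 - sum(p_i^2); higher means more even."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Hypothetical genus-level read counts for two samples.
even_sample = [250, 250, 250, 250]
skewed_sample = [970, 10, 10, 10]

for name, counts in [("even", even_sample), ("skewed", skewed_sample)]:
    print(f"{name}: richness={observed_richness(counts)} "
          f"shannon={shannon(counts):.3f} simpson={simpson(counts):.3f}")
```

Both toy samples have the same richness, but the skewed one scores much lower on the Shannon and Gini-Simpson indices, which is why these indices are usually reported alongside raw taxon counts.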

rx32940 commented 4 years ago

https://github.com/rx32940/Lepto-Metagenomics/issues/1#issuecomment-533662980 https://github.com/rx32940/Lepto-Metagenomics/issues/1#issuecomment-535519502 code to generate last two trends in this issue

rx32940 commented 4 years ago

This paper discusses the evaluation of metagenomics methods: https://www.nature.com/articles/s41598-018-30515-5

This paper also covers the performance of Bracken: https://www.cell.com/cell/pdf/S0092-8674(19)30775-5.pdf