MG-RAST pipeline - Githubissues

rx32940 commented 5 years ago

This is a complete pipeline for metagenomic analysis. I am trying it out because of the discrepancies between my Kraken2/CLARK results and the company's results. I skipped the QC and host-cleaning steps for the Kraken2/CLARK analyses (I believe the company used KneadData for this task), so I want to use an established metagenomics pipeline, MG-RAST, to check the accuracy of my results.

rx32940 commented 5 years ago

Uploaded all raw data from the local machine to the web interface inbox through the API (upload code: local machine to web interface).
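A minimal sketch of what such an upload could look like, not the actual upload code used here. It only builds the `curl` command for each file. The inbox URL, the `auth` header, and the `upload` form field follow my reading of the MG-RAST API documentation and should be treated as assumptions to verify; `MGRAST_TOKEN` and the file names are hypothetical.

```python
import os
import shlex

# Assumed inbox endpoint; confirm against the current MG-RAST API documentation.
INBOX_URL = "https://api.mg-rast.org/inbox"


def build_upload_command(fastq_path, token, url=INBOX_URL):
    """Return the curl argument list that would POST one file to the inbox."""
    return [
        "curl", "-X", "POST",
        "-H", f"auth: {token}",            # assumed auth header name
        "-F", f"upload=@{fastq_path}",     # assumed multipart form field
        url,
    ]


if __name__ == "__main__":
    # MGRAST_TOKEN is a hypothetical environment variable holding the web key.
    token = os.environ.get("MGRAST_TOKEN", "<your-token>")
    for path in ["R22.L.fastq", "R22.S.fastq"]:  # hypothetical file names
        print(shlex.join(build_upload_command(path, token)))
```

Each printed command can be checked by eye before running it, or the list can be passed to `subprocess.run(cmd, check=True)`.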

rx32940 commented 5 years ago

I joined the forward and reverse (paired-end) fastq files for each sample before feeding them into the pipeline. Project name: pair-end-joined-metagenomes

pipeline options:

| option | value |
| --- | --- |
| dereplication | yes |
| screening (host clean) | M. musculus, NCBI v37 |
| dynamic trimming | yes |
| minimum quality | 10 |
| maximum low quality basepairs | 5 |
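To make the two trimming options concrete, here is a simplified sketch of how "minimum quality 10" and "maximum low quality basepairs 5" can interact. It is my reading of the options, not MG-RAST's actual implementation: a read is cut just before the base that would push it over the allowed number of low-quality bases. Phred+33 encoding is assumed, and the function names are hypothetical.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string to integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def dynamic_trim(sequence, quality_line, min_quality=10, max_low_quality=5, offset=33):
    """Truncate a read before the base that exceeds the low-quality allowance.

    A base counts as low quality when its Phred score is below min_quality.
    The read keeps at most max_low_quality such bases.
    """
    low = 0
    for i, score in enumerate(phred_scores(quality_line, offset)):
        if score < min_quality:
            low += 1
            if low > max_low_quality:
                return sequence[:i], quality_line[:i]
    return sequence, quality_line


if __name__ == "__main__":
    # 'I' is Phred 40 and '#' is Phred 2 under Phred+33.
    seq, qual = dynamic_trim("ACGTACGTAC", "IIII######")
    print(seq, qual)
```

With the defaults, the example read has six low-quality bases at its end, so only the last base is removed.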

steps performed in the pipeline:

(screenshot of the pipeline steps, 2019-09-23)

example analysis for R22.L found in this link (need login info): https://www.mg-rast.org/mgmain.html?mgpage=overview&metagenome=d65d65c7036d676d343836303339302e33

R22.L genus-level taxonomic distribution (screenshot, 2019-09-23)

KRONA result
The unclassified sections of the CLARK and Kraken2 results are eukaryote sequences (unclassified because eukaryote sequences were not in their databases).
The Mus genome was screened in one of the pipeline steps, so it should already have been removed. Why do we still get Mus hits in the taxonomic profiling?
- what is the host? Rattus or Mus?
- if Rattus, I need to use bowtie to screen the Rattus genome out of the samples (this reference is not available as an option in MG-RAST)
  - use the script from the pipeline (it can be found in the download section after analysis)
- but why is the Mus genome also identified?
next steps:
- download the screened sequences from MG-RAST and submit them to the pipeline again to confirm that the host sequences have been removed
- screen against the Rattus genome for cleaner sequences (ask Sree what the host of the samples is)
Absolute abundance of each sample (representative hit)
Relative abundance of each sample (representative hit)
In the liver samples from subjects R22 and R27, bacterial abundance is considerably higher. This agrees with the results provided by the company. However, R26.L does not show high abundance in the Bacteria domain, which is inconsistent with the company's result.
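For reference, going from absolute to relative abundance is just a per-sample normalisation. The sketch below shows the arithmetic; the domain counts are made-up placeholder numbers, not values from these samples, and `relative_abundance` is a hypothetical helper name.

```python
def relative_abundance(counts):
    """Convert absolute hit counts per taxon into fractions of the sample total."""
    total = sum(counts.values())
    if total == 0:
        # Avoid division by zero for an empty sample.
        return {taxon: 0.0 for taxon in counts}
    return {taxon: n / total for taxon, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical counts for one sample, for illustration only.
    example = {"Bacteria": 750, "Eukaryota": 200, "Archaea": 50}
    for taxon, fraction in relative_abundance(example).items():
        print(f"{taxon}\t{fraction:.2%}")
```

Because each sample is divided by its own total, samples with very different sequencing depth become comparable, but a high relative abundance says nothing about the absolute number of hits.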

rx32940 commented 5 years ago

project name: test-hostcleaned

The Mus-screened data from MG-RAST still show a small amount of Mus genome and a large portion of rodent genome. I will use bowtie to remove the Rattus genome from the sequences.

This step was tested with only one sample, R27.K.

Note that the sequence file was not only Mus screened but also preprocessed by the other cleaning steps in the pipeline.

To do:

- [x] email Sree about the exact host of the samples
- [ ] find Rattus reference genome
- [ ] screen with Rattus genome
- [ ] feed into the pipeline again

conclusion: the unclassified portion of the Kraken2 (LCA) and CLARK results belongs to eukaryotes that were not present in the databases.

rx32940 commented 5 years ago

The host is rat, and the rat reference genome is not available in the MG-RAST pipeline. I found the reference through the UCSC Genome Browser, the Jul. 2014 (RGSC 6.0/rn6) assembly of the rat genome (rn6, RGSC Rnor_6.0), and downloaded it through FTP.

I will use bowtie to screen the QC-checked fastq files from the MG-RAST pipeline against the rat reference genome and feed them back into the pipeline for taxonomic profiling again.

bowtie2 screening with Rattus reference genome code
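As a rough illustration of this kind of host screening (a sketch, not the script linked above), the helper below builds the two bowtie2 commands involved: indexing the reference, then aligning the joined reads and keeping only those that fail to align. The file names, index prefix, and thread count are hypothetical; the reads are treated as unpaired because they were already joined.

```python
import shlex


def build_index_command(reference_fasta, index_prefix):
    """Command to build a bowtie2 index from the host reference FASTA."""
    return ["bowtie2-build", reference_fasta, index_prefix]


def build_screen_command(index_prefix, reads_fastq, cleaned_fastq, threads=4):
    """Command to align reads to the host and write the unaligned ones.

    --un writes unpaired reads that did not align (the host-cleaned reads);
    the alignments themselves are discarded by sending SAM output to /dev/null.
    """
    return [
        "bowtie2",
        "-p", str(threads),
        "-x", index_prefix,
        "-U", reads_fastq,
        "--un", cleaned_fastq,
        "-S", "/dev/null",
    ]


if __name__ == "__main__":
    # Hypothetical file names for illustration.
    print(shlex.join(build_index_command("rn6.fa", "rn6_index")))
    print(shlex.join(build_screen_command("rn6_index", "R22.L.fastq",
                                          "R22.L_hostcleaned.fastq")))
```

Building the index is the slow, one-time part; the same index can then be reused for every sample.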

Because this task takes a very long time, I decided to test first with the two samples that have already finished screening:

project name in MG-RAST: Rattus_hostcleaned_test
samples:
- R22.L_hostcleaned (mgm4860391)
- R22.S_hostcleaned (mgm4860390)

rx32940 commented 5 years ago

R22.S host-cleaned data from the company, for comparison in the pipeline with the data host-cleaned with bowtie2.

rx32940 / Lepto-Metagenomics

MG-RAST pipeline #2