Closed SilasK closed 5 years ago
First, here is an article which explains the fundamental difference between read-based tools like Humans and assembly based tools like Atlas: http://www.nature.com/doifinder/10.1038/nbt.3935
I hope you don’t mind that the Human database is constructed for human microbiota and not updated since 5 years. Anyway I suggest to check how many reads human cannot annotate ‘unclassified’.
Humann and metaphlan ignore paired end reads and treat use only single end reads, that’s right. I don’t think the default mapping parameters are optimised for merged reads, e.g. what is the overlap they require between read and gene?
I would’t merge them and just use them as single end reads, so you will have two times more. You can use ‘atlas QC myconfig.yaml’ to get the QC_reads and concatenate them.
Kind regards Silas
Dear Silas,
Thank you very much for your prompt reply. I tested one sample with humann2 and metaphlan2. As you mentioned, these two pipelines worked very bad with the environmental dataset, the taxonomical and functional profiling missed 99% and 98% of total QC reads. That is far out of my expectation. The atlas assinged these reads far higher in this aspect. Frankly, I am new in analyzing metagenomic data and get confused at the current stage. I think it is very important to choose a right pipeline in the beginning. To generate the taxonomic and functional profile for environmental sample, what tools or pipelines do you think are more appropriate for environmental samples? You have mentioned kraken, then I will try it soon. In addition, do you think kallisto will fit? Has atlas ever considered to include the two blocks?
Looking forward to your instruction. Thank you and best regards,
cam315
Welcome to the club.
I also work on poorly characterised environments. I don’t think that neither Kraken nor Kallisto, will help you much because they both depend on databases. And If there are not many genomes in gene bank this two tools won’t classify much reads.
For environments lacking genome references, people have often used assembly based metagenomics (see article from last mail). Using assembly based metagenomics one tries to reconstruct “meta genome assembled genomes” (MAGs), which are then more or less complete and more or less new or closely related to existing taxonomies.
As I see it Atlas is a pipeline combining (the best) state of the art tools to reconstruct genomes from meta-genomes. However Atlas doesn’t produce output tables combining the results from different samples, rather it outpus tables with gene annotation for each MAG, specific too each sample.
Have you already run Atlas until the end? As I’m also interested in such summary tables I will try to help you. What exactly are you looking for?
How do you expect a taxonomy to look like if you have newly identified species in your sample?
Yes, I've tried running Atlas until the end and have got a bunch of data in separate samples. I would assume that the assembly-based MAGs are advanced analysis, while the taxonomic and functional profile are basic analysis. With the two profiling (like traditional OTUs table putting mulitple samples together), the researchers could do some basic analysis regarding questions, for example, who are there and what they can do, even the taxonomic profiling is limited by reference database for environmental samples. There are also a lot of downstream analysis R or python packages with basic data designed as OTU tables.
Hello again @cam315
For taxonomy tables, I don't understand how do you want to get an taxonomy if not trough MAGs or trough marker genes?
If you want a table for functions, (without the genome information) you could run Atlas with or without the genome_binning option.
You should get a {sample}/{sample}_annotations.txt
contig_id | locus_tag | ftype | Length | gene | EC_number | product | orf_taxonomy | taxonomy | erfc | count |
---|---|---|---|---|---|---|---|---|---|---|
FR5_0 | FR5_00001 | CDS | 663 | hypothetical protein | 23 | |||||
FR5_1 | FR5_00002 | CDS | 822 | hypothetical protein | 45 | |||||
FR5_2 | FR5_00003 | CDS | 393 | hypothetical protein | 78 | |||||
FR5_2 | FR5_00004 | CDS | 279 | hypothetical protein | 62 |
There we have the counts, at least for each sample.
Now to combine different samples,
one could combine all the genes predicted in {sample}/annotation/prokka/{sample}.fna and generate a gene catalog for your dataset. Then we could quantify the genes in each sample.
Otherwise one could combine all contigs and predict genes from them and run them trough the annotation and quantification steps.
This does however not adress the annotation, the prokka annotation are quite limited. I tought about eggNOG or OrthoDB, which seems to much larger.
@SilasK Following your instruction for functional profiling, if someone want to merge the according *_annotation.txt from different samples together, a column should be specify for merging, for example, in R. In this case, contig_id ,locus_tag, product, taxonomy are not suitable. What will be your comment? Thank you.
I don't know, I see there no comparable id. taxonomy would be eventually possible, but it's not defined for all genes.
Why aren't they suitable? Maybe not suitable to your research question. Prokka annotations are definitely incomplete, but it does get you a simple way to compare across samples. You have EC number which you can then link to KEGG pathways and you have quantities associated per EC. You can do the same with taxonomy, product, etc. Totally depends on the question you're trying to answer.
Yes it’s true you could focus on the subset with EC numbers. I meant that the locus tags can't be ported between samples, something I'd like to solve with #90
The problem is that either EC or taxonomy are not always assigned (ie, is missing) for some rows, which will make the merging difficult. On the other hand, contig_ids and locus_tag are unique and present, by this could not be used for multiple samples (for there are not shared by different samples). Any suggestions?
We are working on it #90
But bare in mind that even in the best case, not all reads get aligned, and there is missing data.
@SilasK Then we can first assign these missing elements as 'unclassified' and merge the table afterwards, right?
@cam315 I implemented a gene quantification in the #90 branch. If you want to try it.
@SilasK Thank you for your quick feedback. I would like to try.
To test my implementation you should change to the 'combine_contig' branch:
Download the repro grom github:
git clone https://github.com/pnnl/atlas.git
change to the development branch:
git checkout <branch>
Go in the environment where you could call atlas
e.g. source activate atlas
I suggest directly to access directly the Snakefile. I din’t finalised everything in the setup.py.
instead of atlas assemble -o {outdir} config.yaml <params>
use:
snakemake -s path/to/atlas/atlas/Snakefile --use-conda --config workflow="complete" \
-d {outdir} \
--configfile config.yaml \
--config combine_contigs=True \
<params>
see in the subfolder contigs
This works for all contigs (in my test data, I had only one)
Geneid | Chr | Start | End | Strand | Length |
---|---|---|---|---|---|
1_1 | C_0 | 3 | 3368 | + | 3366 |
1_2 | C_0 | 3375 | 4880 | + | 1506 |
1_3 | C_0 | 4867 | 5661 | + | 795 |
1_4 | C_0 | 5636 | 6121 | + | 486 |
Geneid | S1 | S2 |
---|---|---|
1_1 | 279.5 | 279.5 |
1_2 | 144 | 144 |
1_3 | 85.5 | 85.5 |
1_4 | 52 | 52 |
I advanced faster than I thought.
I managed to make a new version of the pipeline. It should also output the taxonomy of the genes. However I don’t know how to aggregate them.
You can test it directly using the mags
branch (see above)
Apparently there is only one high quality contig in the cami_high data and here are the output of it.
contig_id | locus_tag | ftype | Length | gene | EC_number | product | orf_taxonomy | taxonomy | erfc | count |
---|---|---|---|---|---|---|---|---|---|---|
C_0 | metagenome_00001 | CDS | yteP | putative multiple-sugar transport system permease YteP | Ruminiclostridium sp. KB18 | kBacteria;pFirmicutes;cClostridia;oClostridiales;fRuminococcaceae;gRuminiclostridium;s__Ruminiclostridium sp. KB18 | 0.09589097 | 606 | ||
C_0 | metagenome_00002 | CDS | araQ | L-arabinose transport system permease protein AraQ | Ruminiclostridium sp. KB18 | kBacteria;pFirmicutes;cClostridia;oClostridiales;fRuminococcaceae;gRuminiclostridium;s__Ruminiclostridium sp. KB18 | 0.09589097 | 906 | ||
C_0 | metagenome_00003 | CDS | hypothetical protein | 387 |
Geneid | S002 | S004 | S005 |
---|---|---|---|
metagenome_00001 | 342.5 | 70 | 109.5 |
metagenome_00002 | 427 | 92.5 | 158.5 |
metagenome_00003 | 259.5 | 58.5 | 68 |
@SilasK following your instruction, I have tested the 'mags' branch. During the step of 'rule analyse_whole_metagenome', the input file can not be generated even I increased the latency-wait to 300, 600 and then to 900 seconds. It might depend on something beyond the capacity of computer workstation. Could you please optimize the script so that the pipeline can go ahead? info as follows:
rule analyse_whole_metagenome: input: contigs/combined_contigs.fasta output: annotations/metagenome/metagenome_contigs.fasta jobid: 158
Waiting at most 600 seconds for missing files. MissingOutputException in line 304 of /home/syang/atlas/atlas/rules/combine_contigs.snakefile: Missing files after 600 seconds: annotations/metagenome/metagenome_contigs.fasta This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait. Will exit after finishing currently running jobs. Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message
In the last version I corrected this small bug.
Could you again download it again. Sory for the inconveniences.
On Mar 2, 2018, 10:02, at 10:02, cam315 notifications@github.com wrote:
@SilasK following your instruction, I have tested the 'mags' branch. During the step of 'rule analyse_whole_metagenome', the input file can not be generated even I increase the latency-wait to 600 seconds. It may depend on the capacity of computer workstation. Could you please optimize the script so that the pipeline can go ahead? info as follows:
rule analyse_whole_metagenome: input: contigs/combined_contigs.fasta output: annotations/metagenome/metagenome_contigs.fasta jobid: 158
Waiting at most 600 seconds for missing files. MissingOutputException in line 304 of /home/syang/atlas/atlas/rules/combine_contigs.snakefile: Missing files after 600 seconds: annotations/metagenome/metagenome_contigs.fasta This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait. Will exit after finishing currently running jobs. Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message
-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/pnnl/atlas/issues/87#issuecomment-369862195
Dear Silas,
Last time, I consulted you about the taxonomic and functional profiling for metagenomic data. After check the several papers and blocks, I want to try metaphlan2 and humann2 to get taxonomic and functional profile, respectively. I am not sure whether you are familiar with the two pipeline, my question will be very general before running the two pipeline.
All my data is paired-end data, while it appears that both metaphlan2 and humann2 do not support paired-end analysis. For this step, I would like to use atlas to merge each to to on paired fastq file. In this case, shall what should I specify in atlas assemble -until?
In addition, I am not sure which fastq is the best for taxonomic and functional profile, shall I use the data generated after 1) quality control (quality controlled reads) or 2) assembly? I don't have clue but feel that it will be better to use the merged fastq file at least after quality control. Can you give me more instruction in this point? To get this file, what parameter shall I further specify in atlas assemble flow?
Looking forward to hearing from you.
Thank you and best regards,
cam315