metagenome-atlas / atlas

ATLAS - Three commands to start analyzing your metagenome data
https://metagenome-atlas.github.io/
BSD 3-Clause "New" or "Revised" License
368 stars 97 forks source link

taxonomic profiles #87

Closed SilasK closed 5 years ago

SilasK commented 6 years ago

Dear Silas,

Last time, I consulted you about the taxonomic and functional profiling for metagenomic data. After check the several papers and blocks, I want to try metaphlan2 and humann2 to get taxonomic and functional profile, respectively. I am not sure whether you are familiar with the two pipeline, my question will be very general before running the two pipeline.

All my data is paired-end data, while it appears that both metaphlan2 and humann2 do not support paired-end analysis. For this step, I would like to use atlas to merge each to to on paired fastq file. In this case, shall what should I specify in atlas assemble -until?

In addition, I am not sure which fastq is the best for taxonomic and functional profile, shall I use the data generated after 1) quality control (quality controlled reads) or 2) assembly? I don't have clue but feel that it will be better to use the merged fastq file at least after quality control. Can you give me more instruction in this point? To get this file, what parameter shall I further specify in atlas assemble flow?

Looking forward to hearing from you.

Thank you and best regards,

cam315

SilasK commented 6 years ago

First, here is an article which explains the fundamental difference between read-based tools like Humans and assembly based tools like Atlas: http://www.nature.com/doifinder/10.1038/nbt.3935

I hope you don’t mind that the Human database is constructed for human microbiota and not updated since 5 years. Anyway I suggest to check how many reads human cannot annotate ‘unclassified’.

Humann and metaphlan ignore paired end reads and treat use only single end reads, that’s right. I don’t think the default mapping parameters are optimised for merged reads, e.g. what is the overlap they require between read and gene?

I would’t merge them and just use them as single end reads, so you will have two times more. You can use ‘atlas QC myconfig.yaml’ to get the QC_reads and concatenate them.

Kind regards Silas

SilasK commented 6 years ago

Dear Silas,

Thank you very much for your prompt reply. I tested one sample with humann2 and metaphlan2. As you mentioned, these two pipelines worked very bad with the environmental dataset, the taxonomical and functional profiling missed 99% and 98% of total QC reads. That is far out of my expectation. The atlas assinged these reads far higher in this aspect. Frankly, I am new in analyzing metagenomic data and get confused at the current stage. I think it is very important to choose a right pipeline in the beginning. To generate the taxonomic and functional profile for environmental sample, what tools or pipelines do you think are more appropriate for environmental samples? You have mentioned kraken, then I will try it soon. In addition, do you think kallisto will fit? Has atlas ever considered to include the two blocks?

Looking forward to your instruction. Thank you and best regards,

cam315

SilasK commented 6 years ago

Welcome to the club.

I also work on poorly characterised environments. I don’t think that neither Kraken nor Kallisto, will help you much because they both depend on databases. And If there are not many genomes in gene bank this two tools won’t classify much reads.

For environments lacking genome references, people have often used assembly based metagenomics (see article from last mail). Using assembly based metagenomics one tries to reconstruct “meta genome assembled genomes” (MAGs), which are then more or less complete and more or less new or closely related to existing taxonomies.

As I see it Atlas is a pipeline combining (the best) state of the art tools to reconstruct genomes from meta-genomes. However Atlas doesn’t produce output tables combining the results from different samples, rather it outpus tables with gene annotation for each MAG, specific too each sample.

Have you already run Atlas until the end? As I’m also interested in such summary tables I will try to help you. What exactly are you looking for?

How do you expect a taxonomy to look like if you have newly identified species in your sample?

cam315 commented 6 years ago

Yes, I've tried running Atlas until the end and have got a bunch of data in separate samples. I would assume that the assembly-based MAGs are advanced analysis, while the taxonomic and functional profile are basic analysis. With the two profiling (like traditional OTUs table putting mulitple samples together), the researchers could do some basic analysis regarding questions, for example, who are there and what they can do, even the taxonomic profiling is limited by reference database for environmental samples. There are also a lot of downstream analysis R or python packages with basic data designed as OTU tables.

SilasK commented 6 years ago

Hello again @cam315

For taxonomy tables, I don't understand how do you want to get an taxonomy if not trough MAGs or trough marker genes?

If you want a table for functions, (without the genome information) you could run Atlas with or without the genome_binning option.

You should get a {sample}/{sample}_annotations.txt

contig_id locus_tag ftype Length gene EC_number product orf_taxonomy taxonomy erfc count
FR5_0 FR5_00001 CDS 663     hypothetical protein     23
FR5_1 FR5_00002 CDS 822     hypothetical protein     45
FR5_2 FR5_00003 CDS 393     hypothetical protein     78
FR5_2 FR5_00004 CDS 279     hypothetical protein     62

There we have the counts, at least for each sample.

Now to combine different samples,

  1. one could combine all the genes predicted in {sample}/annotation/prokka/{sample}.fna and generate a gene catalog for your dataset. Then we could quantify the genes in each sample.

  2. Otherwise one could combine all contigs and predict genes from them and run them trough the annotation and quantification steps.

This does however not adress the annotation, the prokka annotation are quite limited. I tought about eggNOG or OrthoDB, which seems to much larger.

ghost commented 6 years ago

@SilasK Following your instruction for functional profiling, if someone want to merge the according *_annotation.txt from different samples together, a column should be specify for merging, for example, in R. In this case, contig_id ,locus_tag, product, taxonomy are not suitable. What will be your comment? Thank you.

SilasK commented 6 years ago

I don't know, I see there no comparable id. taxonomy would be eventually possible, but it's not defined for all genes.

brwnj commented 6 years ago

Why aren't they suitable? Maybe not suitable to your research question. Prokka annotations are definitely incomplete, but it does get you a simple way to compare across samples. You have EC number which you can then link to KEGG pathways and you have quantities associated per EC. You can do the same with taxonomy, product, etc. Totally depends on the question you're trying to answer.

SilasK commented 6 years ago

Yes it’s true you could focus on the subset with EC numbers. I meant that the locus tags can't be ported between samples, something I'd like to solve with #90

ghost commented 6 years ago

The problem is that either EC or taxonomy are not always assigned (ie, is missing) for some rows, which will make the merging difficult. On the other hand, contig_ids and locus_tag are unique and present, by this could not be used for multiple samples (for there are not shared by different samples). Any suggestions?

SilasK commented 6 years ago

We are working on it #90

But bare in mind that even in the best case, not all reads get aligned, and there is missing data.

ghost commented 6 years ago

@SilasK Then we can first assign these missing elements as 'unclassified' and merge the table afterwards, right?

SilasK commented 6 years ago

@cam315 I implemented a gene quantification in the #90 branch. If you want to try it.

ghost commented 6 years ago

@SilasK Thank you for your quick feedback. I would like to try.

SilasK commented 6 years ago

To test my implementation you should change to the 'combine_contig' branch:

Download the repro grom github:

git clone https://github.com/pnnl/atlas.git

change to the development branch:

git checkout <branch>

Go in the environment where you could call atlas e.g. source activate atlas

I suggest directly to access directly the Snakefile. I din’t finalised everything in the setup.py.

instead of atlas assemble -o {outdir} config.yaml <params> use:

snakemake -s path/to/atlas/atlas/Snakefile --use-conda --config workflow="complete" \
    -d {outdir} \
    --configfile config.yaml \
    --config combine_contigs=True \
<params>

Results

see in the subfolder contigs

This works for all contigs (in my test data, I had only one)

combined_gene_info.tsv

Geneid Chr Start End Strand Length
1_1 C_0 3 3368 + 3366
1_2 C_0 3375 4880 + 1506
1_3 C_0 4867 5661 + 795
1_4 C_0 5636 6121 + 486

combined_gene_counts.tsv

Geneid S1 S2
1_1 279.5 279.5
1_2 144 144
1_3 85.5 85.5
1_4 52 52
SilasK commented 6 years ago

I advanced faster than I thought.

I managed to make a new version of the pipeline. It should also output the taxonomy of the genes. However I don’t know how to aggregate them.

You can test it directly using the mags branch (see above)

Apparently there is only one high quality contig in the cami_high data and here are the output of it.

Annotations

anotations/metagenome/annotations.txt

contig_id locus_tag ftype Length gene EC_number product orf_taxonomy taxonomy erfc count
C_0 metagenome_00001 CDS   yteP   putative multiple-sugar transport system permease YteP Ruminiclostridium sp. KB18 kBacteria;pFirmicutes;cClostridia;oClostridiales;fRuminococcaceae;gRuminiclostridium;s__Ruminiclostridium sp. KB18 0.09589097 606
C_0 metagenome_00002 CDS   araQ   L-arabinose transport system permease protein AraQ Ruminiclostridium sp. KB18 kBacteria;pFirmicutes;cClostridia;oClostridiales;fRuminococcaceae;gRuminiclostridium;s__Ruminiclostridium sp. KB18 0.09589097 906
C_0 metagenome_00003 CDS       hypothetical protein     387

anotations/metagenome/gene_conts.tsv

Geneid S002 S004 S005
metagenome_00001 342.5 70 109.5
metagenome_00002 427 92.5 158.5
metagenome_00003 259.5 58.5 68
cam315 commented 6 years ago

@SilasK following your instruction, I have tested the 'mags' branch. During the step of 'rule analyse_whole_metagenome', the input file can not be generated even I increased the latency-wait to 300, 600 and then to 900 seconds. It might depend on something beyond the capacity of computer workstation. Could you please optimize the script so that the pipeline can go ahead? info as follows:

rule analyse_whole_metagenome: input: contigs/combined_contigs.fasta output: annotations/metagenome/metagenome_contigs.fasta jobid: 158

Waiting at most 600 seconds for missing files. MissingOutputException in line 304 of /home/syang/atlas/atlas/rules/combine_contigs.snakefile: Missing files after 600 seconds: annotations/metagenome/metagenome_contigs.fasta This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait. Will exit after finishing currently running jobs. Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message

SilasK commented 6 years ago

In the last version I corrected this small bug.

Could you again download it again. Sory for the inconveniences.

On Mar 2, 2018, 10:02, at 10:02, cam315 notifications@github.com wrote:

@SilasK following your instruction, I have tested the 'mags' branch. During the step of 'rule analyse_whole_metagenome', the input file can not be generated even I increase the latency-wait to 600 seconds. It may depend on the capacity of computer workstation. Could you please optimize the script so that the pipeline can go ahead? info as follows:

rule analyse_whole_metagenome: input: contigs/combined_contigs.fasta output: annotations/metagenome/metagenome_contigs.fasta jobid: 158

Waiting at most 600 seconds for missing files. MissingOutputException in line 304 of /home/syang/atlas/atlas/rules/combine_contigs.snakefile: Missing files after 600 seconds: annotations/metagenome/metagenome_contigs.fasta This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait. Will exit after finishing currently running jobs. Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/pnnl/atlas/issues/87#issuecomment-369862195