metagenome-atlas / atlas_analyze

Scripts to get the most of the output of metagenome-atlas

Question KO_tsv file and plot #5

Open Sofie8 opened 3 years ago

Sofie8 commented 3 years ago

Hi Silas,

I have a question regarding the output written to the 'results' folder:

(1) KO.tsv. Structure is:

MAG K00001 K0002 etc..
MAG1 1 1

and the file written to the Genecatalog/annotations folder:

(2) KO.tsv. Structure is:

Gene0014228 KO7304
Gene...

Then there is the last table and plot in summary.html, the one for the KEGG orthologs. I was trying to understand what I am seeing, and which file you load:

Are this KEGG ortholog table and heatmap based on your file (1), so that they show KEGG orthologs per MAG per sample?

Or do you read in file (2), i.e. KEGG ortholog per gene (with some kind of gene abundance) per sample?

I wanted to do further downstream analyses with the genecatalogs file, but I don't know how you translate the query to its abundance (occurrence) in a sample.

I was thinking: can we make graphs in which we express the abundance of certain genes relative to a common gene in each sample? Then we are talking about ratios, which I think reduces (statistically) the dependency of the data on differences in sequencing throughput. Take rpoB as the reference gene, and suppose the query I am interested in is an oil-degrading gene. Then I can say that in this groundwater monitoring well I have 5 oil-degrading targets per 10 rpoB. Since rpoB is common to all bacteria, this means either that half of my bacterial population is an oil degrader, or that one strain carries 5 oil-degrading genes, etc. The samples can have different numbers of reads, but that doesn't really matter then, as long as we take the ratio within each sample?

Jackson did something similar with the outcome of metannotate and I was wondering if I can translate the outcome of your genecatalog file as input to his R script. I just need one extra column in the table saying to which sample, the gene belongs, then I can continue :-)

Sofie

SilasK commented 3 years ago

Hey Sofie,

You know that I’m an advocate of genome-centric analysis. The KO-gene table is linked with the gene-genome table and multiplied by the relative abundance of each genome to produce the KO-abundance table.
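To make the multiplication concrete, here is a minimal sketch in plain Python. The MAG names, KO IDs and abundance values are made up for illustration, and the function `ko_abundance` is hypothetical; it is not the code atlas itself runs, only one way to weight each genome's KO counts by that genome's relative abundance.

```python
from collections import defaultdict

# Hypothetical toy inputs; real tables would be loaded from the atlas output.
# KO copy number per genome (rows: MAG, columns: KO).
ko_per_genome = {
    "MAG1": {"K00001": 1, "K00002": 1},
    "MAG2": {"K00001": 2},
}
# Relative abundance of each genome in each sample (assumed to be fractions).
genome_abundance = {
    "sample1": {"MAG1": 0.75, "MAG2": 0.25},
    "sample2": {"MAG1": 0.5, "MAG2": 0.5},
}


def ko_abundance(ko_per_genome, genome_abundance):
    """Weight each genome's KO counts by its relative abundance, summed per sample."""
    result = {}
    for sample, abundances in genome_abundance.items():
        totals = defaultdict(float)
        for mag, rel_abundance in abundances.items():
            # Genomes without any KO annotation simply contribute nothing.
            for ko, copies in ko_per_genome.get(mag, {}).items():
                totals[ko] += copies * rel_abundance
        result[sample] = dict(totals)
    return result


ko_table = ko_abundance(ko_per_genome, genome_abundance)
for sample, kos in ko_table.items():
    print(sample, kos)
```

This is the same as a matrix product of the KO-by-genome table with the genome-by-sample abundance table, so with larger tables a dataframe or array library would be the more natural choice.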

Now if you have produced a gene-abundance table (https://github.com/metagenome-atlas/atlas/issues/276), you can link it to the gene-KO table. Sum the KOs if you want.

Yes, you can normalise by one single-copy KO or by the median of many. I think the last time I did this I took the list from here: 10.1186/s13059-015-0610-8

This would give you the results as gene-copies / genome.
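A rough sketch of that normalisation for a single sample, in plain Python. The KO IDs and coverage values are placeholders chosen for illustration, `single_copy_kos` is not the list from the cited reference, and `copies_per_genome` is a hypothetical helper name.

```python
from statistics import median

# Hypothetical KO abundances (e.g. summed gene coverage) for one sample.
ko_coverage = {"K03043": 10.0, "K02950": 12.0, "K02874": 8.0, "K00496": 5.0}

# Placeholder list of single-copy marker KOs; the real list would come
# from the reference mentioned above.
single_copy_kos = ["K03043", "K02950", "K02874"]


def copies_per_genome(ko_coverage, single_copy_kos):
    """Divide every KO abundance by the median abundance of the marker KOs."""
    markers = [ko_coverage[ko] for ko in single_copy_kos if ko in ko_coverage]
    if not markers:
        raise ValueError("none of the single-copy KOs were found in this sample")
    reference = median(markers)
    if reference == 0:
        raise ValueError("median marker abundance is zero, cannot normalise")
    return {ko: coverage / reference for ko, coverage in ko_coverage.items()}


normalised = copies_per_genome(ko_coverage, single_copy_kos)
print(normalised["K00496"])
```

Passing a single marker KO in `single_copy_kos` gives the one-gene ratio (such as a query gene relative to rpoB); using the median of many markers is less sensitive to any one marker being poorly covered.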

Sofie8 commented 3 years ago

Hi Silas,

You know that I’m an advocate of genome-centric analysis.

Now if you have produced a gene-abundance table (metagenome-atlas/atlas#276), you can link it to the gene-KO table. Sum the KOs if you want.

To:

GENE ID: 10oct3 5oct3 Annotation
Gene00001 5 2 taxonomy, KO, KEGG, module, ... etc...

Like this I have an 'OTU table' kind of thing which I can use for downstream analyses, just like the OTU table you have for the genomes. I am using that one, but I am afraid I miss several genes that are part of 'partial' genomes. With the contigs, I come a step closer. Additionally, I complement this with something like Metannotate: performing your QC steps, then, without assembly, running fraggenescan++ followed by an HMM search for certain genes. Then I feel like I have taken the most out of my shotgun dataset (for bacteria).

Thanks! Sofie

SilasK commented 3 years ago

Hey Sofie,

My idea behind atlas was to create a consistent reference to annotate all samples of a project. You do this either with a collection of genomes or with the genecatalog. (You could also take a database, e.g. UniProt, and map to it, or create a contig catalog, but this is not implemented in atlas.)

The advantages are, you don't need to annotate the same gene multiple times and you quantify the same genes in all samples.

You can annotate the gene catalog with what you want.

Once you have a table

gene annotation
gene1 K001
gene5 K002

You can combine it with the table "Genecatalog/counts/median_coverage.tsv.gz" #276

  sample1 sample2
gene1 0 10
gene2 5 0

You simply load the counts table, normalize the counts, and select the genes for which you have annotations. You may want to sum genes that have the same annotation.

I can do this easily in python, but it should also be easy in R.
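For example, a minimal sketch using only the Python standard library. The two toy tables mimic the ones above with a couple of extra rows added for illustration. The column names `gene` and `annotation`, the helper names, and the choice of normalisation (dividing by the per-sample total) are assumptions, not something atlas prescribes.

```python
import csv
import io
from collections import defaultdict

# Toy stand-ins for the annotation table and the counts table.
annotation_tsv = "gene\tannotation\ngene1\tK001\ngene5\tK002\ngene7\tK001\n"
counts_tsv = (
    "gene\tsample1\tsample2\n"
    "gene1\t0\t10\n"
    "gene2\t5\t0\n"
    "gene5\t5\t20\n"
    "gene7\t10\t10\n"
)


def read_annotations(handle):
    """Return {gene: annotation} from a two-column TSV with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["gene"]: row["annotation"] for row in reader}


def read_counts(handle):
    """Return (sample names, {gene: [value per sample]}) from a TSV."""
    reader = csv.reader(handle, delimiter="\t")
    samples = next(reader)[1:]
    counts = {row[0]: [float(value) for value in row[1:]] for row in reader}
    return samples, counts


def sum_by_annotation(samples, counts, annotations):
    """Normalise each sample to its total, keep annotated genes, sum per annotation."""
    n_samples = len(samples)
    # Totals are taken over all genes, annotated or not.
    totals = [sum(values[i] for values in counts.values()) for i in range(n_samples)]
    summed = defaultdict(lambda: [0.0] * n_samples)
    for gene, values in counts.items():
        annotation = annotations.get(gene)
        if annotation is None:
            continue  # gene without annotation: skipped
        for i, value in enumerate(values):
            if totals[i] > 0:
                summed[annotation][i] += value / totals[i]
    return {ko: dict(zip(samples, values)) for ko, values in summed.items()}


annotations = read_annotations(io.StringIO(annotation_tsv))
samples, counts = read_counts(io.StringIO(counts_tsv))
ko_table = sum_by_annotation(samples, counts, annotations)
print(ko_table)
```

For the real gzipped file, the `io.StringIO(...)` handle could be replaced by `gzip.open(path, "rt")`, assuming the header layout matches the toy table.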

Sofie8 commented 3 years ago

Hi Silas!

Ok I am looking, but I don't see a folder counts inside my genecatalog folder. How do I generate this median_coverage.tsv file please? (I get median_coverage_genomes.tsv if I run atlas analyze, but that's for the genomes).

Metannotate basically works like this: you start from the QC'd, merged reads (no assembly), run fraggenescan++, and then do an HMM search for the genes you want. Lee and Jackson are fixing it, because the software broke with the GI number change. So basically you could have 3 'modules' in atlas: no-assembly-based gene-centric, assembly-based gene-centric, and assembly-based genome-centric.

SilasK commented 3 years ago

Look at the gene-abundance table issue (metagenome-atlas/atlas#276) for how to get the median coverage.

@jmtsuji I don't have experience with fraggenescan++, but what do you think of using PLASS from the mmseqs package? I think it is a state-of-the-art protein assembler that gets all the genes from even a complicated metagenome.