full names of mOTU.v1.padded reference sequences

Jigyasa3 commented 5 years ago

Is it possible to get the real names of the sequences in mOTU.v1.padded reference sequence file? For example, I want the full names behind FASTA headers such as:

1048834.TC41_3292 101 1561
158499257-stool1_revised_scaffold30828_1_gene95771 101 1546
596312.HMPREF0569_0078 101 1519
MH0062_revised_scaffold4351_12_gene25053 101 1483
etc.

I have checked this file- "All_2481_at_least_500.motu.nr.out.20180307.tsv". But it doesn't match.
Which version of the mOTU.padded file was used to train the fetchMG perl script to extract the marker genes?

AlessioMilanese commented 5 years ago

mOTUs version 1 is described in this Nature Methods paper, and the official website is here. This GitHub repository is for mOTUs version 2.

1. Is it possible to get the real names of the sequences in mOTU.v1.padded reference sequence file?

The genes in mOTUs v1 were extracted from 252 human fecal samples. The procedure is: first assemble the reads into contigs, then extract the genes (with Prodigal, for example), and finally run fetchMG to detect the 40 marker genes. I think the contigs (hence the "real names" that you refer to) are not available. You can probably work out which metagenomic sample a gene was extracted from: for example, in 158499257-stool1_revised_scaffold30828_1_gene95771, the prefix 158499257-stool1 matches an entry in Supplementary Table 3 of the paper.
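As a rough illustration of reading the sample off a header, here is a minimal Python sketch. It assumes that metagenome-derived v1 headers follow the pattern `<sample>_revised_scaffold<N>_<M>_gene<K>`, which is only inferred from the two examples in this thread; the function name `sample_of` is hypothetical and not part of mOTUs.

```python
import re
from typing import Optional

# Assumption: the sample ID is everything before "_revised_scaffold".
# This pattern is inferred from the example headers in this thread only.
_METAGENOME_HEADER = re.compile(
    r"^(?P<sample>.+?)_revised_scaffold\d+_\d+_gene\d+$"
)


def sample_of(header: str) -> Optional[str]:
    """Return the sample ID embedded in a v1 header, or None if the
    header does not follow the assumed metagenome naming pattern."""
    # Drop a leading ">" and the trailing padding coordinates.
    gene_id = header.lstrip(">").split()[0]
    match = _METAGENOME_HEADER.match(gene_id)
    return match.group("sample") if match else None
```

Headers such as `1048834.TC41_3292 101 1561` do not match the pattern, so the function returns `None` for them. The returned sample ID still has to be looked up by hand in Supplementary Table 3.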

I have checked this file- "All_2481_at_least_500.motu.nr.out.20180307.tsv". But it doesn't match.

This file (and, in general, this whole GitHub repository) refers to mOTUs version 2, not version 1.

2. Which version of the mOTU.padded file was used to train the fetchMG perl script to extract the marker genes?

fetchMG was trained on a manually curated alignment of reference genomes downloaded from NCBI. See the Online Methods of the Nature Methods paper for more information:

Identification of single-copy marker genes. Profile hidden Markov models (HMMs) were generated using the hmmbuild and hmmsearch programs of HMMER [26] (v3) for 40 universal single-copy MGs [8, 9] based on multiple-sequence alignments of their orthologous groups that had been previously identified in 1,497 prokaryotic genomes [17, 27, 28]. Prokaryotic reference genome sequences were downloaded (February 2012) from the US National Center for Biotechnology Information (NCBI) genomes database. As a quality filter, we removed complete genomes with less than 30 MGs and genomes with more than 500 contigs, yielding a set of 3,496 reference genomes (Supplementary Table 1). For these genomes, we used hmmsearch with a bit score cutoff of 60 to identify 138,132 MGs (39.5 per genome) in >11 million proteins by selecting the highest-scoring target sequence (best hit) for each MG in each genome (Supplementary Table 2). Compared to a BLAST-based annotation of the same set of proteins, the HMM-based procedure was four orders of magnitude faster in terms of computing time (17.5 versus 134,000 CPU h). In addition to increasing the speed of MG identification, it is important to minimize the FDR when extending these search methods to metagenomic data, because selecting the best hit (as done for reference genomes) is not possible and a bit score threshold must be used instead. For example, when simply using HMMs with a cutoff of 60 bits on the set of 3,496 genomes, 15.7% more genes that are likely false positives were identified compared to the best hits only (Supplementary Table 2). Thus, we calibrated MG-specific bit-score cutoffs (Supplementary Table 2) by maximizing the accuracy (F score) of MG identification using a training set of 1,004 well-annotated prokaryotic genomes that are available in the eggNOG (evolutionary genealogy of genes: Nonsupervised Orthologous Groups; v3.0) database [28].
When repeating the search using these calibrated cutoffs, the increase in the number of identified sequences compared to selecting the best hit only was 3.3% (expected, 39.51 MGs per genome; observed, 40.81 MGs) for all 40 MGs. This was reduced to 1.0% (expected, 38.52 per genome; observed, 38.90 MGs) when COG0085 was excluded, which alone accounted for 71% of all putative false positives (Supplementary Table 2).
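To make the cutoff calibration idea more concrete, here is a toy Python sketch of picking a bit-score cutoff that maximizes the F score for one marker gene. It is not the code used for the paper: the function names are hypothetical, the F score is assumed to be the balanced F1, and every true member is assumed to appear among the scored hits.

```python
def f_score(tp: int, fp: int, fn: int) -> float:
    """Balanced F score (F1) from true/false positives and false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def calibrate_cutoff(hits):
    """Pick the bit-score cutoff with the highest F score.

    hits: list of (bit_score, is_true_member) pairs for one marker gene,
    where is_true_member comes from a trusted annotation (assumption).
    Returns (best_cutoff, best_f_score); ties keep the lowest cutoff.
    """
    total_true = sum(1 for _, is_true in hits if is_true)
    best_cutoff, best_f = None, -1.0
    # Only observed scores need to be tried as candidate cutoffs.
    for cutoff in sorted({score for score, _ in hits}):
        tp = sum(1 for s, t in hits if s >= cutoff and t)
        fp = sum(1 for s, t in hits if s >= cutoff and not t)
        fn = total_true - tp
        f = f_score(tp, fp, fn)
        if f > best_f:
            best_cutoff, best_f = cutoff, f
    return best_cutoff, best_f


# Made-up scores for illustration only.
toy_hits = [(320.0, True), (250.5, True), (180.0, True),
            (95.0, False), (72.0, True), (61.0, False)]
cutoff, best = calibrate_cutoff(toy_hits)
```

With these made-up scores, the chosen cutoff is 72.0: it keeps all four true members and admits only one false positive.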

Jigyasa3 commented 5 years ago

Thank you so much for your reply! I was wondering which metagenome sequences were used to generate version 2 of mOTU? And where can I find the mOTU.v.2.padded fasta sequences?

AlessioMilanese commented 5 years ago

And where can I find the mOTU.v.2.padded fasta sequences?

You can download the database from Zenodo. Inside, you can find mOTU.v2b.nr.padded, a FASTA file with the gene sequences of the 10 marker genes used for mOTUs.

The header gives you information about the gene, for example:

>metaMG0016206.COG0215 101 1504

>meta means that the gene was extracted from metagenomic samples, while >ref means that it comes from reference genomes (ProGenomes). The numbers 101 and 1504 refer to the padding: if you take the substring of the padded sequence between those two positions, you get the gene sequence itself.
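A minimal Python sketch of that substring step follows. It assumes the two numbers are 1-based, inclusive start and end positions within the padded sequence; that convention is not stated in this thread, so check it against the file. The function names are hypothetical.

```python
def parse_padded_header(header: str):
    """Split a header like '>metaMG0016206.COG0215 101 1504' into
    (name, origin, start, end)."""
    name, start, end = header.lstrip(">").split()
    if name.startswith("meta"):
        origin = "metagenome"
    elif name.startswith("ref"):
        origin = "reference genome"
    else:
        origin = "unknown"
    return name, origin, int(start), int(end)


def unpad(padded_seq: str, start: int, end: int) -> str:
    """Return the gene without padding.

    Assumption: start and end are 1-based and inclusive, so they map to
    the Python slice [start - 1:end].
    """
    return padded_seq[start - 1:end]


# Toy sequence for illustration: 100 padding bases on each side.
name, origin, start, end = parse_padded_header(">metaMG0016206.COG0215 101 1504")
toy_padded = "N" * 100 + "ATG" * 468 + "N" * 100
gene = unpad(toy_padded, start, end)
```

Under that assumption the example header describes a gene of 1504 - 101 + 1 = 1404 bases, which is what the toy sequence yields.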

AlessioMilanese commented 5 years ago

which metagenome sequences were used to generate version 2 of mOTU?

Most samples were obtained from human microbiome studies, including 1,210 samples from different major human body sites (oral, skin, gut and vaginal) [14, 15] and 1,693 further samples from various human gut microbiome studies [16, 17, 18, 19, 20, 21]. In addition, we used 243 metagenomic samples from ocean environments [22].

[14] Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature 486, 207-214 (2012).
[15] Lloyd-Price, J., et al. Strains, functions and dynamics in the expanded Human Microbiome Project. Nature 550, 61-66 (2017).
[16] Feng, Q., et al. Gut microbiome development along the colorectal adenoma-carcinoma sequence. Nat. Commun. 6, 6528 (2015).
[17] Karlsson, F. H., et al. Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature 498, 99-103 (2013).
[18] Qin, J., et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464, 59-65 (2010).
[19] Qin, J., et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490, 55-60 (2012).
[20] Voigt, A. Y., et al. Temporal and technical variability of human gut metagenomes. Genome Biol. 16, 73 (2015).
[21] Zeller, G., et al. Potential of fecal microbiota for early-stage detection of colorectal cancer. Mol. Syst. Biol. 10, 766 (2014).
[22] Sunagawa, S., et al. Structure and function of the global ocean microbiome. Science 348, 1261359 (2015).

The mOTUs version 2 paper will soon be published in Nature Communications, where you will find more information. I can keep you updated when the paper is available!

Jigyasa3 commented 5 years ago

Thank you so much for all the information!

AlessioMilanese commented 5 years ago

No problem. Let me know if you have any other questions. I'm closing the issue.

motu-tool / mOTUs, issue #18