snayfach / IGGdb

Database of genomes integrated from the gut microbiome and other environments
GNU General Public License v3.0

IGGdb: integrated genomes from the gut microbiome and other environments

We constructed the IGGdb using data from a number of sources, including PATRIC, IMG, and MAGs assembled from 3,810 publicly available human gut metagenomes from the NCBI SRA. If this repository is useful, please also consider citing the individual sources and studies from which the data were derived.

Note that many of the genomes in this repository were assembled from metagenomes. Lower quality genomes and MAGs need to be treated with special care to avoid issues with missingness, contamination, and short contigs. When in doubt, restrict your analysis to the high-quality genome sets.
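One way to act on the short-contig caveat above is to drop contigs below a length cutoff before analysis. The sketch below is only an illustration: the function names and the `min_length` value are assumptions, not part of the IGGdb pipeline, and the right cutoff depends on your analysis.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_short_contigs(records, min_length):
    """Keep only contigs whose sequence is at least min_length bases long."""
    return [(h, s) for h, s in records if len(s) >= min_length]


# Toy input; with a real genome, pass an open file handle to read_fasta().
demo_lines = [">contig_a", "ACGT", "ACGT", ">contig_b", "AC"]
records = list(read_fasta(demo_lines))
# min_length=5 is an arbitrary value chosen for this toy example.
long_contigs = drop_short_contigs(records, min_length=5)
print([header for header, _ in long_contigs])
```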

After download, all datasets can be unpacked using: tar -xjvf <dataset_name>
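If you prefer to unpack from a script instead of the shell, the same step can be sketched with Python's standard `tarfile` module. The function name `unpack_dataset` and the destination directory are assumptions made for this example.

```python
import tarfile
from pathlib import Path


def unpack_dataset(archive_path, dest_dir="."):
    """Extract a .tar.bz2 archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:bz2" opens a bzip2-compressed tarball, like the -j flag of tar.
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        names = archive.getnames()
        archive.extractall(path=dest)
    return names
```

For example, `unpack_dataset("HGM_v1.0_hq_24345_fna.tar.bz2", "HGM_hq")` would extract the high-quality MAG archive into a directory named `HGM_hq` (a hypothetical name).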

Human Gut MAG (HGM) dataset

MAGs were assembled from 3,810 human gut metagenomes from 15 different studies of geographically and phenotypically diverse human subjects. Metagenomes were assembled with MEGAHIT, and MAGs were constructed by binning contigs per sample on the basis of nucleotide composition and read depth. This was performed using four existing tools: MaxBin, MetaBAT, CONCOCT, and DAS Tool. MAGs were screened for contamination using a custom pipeline.

High-quality MAGs (N=24,345)
download from your browser: download link
download via wget: wget https://portal.nersc.gov/cfs/m342/HGM/HGM_v1.0_hq_24345_fna.tar.bz2

High- and medium-quality MAGs (N=60,664)
download from your browser: download link
download via wget: wget https://portal.nersc.gov/cfs/m342/HGM/HGM_v1.0_all_60664_fna.tar.bz2

Integrated genomes from the gut and other environments dataset (IGG)

Note: this set includes non-gut genomes

The 60,664 genomes from the HGM dataset were integrated with 145,917 reference genomes from PATRIC and IMG, which include 16,525 publicly available MAGs from other studies as well as genomes from non-gut environments. All 206,581 genomes met the MIMAG medium-quality draft genome standard of >=50% completeness and <=10% contamination. Genomes were clustered into 23,790 species-level OTUs based on 95% genome-wide average nucleotide identity.
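The quality threshold above can be written as a simple predicate, which may be handy when filtering your own genomes to the same standard. This is a minimal sketch: the field names (`genome_id`, `completeness`, `contamination`) and the example records are invented for illustration and are not taken from the IGGdb metadata files.

```python
def meets_mimag_medium(completeness, contamination):
    """True if a genome has >=50% completeness and <=10% contamination."""
    return completeness >= 50.0 and contamination <= 10.0


# Hypothetical records; the values are made up for this example.
genomes = [
    {"genome_id": "example_1", "completeness": 97.2, "contamination": 1.1},
    {"genome_id": "example_2", "completeness": 48.0, "contamination": 0.5},
    {"genome_id": "example_3", "completeness": 75.0, "contamination": 12.3},
]

kept = [
    g["genome_id"]
    for g in genomes
    if meets_mimag_medium(g["completeness"], g["contamination"])
]
print(kept)

# The genome totals quoted above add up: HGM MAGs plus reference genomes.
total_genomes = 60_664 + 145_917
print(total_genomes)
```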

Representative genomes for all species (N=23,790)
download from your browser: download link
download via wget: wget https://portal.nersc.gov/cfs/m342/HGM/IGG_v1.0_all_23790_fna.tar.bz2

Representative genomes for species with a high-quality genome (N=16,136)
download from your browser: download link
download via wget: wget https://portal.nersc.gov/cfs/m342/HGM/IGG_v1.0_hq_16136_fna.tar.bz2

Representative genomes for human gut species (N=4,558)
download from your browser: download link
download via wget: wget https://portal.nersc.gov/cfs/m342/HGM/IGG_v1.0_gut_4558_fna.tar.bz2

Representative genomes for human gut species with a high-quality genome (N=2,935)
download from your browser: download link
download via wget: wget https://portal.nersc.gov/cfs/m342/HGM/IGG_v1.0_gut_2935_fna.tar.bz2
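For scripted downloads, the archive names listed in the wget commands above can be collected in one place. In this sketch the file names and base URL are copied from those commands, while the short keys (`hgm_hq`, `igg_gut`, and so on) and the function names are assumptions made for the example.

```python
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = "https://portal.nersc.gov/cfs/m342/HGM/"

# Keys are hypothetical shorthand; values are the archive names listed above.
ARCHIVES = {
    "hgm_hq": "HGM_v1.0_hq_24345_fna.tar.bz2",
    "hgm_all": "HGM_v1.0_all_60664_fna.tar.bz2",
    "igg_all": "IGG_v1.0_all_23790_fna.tar.bz2",
    "igg_hq": "IGG_v1.0_hq_16136_fna.tar.bz2",
    "igg_gut": "IGG_v1.0_gut_4558_fna.tar.bz2",
    "igg_gut_hq": "IGG_v1.0_gut_2935_fna.tar.bz2",
}


def dataset_url(key):
    """Return the full download URL for one of the archives."""
    return BASE_URL + ARCHIVES[key]


def download_dataset(key, dest_dir="."):
    """Download one archive into dest_dir and return the local path."""
    dest = Path(dest_dir) / ARCHIVES[key]
    dest.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(dataset_url(key), str(dest))
    return dest
```

Calling `download_dataset("igg_gut_hq")` would fetch the smallest IGG archive into the current directory; nothing is downloaded until the function is called.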

Phylogenome trees

Phylogenetic trees were constructed for all Bacterial and Archaeal species in the IGGdb using concatenated alignments of conserved, single-copy marker gene families from the PhyEco database (N=88 for Bacteria and N=100 for Archaea). Protein-based multiple sequence alignment was performed using FAMSA v1.2.5, which is designed for fast and accurate alignment of thousands of sequences. Gene alignments were concatenated, columns with >15% gaps were dropped, and sequences with >70% gaps were removed (N=39). FastTree2 was used to build a maximum likelihood phylogeny.
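The two gap filters described above can be sketched in plain Python. This is an illustration only, not the code used to build the IGGdb trees: the function names are invented, and it assumes that sequence gap fractions are computed after the gappy columns have been dropped, which the description does not state explicitly.

```python
def gap_fraction(chars):
    """Fraction of characters that are gaps ('-')."""
    return sum(1 for c in chars if c == "-") / len(chars)


def trim_alignment(alignment, max_col_gaps=0.15, max_seq_gaps=0.70):
    """Drop gappy columns, then gappy sequences.

    alignment maps sequence names to aligned sequences of equal length.
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    n_cols = len(seqs[0])
    # Step 1: keep columns whose gap fraction does not exceed max_col_gaps.
    keep_cols = [
        j for j in range(n_cols)
        if gap_fraction([s[j] for s in seqs]) <= max_col_gaps
    ]
    trimmed = {
        name: "".join(s[j] for j in keep_cols)
        for name, s in zip(names, seqs)
    }
    # Step 2 (assumed order): remove sequences that are still mostly gaps.
    return {
        name: s for name, s in trimmed.items()
        if s and gap_fraction(s) <= max_seq_gaps
    }


# Toy alignment of 10 sequences and 6 columns.
alignment = {f"sp{i}": "ACGTAC" for i in range(8)}
alignment["sp8"] = "A-GTAC"  # one gap, in the second column
alignment["sp9"] = "-----C"  # mostly gaps
trimmed = trim_alignment(alignment)
print(len(trimmed), sorted(set(trimmed.values())))
```

In the toy input, the second column has gaps in 2 of 10 sequences (20%) and is dropped; `sp9` is then 80% gaps over the remaining columns and is removed.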

All Bacterial species (N=22,515)
download alignments
download tree

All Archaeal species (N=1,236)
download alignments
download tree
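As a quick sanity check, the species counts for the two trees can be compared with the total number of species-level OTUs. The reading that the difference corresponds to the 39 sequences removed for excess gaps is an inference from the numbers, not something this README states.

```python
total_species = 23_790      # species-level OTUs in the IGGdb
removed_sequences = 39      # sequences removed for >70% gaps
bacteria_in_tree = 22_515
archaea_in_tree = 1_236

in_trees = bacteria_in_tree + archaea_in_tree
print(in_trees)                  # 23751
print(total_species - in_trees)  # 39
```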

Metadata for the HGM and IGG datasets

Reference genomes and MAGs from the HGM dataset (N=206,581)
download link

All species from the IGGdb (N=23,790)
download link