See: https://portal.nersc.gov/MGV/
Viral detection pipeline: Identify viral contigs >=1 kb using the pipeline described in the manuscript.
Quality control: Identify and remove putative host regions flanking viral contigs. Quantify genome completeness and apply genome quality standards.
Cluster genomes based on ANI: Code to compute average nucleotide identity (ANI) and perform centroid-based clustering. Used to identify species-level viral clusters.
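As a rough illustration of centroid-based clustering, the sketch below greedily assigns genomes to the longest unclustered genome they match. It is not the repository's script: the function name `centroid_cluster`, the input dictionaries, and the `min_ani=95.0` default are assumptions made for this example, and a real run would normally also filter on alignment coverage.

```python
def centroid_cluster(lengths, ani, min_ani=95.0):
    """Greedy centroid clustering (illustrative sketch only).

    lengths: dict mapping genome id -> genome length in bp
    ani:     dict mapping (genome_a, genome_b) -> ANI in percent
    min_ani: hypothetical threshold; adjust to your own standard
    Returns a dict mapping every genome id -> id of its centroid.
    """
    # Visit the longest genomes first so they become centroids;
    # ties are broken by id to keep the result deterministic.
    order = sorted(lengths, key=lambda g: (-lengths[g], g))
    centroid_of = {}
    for genome in order:
        if genome in centroid_of:
            continue  # already absorbed by an earlier centroid
        centroid_of[genome] = genome
        for other in order:
            if other in centroid_of:
                continue
            # ANI may have been recorded in either direction
            score = max(ani.get((genome, other), 0.0),
                        ani.get((other, genome), 0.0))
            if score >= min_ani:
                centroid_of[other] = genome
    return centroid_of


lengths = {"a": 5000, "b": 4000, "c": 3000}
ani = {("a", "b"): 97.0, ("b", "c"): 96.0}
clusters = centroid_cluster(lengths, ani)
```

In this toy input, "c" matches only "b", which is not a centroid, so "c" starts its own cluster. That is the main behavioral difference between centroid-based and single-linkage clustering.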
Cluster genomes based on AAI: Code to compute average amino acid identity (AAI) and perform MCL-based clustering. Used to identify genus-level and family-level viral clusters.
Create SNP phylogenetic trees: Identify SNPs in core-genome regions based on whole-genome alignments, then build a phylogenetic tree from those SNPs. Used in the manuscript to create strain-level phylogenies for species-level viral clusters.
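The idea of core-genome SNP calling can be sketched as follows. This is a simplified illustration, not the repository's code: it assumes the whole-genome alignment is already available as equal-length aligned strings, and it defines a "core" column as one where every genome has an unambiguous A, C, G, or T. The function name `core_snp_columns` is hypothetical.

```python
def core_snp_columns(alignment):
    """Return 0-based alignment columns that are core-genome SNPs.

    alignment: dict mapping genome id -> aligned sequence (equal lengths)
    A column is treated as core if no genome has a gap or ambiguous base,
    and as a SNP if it is core and holds more than one distinct base.
    """
    seqs = list(alignment.values())
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must have equal length")
    valid = set("ACGT")
    snps = []
    for i in range(length):
        column = {s[i].upper() for s in seqs}
        if not column <= valid:
            continue  # gap or ambiguous base: not a core position
        if len(column) > 1:
            snps.append(i)
    return snps


alignment = {
    "g1": "ACGT-A",
    "g2": "ACTTAA",
    "g3": "ACGTAN",
}
snp_positions = core_snp_columns(alignment)
```

The bases at the returned columns can then be written out as a reduced alignment and passed to a tree-building program of your choice.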
Create marker-gene phylogenetic trees: Identify prevalent single-copy genes in a viral clade, then use concatenated gene alignments to build a phylogenetic tree.
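A minimal sketch of the concatenation step is shown below, assuming each marker gene has already been aligned separately. Genomes that lack a gene are padded with gaps so that all concatenated sequences keep the same length. The function name `concatenate_alignments` and the input layout are assumptions for this example, not the repository's interface.

```python
def concatenate_alignments(gene_alignments, genomes):
    """Concatenate per-gene alignments into one supermatrix.

    gene_alignments: list of dicts, each mapping genome id -> aligned
                     sequence for one marker gene
    genomes:         list of genome ids to include in the output
    Returns a dict mapping genome id -> concatenated aligned sequence.
    """
    parts = {g: [] for g in genomes}
    for aln in gene_alignments:
        widths = {len(seq) for seq in aln.values()}
        if len(widths) != 1:
            raise ValueError("each gene alignment needs one shared width")
        width = widths.pop()
        for g in genomes:
            # Pad genomes that are missing this gene with gap characters
            parts[g].append(aln.get(g, "-" * width))
    return {g: "".join(chunks) for g, chunks in parts.items()}


gene_alignments = [
    {"g1": "MK-", "g2": "MKL"},
    {"g1": "AV"},  # g2 lacks this marker gene
]
supermatrix = concatenate_alignments(gene_alignments, ["g1", "g2"])
```

In practice it is common to drop genomes that carry too few of the marker genes before tree building, since heavily gap-padded rows contribute little phylogenetic signal.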
Identify CRISPR spacers: Identify CRISPR spacers using CRT and PILER-CR, merge redundant CRISPR arrays, and format the output.
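One plausible way to merge redundant arrays reported by two tools is to collapse predictions whose coordinates overlap on the same contig. The sketch below illustrates that idea only. The tuple layout, the function name `merge_overlapping_arrays`, and the overlap rule are assumptions, and the repository's own merging criteria may differ.

```python
def merge_overlapping_arrays(arrays):
    """Merge CRISPR array predictions that overlap on the same contig.

    arrays: list of (contig, start, end, tool) tuples, with start <= end
    Returns a list of (contig, start, end, tools) tuples, where tools is
    the set of tool names that supported the merged array.
    """
    merged = []
    # Sorting groups arrays by contig and orders them by start coordinate
    for contig, start, end, tool in sorted(arrays):
        if merged and merged[-1][0] == contig and start <= merged[-1][2]:
            prev_contig, prev_start, prev_end, prev_tools = merged[-1]
            # Extend the previous array and record the supporting tool
            merged[-1] = (prev_contig, prev_start,
                          max(prev_end, end), prev_tools | {tool})
        else:
            merged.append((contig, start, end, {tool}))
    return merged


predictions = [
    ("c1", 100, 500, "CRT"),
    ("c1", 450, 520, "PILERCR"),
    ("c1", 900, 1200, "CRT"),
    ("c2", 100, 300, "PILERCR"),
]
merged_arrays = merge_overlapping_arrays(predictions)
```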
For any other code or analysis inquiries, please open a GitHub issue. Note: most of these scripts were written for Python 2. If you get an error under Python 3, try re-running with Python 2.
If this code is useful, please cite: Nayfach et al. Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome. Nature Microbiology, 2021. https://www.nature.com/articles/s41564-021-00928-6