Users should be able to optionally specify one or more reference genomes (preferably in a GenBank file) for reference mapping, variant calling and consensus sequence generation.
If reference genome(s) are provided, select the top distinct reference genomes using Mash. It may be desirable to allow users to reference map against a set of distinct priority genomes for easy analysis of unknown metagenomic samples.
Calculate pairwise Mash distances between all reference genomes if the file contains more than one genome
Split genomes into clusters based on matrix of Mash distances
Mash screen reads to select top reference genome in each cluster; remove clusters with no Mash screen results from further analysis
Map reads against the top reference genome from each cluster, producing a distinct consensus sequence per cluster
Phylogenetic tree construction would need to be performed on a cluster-by-cluster basis, since a single tree that mixes closely related and very distantly related organisms would not show useful phylogenetic relationships.
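The clustering and top-reference selection steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the pairwise distances and Mash screen identities are taken as already parsed into plain dictionaries, the function names `cluster_genomes` and `pick_top_references` are hypothetical, single-linkage grouping is just one possible way to split the distance matrix, and the `max_distance` cutoff of 0.1 is an arbitrary placeholder rather than a recommended value.

```python
from typing import Dict, List, Tuple


def cluster_genomes(
    genomes: List[str],
    distances: Dict[Tuple[str, str], float],
    max_distance: float = 0.1,  # assumed cutoff, not taken from the source
) -> List[List[str]]:
    """Single-linkage grouping: genomes within max_distance share a cluster."""
    parent = {g: g for g in genomes}

    def find(g: str) -> str:
        # Follow parent links to the cluster representative (with path halving).
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), dist in distances.items():
        if dist <= max_distance:
            parent[find(a)] = find(b)

    clusters: Dict[str, List[str]] = {}
    for g in genomes:
        clusters.setdefault(find(g), []).append(g)
    return list(clusters.values())


def pick_top_references(
    clusters: List[List[str]],
    screen_identity: Dict[str, float],
) -> List[str]:
    """Pick the best-scoring genome per cluster; skip clusters with no hits."""
    top = []
    for cluster in clusters:
        hits = [g for g in cluster if g in screen_identity]
        if not hits:
            continue  # no Mash screen result for this cluster: drop it
        top.append(max(hits, key=lambda g: screen_identity[g]))
    return top
```

Each genome returned by `pick_top_references` would then be the mapping target for its cluster, and any downstream tree building would operate on one cluster at a time.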
If no reference genome(s) are provided, use the Kraken2 and Centrifuge classification results to select appropriate reference genomes and download them from NCBI. Proceed with the analysis as described above.
Issues:
Minimum read count threshold? At least 10 reads classified.
Only pick reference genomes that fall under a specified taxonomic group? For example, Viruses?
Species-level predictions only? If using genus-level predictions, download all genomes belonging to that genus?
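To make the open questions above concrete, here is a minimal sketch of how the three filters (read threshold, taxonomic group, rank) might be applied to classification results. It assumes the Kraken2 and Centrifuge outputs have already been parsed into simple records; the `TaxonHit` class, its fields, and the `select_taxa` function are hypothetical names used only for illustration, and the defaults simply mirror the values floated in the questions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TaxonHit:
    """One classified taxon (hypothetical record, not a Kraken2/Centrifuge type)."""
    taxid: int
    name: str
    rank: str                 # e.g. "species" or "genus"
    reads: int                # number of reads classified to this taxon
    lineage: Tuple[str, ...]  # ancestor names, root first


def select_taxa(
    hits: List[TaxonHit],
    min_reads: int = 10,                     # "at least 10 reads classified"
    rank: str = "species",                   # species-level predictions only
    required_ancestor: Optional[str] = None, # e.g. "Viruses"
) -> List[TaxonHit]:
    """Keep taxa passing all filters, ordered by read count (highest first)."""
    selected = [
        h for h in hits
        if h.reads >= min_reads
        and h.rank == rank
        and (required_ancestor is None or required_ancestor in h.lineage)
    ]
    return sorted(selected, key=lambda h: h.reads, reverse=True)
```

The taxids of the selected records would be the candidates for downloading reference genomes from NCBI; switching `rank` to `"genus"` would require an extra step to expand each genus into its member genomes.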