MAGnet

MAGnet is a computational tool that uses reference based method to refine taxonomic assignment and provide accurate abundance estimation. This pipeline handles reference genome selection, download using NCBI Datasets, read alignment using Minimap2, and breadth and depth of genome coverage calculations.

Dependencies

Python 3.9
NCBI Datasets v15.27.1
Minimap2 v2.24-r1122
Samtools v1.15.1
Biopython
Pandas
Ete3 v3.1.2
BWA v0.7.17

Getting started

usage: magnet.py [-h] -c CLASSIFICATION -i FASTQ [-I FASTQ2] [-m {ont,illumina}] -o OUTPUT [-t TAXID_IDX]
                 [-a ABUNDANCE_IDX] [--min-abundance MIN_ABUNDANCE] [--min-mapq MIN_MAPQ]
                 [--min-covscore MIN_COVSCORE] [--threads THREADS] [--include-mag] [--subspecies]

Universal Taxonomic Classification Verifier.

optional arguments:
  -h, --help            show this help message and exit
  -c CLASSIFICATION, --classification CLASSIFICATION
                        Path to the Taxonomic Classification Report. Accepting csv/tsv file format, other text formats
                        are treated as tsv.
  -i FASTQ, --fastq FASTQ
                        Path to the first fastq file.
  -I FASTQ2, --fastq2 FASTQ2
                        Path to the second fastq file for paired-end reads.
  -m {ont,illumina}, --mode {ont,illumina}
                        Modes for different sequencing platforms [ont, illumina]. Default:[ont]
  -o OUTPUT, --output OUTPUT
                        Path to the output directory.
  -t TAXID_IDX, --taxid-idx TAXID_IDX
                        The column index (0-based) of the taxids. Default:[0]
  -a ABUNDANCE_IDX, --abundance-idx ABUNDANCE_IDX
                        The column index (0-based) of the abundance. Default:[None]
  --min-abundance MIN_ABUNDANCE
                        Minimum abundance (0-1) for pre-filtering, exclude taxa below the threshold.
  --min-mapq MIN_MAPQ   Minimum MAPQ for primary alignments. Default:[20]
  --min-covscore MIN_COVSCORE
                        Minimum Coverage Score for supplementary alignments. Default:[0.7]
  --threads THREADS     Number of threads for Multi-threading. Default:[1]
  --include-mag         Include metagenomic assemble genomes. Default:[False]
  --subspecies          Verify taxonomic classification at subspecies rank. Default:[False]

Example of Using Lemur Report as Input for ONT datasets

Since Lemur report uses a format that taxid is the first column, we set '-t' to '0', which is already the default setting, and can be omit. Using the following command to run Magnet.

magnet.py -c {lemur_report_file} -i {input_fastq_file} -o {output_path} -m ont

Example of Using Kraken2 Report as Input for ONT datasets

Since kraken2 report uses a format that taxid is the fifth column, we set '-t' to '4', and since the abundance estimation is located in the first column, we set '-a' to '0'. To filter out potiential noise at low abundance, we could set '--min-abundance' to 0.001. Therefore, we could use the following command.

magnet.py -c {kraken2_report_file} -i {input_fastq_file} -o {output_path} -m ont -t 4 -a 0 --min-abundance 0.001

Example Output

The following table is an example of the output csv file 'cluster_representative.csv'.

Taxonomy ID	Assembly Accession ID	Source Database	Is Representative	Assembly Level	Organism of Assembly	Strain	Total Length	Downloaded	Species	Cluster Representative	Cluster Members	Secondary Breadth	Secondary Expected	Secondary Score	Secondary Depth	Primary Breadth	Primary Expected	Primary Score	Primary Depth	Consensus ANI	Combined PS and ANI (Sqrt(ANI)xPSx100)	Presence/Absence
2678528	GCF_028867355.1	SOURCE_DATABASE_REFSEQ	False	Complete Genome	Listeria sp. LM90SB2	LM90SB2	2915834.0	True	Listeria sp. LM90SB2	True	GCF_028867355.1,GCF_000196035.1	1.0	1.0	1.0	3195.6499999657044	1.0	1.0	1.0	3188.0799997530726	0.9957277140090639	99.79	Present
96241	GCF_006094475.1	SOURCE_DATABASE_REFSEQ	True	Complete Genome	Bacillus spizizenii ATCC 6633 = JCM 2499	ATCC 6633	4045538.0	True	Bacillus spizizenii	True	GCF_006094475.1	0.9999987640704401	0.9999999999999607	0.9999987640704794	30.85103792256793	0.9999987640704401	0.999999999999958	0.999998764070482	30.786837976602836	0.9952862581301721	99.76	Present
1613	GCF_029961225.1	SOURCE_DATABASE_REFSEQ	True	Complete Genome	Limosilactobacillus fermentum	EFEL6800	2103331.0	True	Limosilactobacillus fermentum	True	GCF_029961225.1	0.17646569702852932	0.1786636068369798	0.9876980553154514	1.1147205952452701	0.17275099611416372	0.17304955697814606	0.9982747088799537	1.0992270709852185	0.853534466277766	92.23	Present
287	GCF_000006765.1	SOURCE_DATABASE_REFSEQ	True	Complete Genome	Pseudomonas aeruginosa PAO1	PAO1	6264404.0	True	Pseudomonas aeruginosa	True	GCF_024714315.1,GCF_000006765.1	0.9736496879830867	1.0	0.9736496879830867	133.57679009269043	0.9736029157761855	1.0	0.9736029157761855	133.2863715317914	0.9790992928162352	96.34	Present
4932	GCF_000146045.2	SOURCE_DATABASE_REFSEQ	True	Complete Genome	Saccharomyces cerevisiae S288C	S288C	12071326.0	True	Saccharomyces cerevisiae	True	GCF_000146045.2	0.9702460413067091	0.9988842650498445	0.9713297878991953	7.005664504492986	0.9551612822296097	0.9987129198063666	0.9563922357335672	6.96673509016021	0.8611351208262882	88.75	Present
28901	GCF_000006945.2	SOURCE_DATABASE_REFSEQ	True	Complete Genome	Salmonella enterica subsp. enterica serovar Typhimurium str. LT2	LT2	4951383.0	True	Salmonella enterica	True	GCF_000006945.2,GCF_014492125.1,GCF_022556855.1,GCF_014491785.1	0.7829200506438564	0.8320860674724839	0.9409123421835766	2.2776524817939543	0.7817768582280826	0.8295483061281343	0.9424126991193291	2.2617706866885654	0.9299024245256468	90.88	Present
562	GCF_000008865.2	SOURCE_DATABASE_REFSEQ	True	Complete Genome	Escherichia coli O157:H7 str. Sakai	Sakai substr. RIMD 0509952	5594605.0	True	Escherichia coli	True	GCF_000008865.2	0.6591498747494352	0.7795045795270928	0.8456010292451726	2.2926196898951465	0.6569027483105632	0.7772529211094938	0.8451595747917728	2.284965162259128	0.8996365155332352	80.16	Present
1913579	GCF_001875785.1	SOURCE_DATABASE_REFSEQ	False	Scaffold	Bacillus sp. FMQ74	FMQ74	4161595.0	True	Bacillus sp. FMQ74	True	GCF_001875785.1	0.003557530225790833	0.03775324221474807	0.0942311180998676	10.81661600810537	0.002592275317516481	0.02381212312913672	0.10886367853291296	9.295791620318873	0.905627438899158	10.36	Absent

The table contains the following information on the reference genome that are being selected for alignment:

Taxonomy ID: The taxonomy ID of the species that users queries
Assembly Accession ID: The assembly accession ID of the genome that is being selected as reference genome for the alignment process
Source Database: source of the assembly, part of the metadata from NCBI Database
Is Representative: Whether or not the assembly is labeled as representative genome in NCBI database
Assembly Level: the assembly level of the assembly, can be one of the following: complete genome, chromosome, contig, scaffold
Organism of Assembly: The description of the assembly, may contain strain level information
Strain: The specific strain used for alignment
Species: taxonomy name of the species
Cluster Representative: Whether the genome is a cluster representative after ANI clustering
Cluster Members: Members of the same cluster

After the reference genomes are being selected, the pipeline create a reference in multi-fasta format by concatenate the reference genomes, and aligns the reads (in fasta format) to this concatenated reference, and calculated the following:

Secondary Breadth Coverage: The proportion of the genome has reads aligned, including secondary alignment and alignments with low MAPQ.
Secondary Expected Coverage: An estimation of the expected breadth of coverage based on the assumption that the alignment of the read is randomly distributed along the entire reference genome, excluding secondary alignment and alignments with low MAPQ.
Secondary Coverage Score: Secondary Breadth Coverage / Secondary Expected Coverage
Secondary Depth Coverage: The mean depth of coverage, calculated with aligned regions only, excluding secondary alignment and alignments with low MAPQ.
Primary Breadth Coverage: The proportion of the genome has reads aligned, excluding secondary alignment and alignments with low MAPQ.
Primary Expected Coverage: An estimation of the expected breadth of coverage based on the assumption that the alignment of the read is randomly distributed along the entire reference genome, excluding secondary alignment and alignments with low MAPQ.
Primary Coverage Score: Primary Breadth Coverage / Primary Expected Coverage
Primary Depth Coverage: The mean depth of coverage, calculated with aligned regions only, excluding secondary alignment and alignments with low MAPQ.
Consensus ANI: The ANI is calculated by comparing the consensus genome created by the primary alignment with the original reference genome. Unaligned regions are excluded.
Combined PS and ANI (Sqrt(ANI)xPSx100)
Presence/Absence Calls