treangenlab / magnet

7 stars 0 forks source link

MAGnet

MAGnet is a computational tool that uses reference based method to refine taxonomic assignment and provide accurate abundance estimation. This pipeline handles reference genome selection, download using NCBI Datasets, read alignment using Minimap2, and breadth and depth of genome coverage calculations.

Dependencies

Getting started

usage: magnet.py [-h] -c CLASSIFICATION -i FASTQ [-I FASTQ2] [-m {ont,illumina}] -o OUTPUT [-t TAXID_IDX]
                 [-a ABUNDANCE_IDX] [--min-abundance MIN_ABUNDANCE] [--min-mapq MIN_MAPQ]
                 [--min-covscore MIN_COVSCORE] [--threads THREADS] [--include-mag] [--subspecies]

Universal Taxonomic Classification Verifier.

optional arguments:
  -h, --help            show this help message and exit
  -c CLASSIFICATION, --classification CLASSIFICATION
                        Path to the Taxonomic Classification Report. Accepting csv/tsv file format, other text formats
                        are treated as tsv.
  -i FASTQ, --fastq FASTQ
                        Path to the first fastq file.
  -I FASTQ2, --fastq2 FASTQ2
                        Path to the second fastq file for paired-end reads.
  -m {ont,illumina}, --mode {ont,illumina}
                        Modes for different sequencing platforms [ont, illumina]. Default:[ont]
  -o OUTPUT, --output OUTPUT
                        Path to the output directory.
  -t TAXID_IDX, --taxid-idx TAXID_IDX
                        The column index (0-based) of the taxids. Default:[0]
  -a ABUNDANCE_IDX, --abundance-idx ABUNDANCE_IDX
                        The column index (0-based) of the abundance. Default:[None]
  --min-abundance MIN_ABUNDANCE
                        Minimum abundance (0-1) for pre-filtering, exclude taxa below the threshold.
  --min-mapq MIN_MAPQ   Minimum MAPQ for primary alignments. Default:[20]
  --min-covscore MIN_COVSCORE
                        Minimum Coverage Score for supplementary alignments. Default:[0.7]
  --threads THREADS     Number of threads for Multi-threading. Default:[1]
  --include-mag         Include metagenomic assemble genomes. Default:[False]
  --subspecies          Verify taxonomic classification at subspecies rank. Default:[False]

Example of Using Lemur Report as Input for ONT datasets

Since Lemur report uses a format that taxid is the first column, we set '-t' to '0', which is already the default setting, and can be omit. Using the following command to run Magnet.

magnet.py -c {lemur_report_file} -i {input_fastq_file} -o {output_path} -m ont

Example of Using Kraken2 Report as Input for ONT datasets

Since kraken2 report uses a format that taxid is the fifth column, we set '-t' to '4', and since the abundance estimation is located in the first column, we set '-a' to '0'. To filter out potiential noise at low abundance, we could set '--min-abundance' to 0.001. Therefore, we could use the following command.

magnet.py -c {kraken2_report_file} -i {input_fastq_file} -o {output_path} -m ont -t 4 -a 0 --min-abundance 0.001

Example Output

The following table is an example of the output csv file 'cluster_representative.csv'.

Taxonomy ID Assembly Accession ID Source Database Is Representative Assembly Level Organism of Assembly Strain Total Length Downloaded Species Cluster Representative Cluster Members Secondary Breadth Secondary Expected Secondary Score Secondary Depth Primary Breadth Primary Expected Primary Score Primary Depth Consensus ANI Combined PS and ANI (Sqrt(ANI)xPSx100) Presence/Absence
2678528 GCF_028867355.1 SOURCE_DATABASE_REFSEQ False Complete Genome Listeria sp. LM90SB2 LM90SB2 2915834.0 True Listeria sp. LM90SB2 True GCF_028867355.1,GCF_000196035.1 1.0 1.0 1.0 3195.6499999657044 1.0 1.0 1.0 3188.0799997530726 0.9957277140090639 99.79 Present
96241 GCF_006094475.1 SOURCE_DATABASE_REFSEQ True Complete Genome Bacillus spizizenii ATCC 6633 = JCM 2499 ATCC 6633 4045538.0 True Bacillus spizizenii True GCF_006094475.1 0.9999987640704401 0.9999999999999607 0.9999987640704794 30.85103792256793 0.9999987640704401 0.999999999999958 0.999998764070482 30.786837976602836 0.9952862581301721 99.76 Present
1613 GCF_029961225.1 SOURCE_DATABASE_REFSEQ True Complete Genome Limosilactobacillus fermentum EFEL6800 2103331.0 True Limosilactobacillus fermentum True GCF_029961225.1 0.17646569702852932 0.1786636068369798 0.9876980553154514 1.1147205952452701 0.17275099611416372 0.17304955697814606 0.9982747088799537 1.0992270709852185 0.853534466277766 92.23 Present
287 GCF_000006765.1 SOURCE_DATABASE_REFSEQ True Complete Genome Pseudomonas aeruginosa PAO1 PAO1 6264404.0 True Pseudomonas aeruginosa True GCF_024714315.1,GCF_000006765.1 0.9736496879830867 1.0 0.9736496879830867 133.57679009269043 0.9736029157761855 1.0 0.9736029157761855 133.2863715317914 0.9790992928162352 96.34 Present
4932 GCF_000146045.2 SOURCE_DATABASE_REFSEQ True Complete Genome Saccharomyces cerevisiae S288C S288C 12071326.0 True Saccharomyces cerevisiae True GCF_000146045.2 0.9702460413067091 0.9988842650498445 0.9713297878991953 7.005664504492986 0.9551612822296097 0.9987129198063666 0.9563922357335672 6.96673509016021 0.8611351208262882 88.75 Present
28901 GCF_000006945.2 SOURCE_DATABASE_REFSEQ True Complete Genome Salmonella enterica subsp. enterica serovar Typhimurium str. LT2 LT2 4951383.0 True Salmonella enterica True GCF_000006945.2,GCF_014492125.1,GCF_022556855.1,GCF_014491785.1 0.7829200506438564 0.8320860674724839 0.9409123421835766 2.2776524817939543 0.7817768582280826 0.8295483061281343 0.9424126991193291 2.2617706866885654 0.9299024245256468 90.88 Present
562 GCF_000008865.2 SOURCE_DATABASE_REFSEQ True Complete Genome Escherichia coli O157:H7 str. Sakai Sakai substr. RIMD 0509952 5594605.0 True Escherichia coli True GCF_000008865.2 0.6591498747494352 0.7795045795270928 0.8456010292451726 2.2926196898951465 0.6569027483105632 0.7772529211094938 0.8451595747917728 2.284965162259128 0.8996365155332352 80.16 Present
1913579 GCF_001875785.1 SOURCE_DATABASE_REFSEQ False Scaffold Bacillus sp. FMQ74 FMQ74 4161595.0 True Bacillus sp. FMQ74 True GCF_001875785.1 0.003557530225790833 0.03775324221474807 0.0942311180998676 10.81661600810537 0.002592275317516481 0.02381212312913672 0.10886367853291296 9.295791620318873 0.905627438899158 10.36 Absent

The table contains the following information on the reference genome that are being selected for alignment:

After the reference genomes are being selected, the pipeline create a reference in multi-fasta format by concatenate the reference genomes, and aligns the reads (in fasta format) to this concatenated reference, and calculated the following: