I found several inconsistencies in chromosome naming scheme when using the pipeline with rCRS option, and I would like to make some suggestions.
First, in test_config.sh, instruction for setting mtdb_fasta incorrectly states that the parameter should be set to MT.fa when using rCRS as reference:
##OPTIONAL. If MToolBox default installation (install.sh) was used, mtdb_fasta is chrRSRS.fa (RSRS). To use rCRS sequence as reference, please set mtdb_fasta=MT.fa Otherwise, please specify the file name of the mitochondrial reference fasta sequence you want to use.
##
#mtdb_fasta=chrRSRS.fa
If mtdb_fasta is set to MT.fa, it later breaks the pipeline when it gets to assembleMTgenome.py stage because it's looking for MT.fa file in genome_fasta/ directory and the file does not exist. The instruction should say mtdb_fasta should be set to chrM.fa, reflecting the name change.
Second, chromosome names must be updated from chrRCRS to chrM in all rCRS related files in MToolBox/data/directory (i.e. chrRCRS.fa, chrRCRS.fa.fai, chrRCRS.dict, intervals_file_RCRS.list, MITOMAP_HMTDB_known_indels_RCRS.vcf). Otherwise, this breaks the pipeline at indel realigning stage because GATK complains about not being able to match chrM with chrRCRS as below:
NFO 16:14:45,697 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:14:45,699 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11, Compiled 2013/10/10 17:27:51
INFO 16:14:45,699 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 16:14:45,699 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 16:14:45,703 HelpFormatter - Program Args: -T IndelRealigner -R bio/mtoolbox/MToolBox/data/chrRCRS.fa -I bio/m201/OUT_SAMPLE102/OUT.sam.bam -o bio/m201/OUT_SAMPLE102/OUT.realigned.bam -targetIntervals bio/mtoolbox/MToolBox/data/intervals_file_RCRS.list -known bio/mtoolbox/MToolBox/data/MITOMAP_HMTDB_known_indels_RCRS.vcf -compress 0
INFO 16:14:45,703 HelpFormatter - Date/Time: 2016/07/27 16:14:45
INFO 16:14:45,704 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:14:45,704 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:14:45,714 ArgumentTypeDescriptor - Dynamically determined type of bio/mtoolbox/MToolBox/data/MITOMAP_HMTDB_known_indels_RCRS.vcf to be VCF
INFO 16:14:46,321 GenomeAnalysisEngine - Strictness is SILENT
INFO 16:14:46,405 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO 16:14:46,412 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
WARNING: BAM index file bio/m201/OUT_SAMPLE102/OUT.sam.bam.bai is older than BAM bio/m201/OUT_SAMPLE102/OUT.sam.bam
INFO 16:14:46,425 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO 16:14:46,440 RMDTrackBuilder - Loading Tribble index from disk for file bio/mtoolbox/MToolBox/data/MITOMAP_HMTDB_known_indels_RCRS.vcf
INFO 16:14:47,444 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.7-4-g6f46d11):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: Input files reads and reference have incompatible contigs: No overlapping contigs found.
##### ERROR reads contigs = [chrM]
##### ERROR reference contigs = [chrRCRS]
##### ERROR ------------------------------------------------------------------------------------------
Hi there,
I found several inconsistencies in chromosome naming scheme when using the pipeline with rCRS option, and I would like to make some suggestions.
First, in
test_config.sh
, instruction for settingmtdb_fasta
incorrectly states that the parameter should be set toMT.fa
when using rCRS as reference:If
mtdb_fasta
is set toMT.fa
, it later breaks the pipeline when it gets toassembleMTgenome.py
stage because it's looking forMT.fa
file ingenome_fasta/
directory and the file does not exist. The instruction should saymtdb_fasta
should be set tochrM.fa
, reflecting the name change.Second, chromosome names must be updated from chrRCRS to chrM in all rCRS related files in
MToolBox/data/
directory (i.e.chrRCRS.fa
,chrRCRS.fa.fai
,chrRCRS.dict
,intervals_file_RCRS.list
,MITOMAP_HMTDB_known_indels_RCRS.vcf
). Otherwise, this breaks the pipeline at indel realigning stage because GATK complains about not being able to match chrM with chrRCRS as below:Best, Jessica