peflanag commented 4 years ago

Hi, I got the following error;

ERROR MESSAGE: SAM/BAM/CRAM file htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter@e584b2b appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 66. Please see https://software.broadinstitute.org/gatk/documentation/article?id=6470 for more details and options related to this error.

I went to the site and it says;

If this happens to you, you'll need to run again with the flag [ --fix_misencoded_quality_scores / -fixMisencodedQuals]. What will happen is that the engine will simply subtract 31 from every quality score as it is read in, and proceed with the corrected values. Output files will include the correct scores where applicable.

I'm just wondering, in my sbatch script can I just throw the commands in like this;

!/bin/sh

SLURM Commands

SBATCH --partition=ProdQ

SBATCH --nodes=1

SBATCH --time=24:00:00

SBATCH --job-name=C10

SBATCH --account=ndlif075c

SBATCH --output=TBfull_Log.txt

SBATCH --mail-user=xxxxxxxxxxxxxxxx

SBATCH --mail-type=BEGIN,END

cd $SLURM_SUBMIT_DIR

load the environment module

module load conda/2

load the conda environment

source activate MTBseq

BASH Commands

MTBseq --step TBfull --distance 5 ----fix_misencoded_quality_scores -fixMisencodedQuals --threads 8

Cheers,

P

cutpatel commented 4 years ago

No you cannot just add it to the command. You would need to open the respective script and add the command there to the gatk command. As I do not know at which step exactly you got the error I woud guess: TBrefine.pm line 81 add -fixMisencodedQuals

But first I would check why this happens. Do you really use old illumina data with the ASCII 64 encoding? If not something else is wrong here. Maybe you can check the respective fastqs with fastqc to see.

peflanag commented 4 years ago

Hi,

A handful of the files are old Illumina reads that I want to add to a current cluster of samples. I have attached the text log file below.

TBfull_Log.txt

cutpatel commented 4 years ago

I still recommend you check your fastqs with the program "fastqc" and you will see which encoding is used. You should then change the encoding of the fastq files to phred33. See: https://www.biostars.org/p/166062/ Afterwards you can run MTBseq as you do usually.

Or you manually add that command, as I showed you above, run your samples and change it back. But starting with a proper encoding is the better option.

peflanag commented 4 years ago

Perfect, thanks so much. I will try that!

peflanag commented 4 years ago

Turns out the encoding is 1.5 but when I try to change with the tools on the link you supplied terminal keeps crashing or finishing as shown in the screen shot below. So I have decided to go to the TBrefine.pm to add the command at line 81 but where in the line do I add it? I have tried a few locations but it doesn't seem to work. Anywhere there is an X I have added the command and tried. If it cant be done its fine, its only 14 samples out of over 300 so I can just exclude them since there where from a few years ago!

$commandline = X "$GATK_call X --analysis_type RealignerTargetCreator X --reference_sequence $VAR_dir/$ref --input_file $BAM_OUT/$fullID.bam --downsample_to_coverage 10000 --num_threads $threads --out $GATK_OUT/$fullID.gatk.intervals 2>> $GATK_OUT/$logfile" X;

Cheers

Screenshot 2020-05-06 at 16 18 40 (2)

TaKohl commented 4 years ago

Thanks for the feedback. In our experience, the GATK toolkit is very sensitive to malformed data not fully confirming to specifications. From your experience of other tools failing, I would guess that the fastq files are not only encoded to Illumina specs 1.5, but there is likely also some other problem with them. For confirmation, you could check their format with a validation tool: https://genome.sph.umich.edu/wiki/FastQValidator Maybe try some tools and hope to find a program that will accept these files as input and is able to return valid fastq files. Or try contacting Illumina support, maybe they can offer a solution?

best wishes, Thomas

peflanag commented 4 years ago

Cheers Thomas. I might contact illumina and see. However, Int he grand scheme of things I might just have to accept I can’t add them to part of my analysis which will be ok I reckon.

ngs-fzb / MTBseq_source

Gatk High Quality Score #52

!/bin/sh

SLURM Commands

SBATCH --partition=ProdQ

SBATCH --nodes=1

SBATCH --time=24:00:00

SBATCH --job-name=C10

SBATCH --account=ndlif075c

SBATCH --output=TBfull_Log.txt

SBATCH --mail-user=xxxxxxxxxxxxxxxx

SBATCH --mail-type=BEGIN,END

load the environment module

load the conda environment

BASH Commands