vibansal / crisp

Code for multi-sample variant calling from sequence data of pooled or unpooled DNA samples
MIT License
19 stars 8 forks source link

CRISP calling variants outside of the specified bed file #26

Closed SSmith-NUIG closed 2 years ago

SSmith-NUIG commented 2 years ago

Hi, I'm attempting to run CRISP on all chromosomes separately before concatenating into a final VCF file but it doesn't seem to be skipping over the regions not contained in the bed file.

For example, 2.bed contains this:

LG2 0   16089512

yet the crisp_2.vcf contains calls from outside of this region:

LG1 3536    .   G   A   3797    LowMQ20;LowDepth    NP=96;DP=86,81,110;VT=SNV;CT=-5.4;VP=20;VF=EMpass;AC=4388;AF=0.75512;EMst

Here is the top of the logfile:

`CRISP options: QVoffset 33 min_base_quality 20 min_mapping_score 20 max_permutations 20000 poolsize 60 CT-pvalue-thresh -3.5 QV-pvalue-thresh -5.
0

reading fasta index file apis_c_l_genome.fa.fai ... fasta file apis_c_l_genome.fa has 177 chromos
omes/contigs

read 1 intervals from bedfile 2.bed

processing 96 bamfiles: A_100.bam ..... /A_D3.bam 

poolsize for each sample is 60 
finished reading bam file indexes 
reading chromosome LG1 offset 5 # targeted bases on chrom is 0/27754200 
##CRISP_command CRISP.binary --bams full_bam_list.txt --ref apis_c_l_genome.fa --poolsize 60 --bed 2.bed --VCF crisp_2.vcf --qvoffset 33 --mbq 20 --mmq 20 --minc 5

`

Here is the SLURM script

#!/bin/sh 
#SBATCH --job-name="CRISP"
#SBATCH --output=CRISP%A_%a.log
#SBATCH --array=1-17

CRISP.binary \
--bams full_bam_list.txt \
--ref c_l_genome/apis_c_l_genome.fa \
--poolsize 60 \
--bed "$SLURM_ARRAY_TASK_ID".bed \
--VCF crisp_"$SLURM_ARRAY_TASK_ID".vcf \
--qvoffset 33 \
--mbq 20 \
--mmq 20 \
--minc 5 
SSmith-NUIG commented 2 years ago

Bed file argument doesn't seem to work so posting my solution here in case anyone runs into similar problem

Take your bed file of chromosome regions and convert it to his format:

LG1:0-27754200
LG2:0-16089512
LG3:0-13619445
LG4:0-13404451
LG5:0-13896941
LG6:0-17789102
LG7:0-14198698
LG8:0-12717210
LG9:0-12354651
LG10:0-12360052
LG11:0-16352600
LG12:0-11514234
LG13:0-11279722
LG14:0-10670842
LG15:0-9534514
LG16:0-7238532
MT:0-16343

The next step can be skipped if you want to loop over the previous file but I found it easier this way because of how SLURM arrays work: Split up the previous file into region files for each chromosome using a script like this:

i=0
while read p; do
    i=$[i + 1]
  echo "$p" > "$i"_region.txt
done < crisp_regions.txt

Then a SLURM array job can be submitted easily using this:

#!/bin/sh 
#SBATCH --job-name="CRISP"
#SBATCH --output=/logs/CRISP%A_%a.log
#SBATCH --array=1-17

while IFS= read -r line; do

CRISP.binary \
--bams full_bam_list.txt \
--ref genome.fa \
--poolsize 60 \
--regions $line \
--VCF crisp_"$SLURM_ARRAY_TASK_ID".vcf \
--qvoffset 33 \
--mbq 20 \
--mmq 20 \
--minc 5 

done < "$SLURM_ARRAY_TASK_ID"_region.txt

It's now doing each chromosome separately which should speed things up and allow easy concatenation at the end.

vibansal commented 2 years ago

Thank you for posting this issue and providing the solution as well. The --regions option is better suited for analyzing chromosomes separately. The --bed option and the --regions can be used together. The --bed option should only output variants in the intervals specified in the file, I will look in the code to see if there is an error.

vibansal commented 2 years ago

The code has been updated to fix this problem. Thank you for raising this issue.