samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
682 stars 240 forks source link

can INDEL_WINDOW_SIZE be increased in bcftools for longer inserts? #1569

Closed jrharting closed 3 years ago

jrharting commented 3 years ago

We're looking at ~700bp inserts using PacBio HiFi data using bcftools and we noticed some large indels (>100 or so bp) missing from the callset. I think this is due to the indel window size here https://github.com/samtools/bcftools/blob/develop/bam2bcf_indel.c#L40

Is there any reason not to increase this to something like 400-500bp?

I've attached a toy example i made that models some data I can't share at the moment.

screenshot is generated using:

BAM=synthetic_reads/800_reads.mapped.bam

bcftools mpileup --open-prob 25 \
                 --gap-frac 0.01 \
                 --ext-prob 1 \
                 --min-ireads 3 \
                 --max-depth 1000 \
                 --max-idepth 5000 \
                 -h 500 \
                 -B \
                 -a FORMAT/AD \
                 -f NC_045512.2.fasta \
                 -r NC_045512.2:26438-28428 \
                 2>/dev/null \
                 $BAM

image (10) image (9) 800_reads.bam.gz reference.fasta.gz

pd3 commented 3 years ago

I just pushed a commit that adds a new --indel-size option which allows to override the default value. Some benchmarking should be done before this is made the new default with -X ont and -X pacbio-ccs.

Thanks for reporting the issue and the test case.

jrharting commented 3 years ago

Thank you very much!

zeeev commented 3 years ago

Thanks @pd3, just to double check, the changes are on develop?

pd3 commented 3 years ago

@zeeev That's correct https://github.com/samtools/bcftools/commit/4c71bfbdaf5c3f6bbea2a3c1b38f037af7665d1f

mrmrwinter commented 2 years ago

Hi, I'm trying to use bcftools stats to call stats on a vcf, and I'm trying to use the --indel-size flag to capture indels much larger than 60bp.

I'm definitely working from the develop branch but I'm getting errors when trying to use the --indel-size flag.

stats: unrecognized option '--indel-size'

git branch is develop git rev-parse HEAD gives f2694e52e2af7e60c3129b9ec82f07b5d3e8288b

Any help would be greatly appreciated.

Thanks

pd3 commented 2 years ago

This thread is about a different command, mpileup not stats. Please open a new issue / feature request so that this is not forgotten.

mrmrwinter commented 2 years ago

Hi, my fault, i realise now.

Ive ran bcftools mpileup on my sam file with the --indel-size flag, followed by bcftools stats and plot-vcfstats, but i find no indels, whereas i've previously found them in the same alignment using other variant calling methods. I do however need to use this package with the --indel-size flag as I'm interested in indels of all sizes, not just ~60 bp as the standard seems to be for bcftools stats.

My aim is to align two assembly scaffolds with lastz, then generate a vcf and call indels. From there i can determine distribution of indel sizes in the alignment.

My process was: 1, Align two assembly scaffolds with LastZ and output a sam file 2, Converted sam to bam and sorted 3, Ran bcftools mpileup --indel-size 1000 -f asm.fa lastz_alignment.sorted.bam > output.lastz.vcf 4, Ran bcftools stats on the output.lastz.vcf 5, Ran plot-vcfstats on the output of bcftools stats

I did get some errors when running bcftools stats:

[W::vcf_parse] INFO 'D' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'Q' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'MQ0' is not defined in the header, assuming Type=String
[W::vcf_parse] FILTER '*' is not defined in the header

My question is why is bcftools not detecting or reporting and indels?

Sorry if this is not the right place to ask; I will open my own ticket if needs be.