samtools / bcftools

This is the official development repository for BCFtools. Installation instructions and other documentation are available at http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/

Segmentation fault on `INFO/TAG=@file.txt` if TAG does not exist in first record #2111

Closed bryce-turner closed 6 months ago

bryce-turner commented 7 months ago

I'm running into a segmentation fault when using the new INFO/TAG=@file.txt filtering feature. I'm running bcftools v1.19, the release in which this feature became available, and have tested in a Singularity environment using the image hosted on quay.io as well as with a self-built Dockerfile/compilation.

To reproduce the issue, take the following minimal VCF:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (informative and non-informative); some reads may have been filtered based on mapq etc.">
##INFO=<ID=TAG,Number=1,Type=String,Description="This is an example of a string in info.">
##contig=<ID=chr1,length=248956422>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
chr1    11558102    .   G   GT  .   PASS    DP=61
chr1    11558105    .   G   C   .   PASS    DP=61;TAG=Example
chr1    11558108    .   A   T   .   PASS    DP=61;TAG=It

And a short file of desired strings:

$ cat strings_expected
Example
Something

The following bcftools command hits a segmentation fault:

$ bcftools view --include 'INFO/TAG=@strings_expected' example.vcf
Segmentation fault (core dumped)
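For anyone reproducing this, the following is a minimal sketch of a script that recreates the two input files shown above and runs the same command. It assumes the whitespace in the pasted VCF was originally tabs (VCF columns are tab-separated), shortens the DP description for brevity, and writes to a hypothetical output file named out.vcf. The bcftools step is skipped if bcftools is not on the PATH.

```shell
#!/bin/sh
# Recreate the reporter's inputs with real tab separators, then run the filter.
set -eu

# printf expands \t, so each record gets the eight tab-separated VCF columns.
{
  printf '##fileformat=VCFv4.2\n'
  printf '##FILTER=<ID=PASS,Description="All filters passed">\n'
  # Assumption: DP description shortened from the original report.
  printf '##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth">\n'
  printf '##INFO=<ID=TAG,Number=1,Type=String,Description="This is an example of a string in info.">\n'
  printf '##contig=<ID=chr1,length=248956422>\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t11558102\t.\tG\tGT\t.\tPASS\tDP=61\n'
  printf 'chr1\t11558105\t.\tG\tC\t.\tPASS\tDP=61;TAG=Example\n'
  printf 'chr1\t11558108\t.\tA\tT\t.\tPASS\tDP=61;TAG=It\n'
} > example.vcf

# One expected string per line.
printf 'Example\nSomething\n' > strings_expected

if command -v bcftools >/dev/null 2>&1; then
  # '|| status=$?' records a failure (such as a crash) without 'set -e'
  # aborting the script. out.vcf is a hypothetical output name.
  status=0
  bcftools view --include 'INFO/TAG=@strings_expected' example.vcf > out.vcf || status=$?
  echo "bcftools exit status: $status"
else
  echo "bcftools not found; input files written only"
fi
```

A non-zero exit status on an affected build would be consistent with the crash described here. Moving the first record after one that carries TAG should, per the report below, avoid it.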

Adding a trace via gdb yields:

$ gdb --args bcftools view --include 'INFO/TAG=@strings_expected' example.vcf
GNU gdb (Debian 10.1-1.7) 10.1.90.20210103-git
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from bcftools...
(gdb) run
Starting program: /usr/local/bin/bcftools view --include INFO/TAG=@strings_expected example.vcf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
kh_get_str2int (key=0x0, h=0x5555557b4250) at htslib-1.19/htslib/khash_str2int.h:30
30      htslib-1.19/htslib/khash_str2int.h: No such file or directory.
(gdb) bt
#0  kh_get_str2int (key=0x0, h=0x5555557b4250) at htslib-1.19/htslib/khash_str2int.h:30
#1  khash_str2int_has_key (str=0x0, _hash=0x5555557b4250) at htslib-1.19/htslib/khash_str2int.h:69
#2  filters_cmp_string_hash (atok=<optimized out>, btok=0x5555557b4280, rtok=0x5555557b4400, line=0x5555557b4bc0) at filter.c:616
#3  0x0000555555597b6b in filter_test (filter=<optimized out>, line=line@entry=0x5555557b4bc0, samples=samples@entry=0x0) at filter.c:3905
#4  0x00005555555b0b87 in subset_vcf (args=0x555555788240, line=0x5555557b4bc0) at vcfview.c:334
#5  0x00005555555b2be5 in subset_vcf (line=0x5555557b4bc0, args=0x555555788240) at vcfview.c:318
#6  main_vcfview (argc=<optimized out>, argv=<optimized out>) at vcfview.c:819
#7  0x00007ffff75f5d0a in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#8  0x00005555555690ba in _start ()

If TAG is present in the first record, the crash does not occur:

$ cat example_no_segfault.vcf
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (informative and non-informative); some reads may have been filtered based on mapq etc.">
##INFO=<ID=TAG,Number=1,Type=String,Description="This is an example of a string in info.">
##contig=<ID=chr1,length=248956422>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
chr1    11558102    .   G   GT  .   PASS    DP=50;TAG=Example
chr1    11558105    .   G   C   .   PASS    DP=50
chr1    11558108    .   A   T   .   PASS    DP=51;TAG=It
$ bcftools view --include 'INFO/TAG=@strings_expected' example_no_segfault.vcf
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (informative and non-informative); some reads may have been filtered based on mapq etc.">
##INFO=<ID=TAG,Number=1,Type=String,Description="This is an example of a string in info.">
##contig=<ID=chr1,length=248956422>
##bcftools_viewVersion=1.19+htslib-1.19
##bcftools_viewCommand=view --include INFO/TAG=@strings_expected example_no_segfault.vcf; Date=Wed Feb 28 22:21:13 2024
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
chr1    11558102        .       G       GT      .       PASS    DP=50;TAG=Example
chr1    11558105        .       G       C       .       PASS    DP=50

Notably, the output includes a variant record that lacks INFO/TAG. I believe this has been observed previously and may be expected behavior; in any case, a workaround is --include 'INFO/TAG=@strings_expected & INFO/TAG!="."'
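To make the selection I would expect from that workaround concrete, here is a rough awk approximation. It is not bcftools' filtering engine, only a sketch of the semantics "keep records whose TAG value is listed in the strings file, and treat a missing TAG as a non-match". The file names records.tsv and matched.tsv are hypothetical, and the header is reduced to the column line.

```shell
#!/bin/sh
# Approximate "TAG value is in the strings file AND TAG is present" with awk.
set -eu

printf 'Example\nSomething\n' > strings_expected

# Records from example_no_segfault.vcf, with tab separators.
{
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t11558102\t.\tG\tGT\t.\tPASS\tDP=50;TAG=Example\n'
  printf 'chr1\t11558105\t.\tG\tC\t.\tPASS\tDP=50\n'
  printf 'chr1\t11558108\t.\tA\tT\t.\tPASS\tDP=51;TAG=It\n'
} > records.tsv

awk -F'\t' '
  NR == FNR { wanted[$0] = 1; next }   # first file: load the allowed strings
  /^#/      { next }                   # skip header lines
  {
    n = split($8, kv, ";")             # INFO is column 8; entries are ";"-separated
    for (i = 1; i <= n; i++)
      if (kv[i] ~ /^TAG=/ && (substr(kv[i], 5) in wanted)) { print; break }
  }
' strings_expected records.tsv > matched.tsv

cat matched.tsv
```

With these inputs only the record at position 11558102 is written. The record at 11558105, which has no TAG, is dropped.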

I've noticed a similar segfault if the input file contains only integers. The release notes do state that INFO/TAG=@file.txt supports only strings, so the odd behavior is understandable. Let me know if a separate issue should be raised or if you'd like a minimal example for testing.

pd3 commented 6 months ago

Thank you for the bug report. I believe this is now fixed by https://github.com/samtools/bcftools/commit/1ea51123949bcdac9eb932b7627295eb88ba53bb. Please let me know if you encounter any other problems.