mkirsche / Jasmine

Jasmine: SV Merging Across Samples
MIT License
175 stars 16 forks source link

Possible incorrect merging of breakend variants #25

Closed scwatts closed 2 years ago

scwatts commented 2 years ago

Hello, thanks for creating and publishing Jasmine!

I've encountered a situation where a breakend variant in one sample does not merge with the identical variant in a second sample but instead merges with the mate/partner e.g. variant_a and variant_b represent a paired set of beakends, and are present in two VCFs/samples:

VCF 1: variant_a (chr1:200) - variant_b (chr2:5000)
VCF 2: variant_a (chr1:200) - variant_b (chr2:5000)

I have found that Jasmine can produce the resulting merged variants:

variant_a_merged (chr1:200): variant_a,variant_b
variant_b_merged (chr2:5000): variant_b,variant_a

I dont think this is intended behaviour. I've created and attached a minimal example to reproduce below.

Minimal example

Expected behaviour

first.vcf:

##fileformat=VCFv4.2
##INFO=<ID=MATEID,Number=.,Type=String,Description="ID of mate breakend">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##contig=<ID=chr1,length=100000>
##contig=<ID=chr2,length=200000>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SEQC-II_Tumor_50pc
chr1    200     chr1_200_break_first_vcf        G       [chr2:5000[G    .       PASS    MATEID=chr2_5000_break_first_vcf;SVTYPE=BND     .       .
chr2    5000    chr2_5000_break_first_vcf       A       [chr1:200[A     .       PASS    MATEID=chr1_200_break_first_vcf;SVTYPE=BND      .       .

second.vcf:

##fileformat=VCFv4.2
##INFO=<ID=MATEID,Number=.,Type=String,Description="ID of mate breakend">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##contig=<ID=chr1,length=100000>
##contig=<ID=chr2,length=200000>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SEQC-II_Tumor_50pc
chr1    200     chr1_200_break_second_vcf       G       [chr2:5000[G    .       PASS    MATEID=chr2_5000_break_second_vcf;SVTYPE=BND    .       .
chr2    5000    chr2_5000_break_second_vcf      A       [chr1:200[A     .       PASS    MATEID=chr1_200_break_second_vcf;SVTYPE=BND     .       .

input_files.txt:

first.vcf
second.vcf

Merge variants:

$ jasmine file_list=input_files.txt out_dir=./ out_file=merged_expected.vcf 1>/dev/null 2>&1
$ { 
    echo -e "CHRM\tPOS\tID\tIDLIST";
    bcftools query -f '%CHROM\t%POS\t%ID\t[%IDLIST]\n' merged_expected.vcf;
  } | column -t -s$'\t'
CHRM  POS   ID                           IDLIST
chr1  200   0_chr1_200_break_first_vcf   chr1_200_break_first_vcf,chr1_200_break_second_vcf
chr2  5000  0_chr2_5000_break_first_vcf  chr2_5000_break_first_vcf,chr2_5000_break_second_vcf

Unexpected behaviour

Repeat exactly as above but including the INFO/DEBUG_1234 field in first.vcf to trigger the unexpected behaviour.

first.vcf:

##fileformat=VCFv4.2
##INFO=<ID=MATEID,Number=.,Type=String,Description="ID of mate breakend">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=DEBUG_1234,Number=1,Type=String,Description="For debug purposes">
##contig=<ID=chr1,length=100000>
##contig=<ID=chr2,length=200000>
#CHROM  POS ID  REF ALT QUAL  FILTER  INFO  FORMAT  SEQC-II_Tumor_50pc
chr1  200 chr1_200_break_first_vcf  G [chr2:5000[G  . PASS  MATEID=chr2_5000_break_first_vcf;SVTYPE=BND;DEBUG_1234=NOVALUE  . .
chr2  5000  chr2_5000_break_first_vcf A [chr1:200[A . PASS  MATEID=chr1_200_break_first_vcf;SVTYPE=BND;DEBUG_1234=NOVALUE . .

second.vcf:

##fileformat=VCFv4.2
##INFO=<ID=MATEID,Number=.,Type=String,Description="ID of mate breakend">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##contig=<ID=chr1,length=100000>
##contig=<ID=chr2,length=200000>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SEQC-II_Tumor_50pc
chr1    200     chr1_200_break_second_vcf       G       [chr2:5000[G    .       PASS    MATEID=chr2_5000_break_second_vcf;SVTYPE=BND    .       .
chr2    5000    chr2_5000_break_second_vcf      A       [chr1:200[A     .       PASS    MATEID=chr1_200_break_second_vcf;SVTYPE=BND     .       .

input_files.txt:

first.vcf
second.vcf
$ jasmine file_list=input_files.txt out_dir=./ out_file=merged_unexpected.vcf 1>/dev/null 2>&1
$ { 
    echo -e "CHRM\tPOS\tID\tIDLIST";
    bcftools query -f '%CHROM\t%POS\t%ID\t[%IDLIST]\n' merged_unexpected.vcf;
  } | column -t -s$'\t'
CHRM  POS   ID                           IDLIST
chr2  5000  0_chr2_5000_break_first_vcf  chr2_5000_break_first_vcf,chr1_200_break_second_vcf
chr1  200   0_chr1_200_break_first_vcf   chr1_200_break_first_vcf,chr2_5000_break_second_vcf

Install info

Above was done using Jasmine 1.1.4 on macOS from conda and installed via:

conda create -p $(pwd -P)/conda_env/ -y -c bioconda -c conda-forge jasminesv
conda activate conda_env/
$ conda list
# packages in environment at /Users/stephen/projects/jasmine_variant_merging/3_debug_variants/conda_env:
#
# Name                    Version                   Build  Channel
bzip2                     1.0.8                hc929b4f_4    conda-forge
c-ares                    1.17.2               h0d85af4_0    conda-forge
ca-certificates           2021.10.8            h033912b_0    conda-forge
certifi                   2021.5.30                pypi_0    pypi
htslib                    1.13                 hc38c3fb_0    bioconda
irissv                    1.0.4                hdfd78af_2    bioconda
jasminesv                 1.1.4                hdfd78af_0    bioconda
k8                        0.2.5                h87af4ef_1    bioconda
krb5                      1.19.2               hcfbf3a7_2    conda-forge
libcurl                   7.79.1               hf45b732_1    conda-forge
libcxx                    12.0.1               habf9029_0    conda-forge
libdeflate                1.7                  h35c211d_5    conda-forge
libedit                   3.1.20210714         h9ed2024_0  
libev                     4.33                 haf1e3a3_1    conda-forge
libffi                    3.4.2                he49afe7_4    conda-forge
libnghttp2                1.43.0               h6f36284_1    conda-forge
libssh2                   1.10.0               h52ee1ee_2    conda-forge
libzlib                   1.2.11            h9173be1_1013    conda-forge
minimap2                  2.22                 h188c3c3_0    bioconda
ncurses                   6.2                  h2e338ed_4    conda-forge
openjdk                   11.0.9.1             hcf210ce_1    conda-forge
openssl                   1.1.1l               h0d85af4_0    conda-forge
pip                       21.2.4             pyhd8ed1ab_0    conda-forge
python                    3.10.0          h1248fe1_1_cpython    conda-forge
racon                     1.4.20               h87af4ef_1    bioconda
readline                  8.1                  h05e3726_0    conda-forge
samtools                  1.13                 h7596a89_0    bioconda
setuptools                58.0.4                   pypi_0    pypi
sqlite                    3.36.0               h23a322b_2    conda-forge
tk                        8.6.11               h5dbffcc_1    conda-forge
tzdata                    2021c                he74cb21_0    conda-forge
wheel                     0.37.0             pyhd8ed1ab_1    conda-forge
xz                        5.2.5                haf1e3a3_1    conda-forge
zlib                      1.2.11            h9173be1_1013    conda-forge
ohofmann commented 2 years ago

@mkirsche Does that sound familiar? We're currently in an accreditation process and this is a potential blocker; I'd love to use Jasmine for this if at all possible. Thank you!

mkirsche commented 2 years ago

Hi,

Sorry for the late response! Jasmine does not currently support maintaining breakend pairs; it assumes that each variant is represented by a single entry, representing breakend pairs with either CHR2/END INFO fields or the colon-delimited ALT field specified in the VCF file format. I am hoping to update this behavior in the next version of Jasmine, but that is at least a few weeks out. In the meantime, I would recommend merging BND pairs into single entries to avoid this issue.

Best, Melanie

ohofmann commented 2 years ago

Thanks for the update!

scwatts commented 2 years ago

Hi Melanie, I've had some time today to return to this - your suggestion has worked well, thank you!