nf-core / raredisease

Call and score variants from WGS/WES of rare disease patients.
https://nf-co.re/raredisease
MIT License
90 stars 34 forks

gnomAD v4.1 SV database contains mix of SV and CNV - svdb problem #615

Open fa2k opened 2 months ago

fa2k commented 2 months ago

Description of the bug

gnomAD SV v4.1 (https://gnomad.broadinstitute.org/news/2023-11-v4-structural-variants/) contains some CNVs that have no AC or AF information in the VCF (gnomad.v4.1.sv.sites.vcf.gz).

svdb --query refuses to annotate the VCF if the tags passed to --in_occ or --in_frq (in this case AC and AF) are missing for some variants in the database, and it produces an output VCF without any variants.
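To see how many database records are affected before running the query, something like the following could be used. It is only a sketch: the function name `count_records_missing_tags` is made up for illustration, and it assumes the affected records simply lack the `AC=` / `AF=` keys in the INFO column (column 8), rather than carrying them with a missing value.

```shell
#!/bin/bash
# Hypothetical helper (not part of svdb or the pipeline).
# Reads an uncompressed VCF on stdin and reports how many records
# have no AC= or no AF= key in the INFO column.
count_records_missing_tags() {
    awk -F'\t' '
        /^#/ { next }                                   # skip header lines
        { total++ }                                     # every remaining line is a record
        $8 !~ /(^|;)AC=/ || $8 !~ /(^|;)AF=/ { missing++ }
        END { printf "%d of %d records lack AC or AF\n", missing, total }
    '
}
```

For the gnomAD file this would be run as `zcat gnomad.v4.1.sv.sites.vcf.gz | count_records_missing_tags`.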

It works if I remove the lines without AC / AF from the gnomAD file, but that means we discard the gnomAD information for those CNVs. Would it make sense to somehow integrate this CNV information into the annotation of SVs instead?
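For reference, the workaround can be sketched roughly like this. The function name `keep_records_with_ac_af` is hypothetical, and the filter assumes that records without allele counts have no `AC=` / `AF=` key at all in the INFO column; it is not necessarily the exact command used here.

```shell
#!/bin/bash
# Hypothetical helper (not part of svdb or the pipeline).
# Reads an uncompressed VCF on stdin and writes to stdout the header
# plus only those records whose INFO column has both AC= and AF=.
keep_records_with_ac_af() {
    awk -F'\t' '
        /^#/ { print; next }                            # keep all header lines
        $8 ~ /(^|;)AC=/ && $8 ~ /(^|;)AF=/ { print }    # keep records with both tags
    '
}
```

Usage would be along the lines of `zcat gnomad.v4.1.sv.sites.vcf.gz | keep_records_with_ac_af | bgzip > filtered.vcf.gz`, where `filtered.vcf.gz` is just a placeholder output name. Anchoring the match on `;` or the start of the field avoids matching other keys that merely end in `AC` or `AF`.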

Command used and terminal output

NFCORE_RAREDISEASE:RAREDISEASE:ANNOTATE_STRUCTURAL_VARIANTS:SVDB_QUERY_DB command.sh:

#!/bin/bash -euo pipefail
svdb \
    --merge \
    --pass_only --same_order \
    --priority tiddit,manta,cnvnator \
    --vcf  NA12878_tiddit.vcf.gz:tiddit NA12878_manta.diploid_sv.vcf.gz:manta NA12878_cnvnator.vcf.gz:cnvnator \
    > NA12878_sv.vcf
bgzip NA12878_sv.vcf

cat <<-END_VERSIONS > versions.yml
"NFCORE_RAREDISEASE:RAREDISEASE:CALL_STRUCTURAL_VARIANTS:SVDB_MERGE":
    svdb: $( echo $(svdb) | head -1 | sed 's/usage: SVDB-\([0-9]\.[0-9]\.[0-9]\).*/\1/' )
    samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
END_VERSIONS

--------

Output (command.log): 

INFO:    Environment variable SINGULARITYENV_NXF_TASK_WORKDIR is set, but APPTAINERENV_NXF_TASK_WORKDIR is preferred
INFO:    Environment variable SINGULARITYENV_NXF_DEBUG is set, but APPTAINERENV_NXF_DEBUG is preferred
Error: frequency or hit tag not found! Make sure to set the --in_occ AND --in_frq to the number and frequency of alleles/individuals as presented in the INFO column of the input db

database variants not having the --in_occ or --in_frq tag must be removed
you may also skip these parameters and cluster based on the GT entry of the format column (if such exists)

Relevant files

No response

System information

No response

ramprasadn commented 1 month ago

I have submitted an issue in the SVDB repository @fa2k. You can find it at this link: https://github.com/J35P312/SVDB/issues/74. Let's continue the discussion there.