Queries on annotations - Githubissues

priyambial123 commented 1 year ago

Hello,

Thank you. I found the tool to be super helpful. I have a few queries and suggestions:

The gnomAD SV database is built from 10,847 individual genomes. The study started with 14,891 individuals, and after the QC steps data were available for 10,847. The PDF manual gives this number as 14,891.

In the DECIPHER database, the allele frequency has been calculated from individuals with developmental disorders. How is the benign allele frequency reported here?

Pathogenic structural variations from dbVar are reported, along with their coordinates, in the annotated output files. But these coordinates don't match the ones reported in dbVar; see the example below.

I am trying to understand the annotations. It would be very helpful if you could clarify these queries.

From annotated output file:
dbVar:nssv15140042 2:162221545-181254890
dbVar:nssv15161685 2:163965382-182195062
dbVar:nssv15161863 2:166473076-191891647

From dbVar database:
dbVar:nssv15140042 [GRCh38.p12 chr2: 155,632,918-182,056,571]
dbVar:nssv15161685  [GRCh38.p12 chr2: 12,771-241,841,232]
dbVar:nssv15161685  [GRCh38.p12 chr2: 12,771-241,841,232]

lgmgeo commented 1 year ago

Hi,

Thank you for your interest in AnnotSV.

GnomAD SV database is from 10,847 individual genomes. Study was started on 14,891 individuals and after QC steps the data was available from 10,847. This is specified as 14,891 in the PDF manual

This number is indeed specified in the README, but as a citation:

The gnomAD sources used for AnnotSV are detailed in the README:

In the DECIPHER database, the allele frequency has been calculated from individuals with developmental disorders. How is the benign allele frequency reported here?

Note the source of these data (Common copy-number variants):

Otherwise, see the "DDD benign SV annotations" section of the README:

There were pathogenic structural variations reported in dbVar along with coordinates in the output of the annotated files. But these coordinates don't match as the one reported in dbVar(...)

Can you send me the coordinates of the SV annotated by AnnotSV with dbVar:nssv15140042? (GRCh38)

priyambial123 commented 1 year ago

These are the coordinates of the SV on chromosome 2:

SV_start:178436319
SV_end: 178443171

priyambial123 commented 1 year ago

Can you explain why the coordinates of the pathogenic structural variations (from dbVar) in the annotated file don't match the ones reported in dbVar? Are the coordinates expanded based on some assumption here?

Thank you

lgmgeo commented 1 year ago

I'm working on it; I'll get back to you ASAP.

lgmgeo commented 1 year ago

I ran your example (2:178436319-178443171 DUP) on the web server (GRCh38): https://lbgi.fr/AnnotSV/display?id=EaDX1tRg85

Let's have a look at the po_P_gain_source result: dbVar:nssv15140042; dbVar:nssv15161685; nssv15161863; dbVar:nssv15162217; dbVar:nssv15174359; dbVar:nssv15174602; dbVar:nssv16207855; dbVar:nssv16254741; dbVar:nssv17969793

Let's have a look at the dbVar:nssv15140042 pathogenic SV annotation distributed in AnnotSV (BED format):

grep nssv15140042 $ANNOTSV/share/AnnotSV/Annotations_Human/FtIncludedInSV/PathogenicSV/GRCh38/pathogenic_Gain_SV_GRCh38.sorted.bed
2       155632918       182056571                       dbVar:nssv15140042      2:155632918-182056571

Let's have a look at the po_P_gain_coord feature in AnnotSV (VCF format): 2:12772-241841232; 2:14239-242106609; 2:151553465-178461009; 2:155632919-182056571; 2:15673-242157305; 2:162376653-211062464; 2:168973465-214656712; 2:177533232-242065306

Explanation:

“po_P_*_*” features:
po_P_gain_phen
po_P_gain_hpo
po_P_gain_source
po_P_gain_coord
po_P_gain_percent
po_P_loss_phen
po_P_loss_hpo
po_P_loss_source
po_P_loss_coord
po_P_loss_percent

Currently, redundancy is removed from each “po_P_*_*” feature separately (with a sort -unique command). This is essential when annotating large SVs. As a result, there is no longer any positional correspondence between the “po_P_*_*” features: the first ID in po_P_gain_source does not necessarily belong to the first interval in po_P_gain_coord. I realized that this is actually not the best thing to do.

In a future version, redundancy will be removed only from the “po_P_*_phen” and “po_P_*_hpo” features, so AnnotSV will keep the correspondence between the “po_P_*_source”, “po_P_*_coord” and “po_P_*_percent” features.
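To make the effect concrete, here is a minimal Python sketch (not AnnotSV's actual code) comparing deduplication of each column on its own with deduplication of whole (source, coord) pairs. The two pairs are the ones that can be matched from the values quoted in this thread; the function names are hypothetical.

```python
# Illustration only: AnnotSV does not use this code.
# (source, coord) pairs as they can be matched from this thread.
pairs = [
    ("dbVar:nssv15140042", "2:155632919-182056571"),
    ("dbVar:nssv15161685", "2:12772-241841232"),
]


def dedupe_independently(pairs):
    """Sort and deduplicate each column on its own (what a per-feature
    'sort -unique' amounts to). The positional link between columns is lost."""
    sources = sorted({source for source, _ in pairs})
    coords = sorted({coord for _, coord in pairs})
    return sources, coords


def dedupe_jointly(pairs):
    """Sort and deduplicate whole pairs, so each source keeps its own coord."""
    return sorted(set(pairs))


sources, coords = dedupe_independently(pairs)
misaligned = list(zip(sources, coords))  # reading the two columns side by side
aligned = dedupe_jointly(pairs)

for source, coord in misaligned:
    print("independent:", source, coord)
for source, coord in aligned:
    print("joint:      ", source, coord)
```

Read side by side after independent sorting, nssv15140042 ends up next to 2:12772-241841232, which is not its interval; the joint version keeps each ID with its own coordinates.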

lgmgeo commented 1 year ago

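Note also that in a BED file, the first base on the chromosome is numbered 0 (zero-indexed), whereas in a VCF file, the first base on the chromosome is numbered 1 (one-indexed).
This is why the start 155632918 in the BED file appears as 155632919 in po_P_gain_coord.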
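As a small sketch of that conversion (the helper names are hypothetical, not part of AnnotSV), a BED interval with a 0-based start can be turned into a 1-based "chrom:start-end" string by adding 1 to the start and leaving the end unchanged:

```python
from typing import Tuple


def bed_to_one_based(chrom: str, start0: int, end: int) -> str:
    """Convert a BED interval (0-based start, end unchanged) to a
    1-based inclusive 'chrom:start-end' string."""
    return f"{chrom}:{start0 + 1}-{end}"


def one_based_to_bed(region: str) -> Tuple[str, int, int]:
    """Inverse conversion: parse 'chrom:start-end' (1-based) into BED fields."""
    chrom, span = region.split(":")
    start1, end = (int(value) for value in span.split("-"))
    return chrom, start1 - 1, end


# BED line for dbVar:nssv15140042 quoted above: 2  155632918  182056571
print(bed_to_one_based("2", 155632918, 182056571))  # 2:155632919-182056571
```

The result matches the 2:155632919-182056571 entry listed in po_P_gain_coord.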

priyambial123 commented 1 year ago

Thank you, now I understand that the coordinates are not in the same order as the nssvID

Priya

lgmgeo / AnnotSV

Queries on annotations #165
