lgmgeo / AnnotSV

Annotation and Ranking of Structural Variation
GNU General Public License v3.0

Queries on annotations #165

Closed priyambial123 closed 1 year ago

priyambial123 commented 1 year ago

Hello,

Thank you. I found the tool to be super helpful. I have a few queries and suggestions:

The gnomAD SV database is based on 10,847 individual genomes. The study started with 14,891 individuals, and after the QC steps data were available for 10,847. The PDF manual states 14,891.

In the DECIPHER database, the allele frequency has been calculated from individuals with developmental disorders. How is the benign allele frequency reported here?

Pathogenic structural variations from dbVar are reported in the annotated output files along with their coordinates, but these coordinates do not match the ones reported in dbVar. I am trying to understand the annotations, and it would be very helpful if you could clarify these queries. For example:

From annotated output file:
dbVar:nssv15140042 2:162221545-181254890
dbVar:nssv15161685 2:163965382-182195062
dbVar:nssv15161863 2:166473076-191891647
From dbVar database:
dbVar:nssv15140042 [GRCh38.p12 chr2: 155,632,918-182,056,571]
dbVar:nssv15161685  [GRCh38.p12 chr2: 12,771-241,841,232]
dbVar:nssv15161685  [GRCh38.p12 chr2: 12,771-241,841,232]
lgmgeo commented 1 year ago

Hi,

Thank you for your interest in AnnotSV.

The gnomAD SV database is based on 10,847 individual genomes. The study started with 14,891 individuals, and after the QC steps data were available for 10,847. The PDF manual states 14,891.

This number is indeed specified in the README, but as a citation: image

The gnomAD sources used for AnnotSV are detailed in the README: image

In the DECIPHER database, the allele frequency has been calculated from individuals with developmental disorders. How is the benign allele frequency reported here?

Note the source of these data (common copy-number variants): image

Otherwise, see the "DDD benign SV annotations" section of the README: image

There were pathogenic structural variations reported in dbVar along with coordinates in the output of the annotated files. But these coordinates do not match the ones reported in dbVar (...)

Can you send me the coordinates of the SV annotated by AnnotSV with dbVar:nssv15140042? (GRCh38)

priyambial123 commented 1 year ago

These are the coordinates of the SV on chromosome 2:

SV_start: 178436319
SV_end: 178443171
priyambial123 commented 1 year ago

Can you explain why the coordinates of the pathogenic structural variations (from dbVar) in the annotated file do not match the ones reported in dbVar? Are the coordinates expanded based on some assumption?

Thank you

lgmgeo commented 1 year ago

I'm working on it; I'll get back to you ASAP.

lgmgeo commented 1 year ago

I ran your example (2:178436319-178443171 DUP) on the web server (GRCh38): https://lbgi.fr/AnnotSV/display?id=EaDX1tRg85

Explanation:

“po_P_*_*” features:
po_P_gain_phen
po_P_gain_hpo
po_P_gain_source
po_P_gain_coord
po_P_gain_percent
po_P_loss_phen
po_P_loss_hpo
po_P_loss_source
po_P_loss_coord
po_P_loss_percent

Currently, redundancy is removed from all “po_P_*_*” features (with a sort -unique command). This is essential when annotating large SVs. As a result, there is no longer any correspondence between the “po_P_*_*” features: the n-th ID in a “source” field does not necessarily go with the n-th interval in the matching “coord” field. I realized that this is actually not the best thing to do.

In a future version, redundancy will be removed only from the “po_P_*_phen” and “po_P_*_hpo” features, so AnnotSV will keep the correspondence between the “po_P_*_source”, “po_P_*_coord” and “po_P_*_percent” features.
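To illustrate why the correspondence is lost, here is a minimal Python sketch. It is not AnnotSV's actual code: the function names, the record IDs and the coordinates are all invented for illustration. It compares sorting and deduplicating each field separately with deduplicating whole (source, coord) pairs.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (source ID, coordinates)

# Toy records, invented for illustration (not real dbVar entries).
RECORDS: List[Record] = [
    ("id_B", "2:100-900"),
    ("id_A", "2:500-700"),
    ("id_C", "2:300-800"),
    ("id_A", "2:500-700"),  # redundant entry
]


def dedup_independently(records: List[Record]) -> Tuple[List[str], List[str]]:
    """Sort and deduplicate each field on its own: the pairing is lost."""
    sources = sorted({src for src, _ in records})
    coords = sorted({coord for _, coord in records})
    return sources, coords


def dedup_as_pairs(records: List[Record]) -> Tuple[List[str], List[str]]:
    """Deduplicate whole (source, coord) pairs, then split: the pairing is kept."""
    unique = sorted(set(records))
    sources = [src for src, _ in unique]
    coords = [coord for _, coord in unique]
    return sources, coords
```

With the first function, the first source (id_A) ends up next to the first coordinate string (2:100-900), which belongs to id_B in the input. With the second function, every position still holds a pair that exists in the input.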

lgmgeo commented 1 year ago
priyambial123 commented 1 year ago

Thank you, now I understand that the coordinates are not listed in the same order as the nssv IDs.

Priya