rockt / SETH

SNP Extraction Tool for Human Variations
rockt.github.com/SETH

HGVS c/p copied to invalid transcripts for overlapping genes #12

Closed jhkbg closed 8 years ago

jhkbg commented 8 years ago

I hope I can explain this alright:

Assume we have two genes that both overlap the position of a SNP. In the example below, the SNP is rs893184, and the two genes are A1BG (Entrez ID = 1) and A1BG-AS1 (Entrez ID = 503538).

Only one of the genes is coding for a protein (A1BG) and the other one is non-coding (A1BG-AS1).

dbSNP will contain a list of c/g/p variants, including c.155A>G and p.His52Arg for the above SNP rs893184. When we generate the HGVS table for SETH, we are currently assigning these two c/p entries to both genes, 1 and 503538. However, they pertain only to gene ID 1.

Gene ID 503538 does not code for the transcript NM_130786.3 and the protein NP_570602.2; therefore, 503538 should have only the g. variant and (I guess) the ncRNA variant for NR_015380.1.

    ij> select * from HGVS where snp_id=893184;
    LOCUS_ID |SNP_ID |HGVS          |REFSEQ
    1        |893184 |c.155A>G      |NM_130786.3
    1        |893184 |g.58864479T>C |NC_000019.9
    1        |893184 |n.1075+69T>C  |NR_015380.1
    1        |893184 |p.His52Arg    |NP_570602.2
    503538   |893184 |c.155A>G      |NM_130786.3   <-- incorrect
    503538   |893184 |g.58864479T>C |NC_000019.9
    503538   |893184 |n.1075+69T>C  |NR_015380.1   <-- not totally sure about this one
    503538   |893184 |p.His52Arg    |NP_570602.2   <-- incorrect
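To make the expected behaviour concrete, here is a minimal Java sketch of the filtering rule described above. It is not SETH's code: the class `HgvsGeneFilter`, the method `dropForeignTranscripts`, and the accession-to-gene map are hypothetical names introduced for illustration only. The map encodes just what this report asserts, namely that NM_130786.3 and NP_570602.2 belong to gene ID 1. Accessions without a known owner, such as the genomic NC_000019.9 and the NR_015380.1 entry I am unsure about, are left untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; none of these names exist in SETH. */
final class HgvsGeneFilter {

    /** One row of the HGVS table: LOCUS_ID, SNP_ID, HGVS, REFSEQ. */
    static final class Row {
        final int locusId;
        final int snpId;
        final String hgvs;
        final String refseq;

        Row(int locusId, int snpId, String hgvs, String refseq) {
            this.locusId = locusId;
            this.snpId = snpId;
            this.hgvs = hgvs;
            this.refseq = refseq;
        }
    }

    /**
     * Keeps a row unless its RefSeq accession is known to belong to a
     * different gene. Accessions missing from the map are kept as-is.
     */
    static List<Row> dropForeignTranscripts(List<Row> rows,
                                            Map<String, Integer> accessionToGene) {
        List<Row> kept = new ArrayList<>();
        for (Row row : rows) {
            Integer owner = accessionToGene.get(row.refseq);
            if (owner == null || owner == row.locusId) {
                kept.add(row);
            }
        }
        return kept;
    }

    /** The eight rows currently in the HGVS table for rs893184. */
    static List<Row> exampleRows() {
        return Arrays.asList(
            new Row(1, 893184, "c.155A>G", "NM_130786.3"),
            new Row(1, 893184, "g.58864479T>C", "NC_000019.9"),
            new Row(1, 893184, "n.1075+69T>C", "NR_015380.1"),
            new Row(1, 893184, "p.His52Arg", "NP_570602.2"),
            new Row(503538, 893184, "c.155A>G", "NM_130786.3"),
            new Row(503538, 893184, "g.58864479T>C", "NC_000019.9"),
            new Row(503538, 893184, "n.1075+69T>C", "NR_015380.1"),
            new Row(503538, 893184, "p.His52Arg", "NP_570602.2"));
    }

    /** Assumed ownership: only the two accessions this report is sure about. */
    static Map<String, Integer> exampleOwners() {
        Map<String, Integer> owners = new HashMap<>();
        owners.put("NM_130786.3", 1);
        owners.put("NP_570602.2", 1);
        return owners;
    }

    public static void main(String[] args) {
        for (Row row : dropForeignTranscripts(exampleRows(), exampleOwners())) {
            System.out.println(row.locusId + " |" + row.snpId + " |"
                + row.hgvs + " |" + row.refseq);
        }
    }
}
```

With the eight rows from the table, this drops only the two rows marked as incorrect (the c. and p. entries under 503538) and keeps the other six. Where the ownership information would really come from in the dbSNP parsing step is left open here.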

Erechtheus commented 8 years ago

This sounds like a problem in one of the parsing steps. I'll try to take a closer look.

Erechtheus commented 8 years ago

The problem seems to occur in https://github.com/rockt/SETH/blob/master/src/main/java/de/hu/berlin/wbi/stuff/xml/ParseXMLToFile.java#L93-L111, while parsing the following XML fragment from ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/XML/ds_ch19.xml.gz:

`

`

Erechtheus commented 8 years ago

Bug should be fixed by: https://github.com/rockt/SETH/commit/afd6670faee98bb81fed7a97ceec387cc44e199c

This still requires better testing.