Closed jhkbg closed 8 years ago
Sounds like a problem in one of the parsing steps. I'll try to take a closer look.
The problem seems to occur in https://github.com/rockt/SETH/blob/master/src/main/java/de/hu/berlin/wbi/stuff/xml/ParseXMLToFile.java#L93-L111 when it parses the following XML fragment from ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/XML/ds_ch19.xml.gz
`
`
Bug should be fixed by: https://github.com/rockt/SETH/commit/afd6670faee98bb81fed7a97ceec387cc44e199c
This still requires better testing.
I hope I can explain this alright:
Assume we have two genes that both overlap the position of a SNP. In the example below, the SNP is rs893184; and the two genes are A1BG (Entrez ID = 1) and A1BG-AS1 (Entrez = 503538).
Only one of the genes is coding for a protein (A1BG) and the other one is non-coding (A1BG-AS1).
dbSNP will contain a list of c/g/p variants, including c.155A>G and p.His52Arg for the above SNP rs893184. When we generate the HGVS table for SETH, we are currently assigning these two c/p entries to both genes, 1 and 503538. However, they pertain only to gene ID 1.
Gene ID 503538 does not code for the transcript NM_130786.3 and the protein NP_570602.2; therefore, 503538 should have only the g. variant and (I guess) the ncRNA variant for NR_015380.1.
ij> select * from HGVS where snp_id=893184;
LOCUS_ID |SNP_ID |HGVS          |REFSEQ
1        |893184 |c.155A>G      |NM_130786.3
1        |893184 |g.58864479T>C |NC_000019.9
1        |893184 |n.1075+69T>C  |NR_015380.1
1        |893184 |p.His52Arg    |NP_570602.2
503538   |893184 |c.155A>G      |NM_130786.3   (incorrect)
503538   |893184 |g.58864479T>C |NC_000019.9
503538   |893184 |n.1075+69T>C  |NR_015380.1   (not totally sure about this one)
503538   |893184 |p.His52Arg    |NP_570602.2   (incorrect)
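To make the proposed rule concrete, here is a minimal Java sketch of the filtering step. It is not SETH's actual code: the class `HgvsGeneFilter`, the `HgvsRow` type, and the gene-to-product map are all hypothetical names introduced for illustration, and the map is hard-coded from the example above, whereas a real fix would have to derive it from dbSNP or Entrez Gene. The sketch keeps a c. or p. row (RefSeq accession starting with `NM_` or `NP_`) only if that accession is a known product of the row's gene. It leaves g. and n. rows untouched, since I am not sure how the ncRNA entry should be handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of SETH.
class HgvsGeneFilter {

    /** One row of the HGVS table; fields mirror the columns in the ij output. */
    static final class HgvsRow {
        final int locusId;
        final int snpId;
        final String hgvs;
        final String refSeq;

        HgvsRow(int locusId, int snpId, String hgvs, String refSeq) {
            this.locusId = locusId;
            this.snpId = snpId;
            this.hgvs = hgvs;
            this.refSeq = refSeq;
        }

        @Override
        public String toString() {
            return locusId + " |" + snpId + " |" + hgvs + " |" + refSeq;
        }
    }

    /** True for mRNA (NM_) and protein (NP_) accessions, i.e. the c. and p. rows. */
    static boolean isCodingProduct(String refSeq) {
        return refSeq.startsWith("NM_") || refSeq.startsWith("NP_");
    }

    /**
     * Drops c./p. rows whose RefSeq accession is not a product of the row's gene.
     * All other rows (g., n.) are passed through unchanged.
     */
    static List<HgvsRow> filter(List<HgvsRow> rows, Map<Integer, Set<String>> productsByGene) {
        List<HgvsRow> kept = new ArrayList<>();
        for (HgvsRow row : rows) {
            if (!isCodingProduct(row.refSeq)) {
                kept.add(row);
                continue;
            }
            Set<String> products =
                    productsByGene.getOrDefault(row.locusId, Collections.<String>emptySet());
            if (products.contains(row.refSeq)) {
                kept.add(row);
            }
        }
        return kept;
    }

    /** The eight rows currently returned for rs893184. */
    static List<HgvsRow> exampleRows() {
        List<HgvsRow> rows = new ArrayList<>();
        for (int gene : new int[] {1, 503538}) {
            rows.add(new HgvsRow(gene, 893184, "c.155A>G", "NM_130786.3"));
            rows.add(new HgvsRow(gene, 893184, "g.58864479T>C", "NC_000019.9"));
            rows.add(new HgvsRow(gene, 893184, "n.1075+69T>C", "NR_015380.1"));
            rows.add(new HgvsRow(gene, 893184, "p.His52Arg", "NP_570602.2"));
        }
        return rows;
    }

    /** Assumed mapping: only A1BG (gene 1) has the mRNA and protein products. */
    static Map<Integer, Set<String>> exampleProducts() {
        Map<Integer, Set<String>> products = new HashMap<>();
        products.put(1, new HashSet<>(Arrays.asList("NM_130786.3", "NP_570602.2")));
        return products;
    }

    public static void main(String[] args) {
        for (HgvsRow row : filter(exampleRows(), exampleProducts())) {
            System.out.println(row);
        }
    }
}
```

With the example data, the filter removes exactly the two rows marked as incorrect (c.155A>G and p.His52Arg for gene 503538) and keeps the other six.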