pauline-ng / SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
GNU General Public License v3.0

Pos with confident scores < 90% #100

Open mmlian opened 4 weeks ago

mmlian commented 4 weeks ago

Hi Pauline,

Thank you for this resource for building SIFT4G databases. I'm attempting to create the human database using a recent Ensembl release (GRCh38, v112). However, the "Pos with Confident Scores" percentage is below 90% for nearly every chromosome, and 59% overall. These are the values reported in CHECK_GENES.LOG:

Chr Genes with SIFT Scores  Pos with SIFT scores    Pos with Confident Scores
1   99 (9132/9191)  100 (27011926/27017955) 60(16255845/27011926)
10  100 (3666/3684) 100 (11267510/11269164) 57(6400632/11267510)
11  99 (6405/6443)  100 (16547144/16551938) 62(10293209/16547144)
12  99 (5430/5477)  100 (14496123/14500831) 58(8410615/14496123)
13  99 (1429/1437)  100 (5072825/5073770)   56(2837065/5072825)
14  99 (3503/3534)  100 (9893613/9896637)   59(5808266/9893613)
15  99 (3169/3208)  100 (9013810/9017834)   57(5160516/9013810)
16  99 (4712/4759)  100 (11385585/11391294) 68(7698045/11385585)
17  99 (6336/6392)  100 (16183935/16189331) 59(9623114/16183935)
18  99 (1612/1636)  100 (4666645/4669378)   57(2671224/4666645)
19  100 (6647/6680) 100 (16347154/16351116) 67(10939216/16347154)
2   99 (6645/6685)  100 (22971896/22975822) 52(12033057/22971896)
20  99 (2350/2367)  100 (6197276/6198167)   61(3764631/6197276)
21  99 (932/943)    100 (2792386/2793817)   63(1768235/2792386)
22  99 (2035/2046)  100 (5501794/5504166)   62(3400959/5501794)
3   99 (6155/6190)  100 (18213229/18216986) 57(10345147/18213229)
4   100 (3845/3863) 100 (12160063/12161965) 55(6718884/12160063)
5   99 (4223/4255)  100 (11855241/11858302) 60(7125778/11855241)
6   100 (4611/4634) 100 (13653872/13656835) 62(8426507/13653872)
7   100 (4469/4483) 100 (13028052/13029448) 57(7439183/13028052)
8   99 (3580/3608)  100 (9393778/9400870)   59(5582444/9393778)
9   99 (3434/3452)  100 (11081037/11082543) 63(6945343/11081037)
GL000009.2  100 (1/1)   100 (490/490)   0(0/490)
GL000194.1  100 (2/2)   100 (1665/1665) 57(956/1665)
GL000195.1  100 (1/1)   100 (769/769)   0(0/769)
GL000205.2  0 (0/0) 0 (0/0) 0(0/0)
GL000213.1  100 (2/2)   100 (6519/6519) 74(4813/6519)
GL000216.2  0 (0/0) 0 (0/0) 0(0/0)
GL000218.1  100 (1/1)   100 (1542/1542) 0(0/1542)
GL000219.1  100 (1/1)   100 (843/843)   81(687/843)
GL000220.1  0 (0/0) 0 (0/0) 0(0/0)
GL000225.1  0 (0/0) 0 (0/0) 0(0/0)
KI270442.1  0 (0/0) 0 (0/0) 0(0/0)
KI270711.1  100 (2/2)   100 (8394/8394) 80(6736/8394)
KI270713.1  100 (2/2)   100 (2225/2225) 0(0/2225)
KI270721.1  100 (1/1)   100 (1010/1010) 100(1010/1010)
KI270726.1  100 (2/2)   100 (1382/1382) 0(0/1382)
KI270727.1  100 (4/4)   100 (10658/10658)   85(9024/10658)
KI270728.1  100 (5/5)   100 (9606/9606) 0(0/9606)
KI270731.1  100 (1/1)   100 (2748/2748) 65(1782/2748)
KI270733.1  0 (0/0) 0 (0/0) 0(0/0)
KI270734.1  100 (4/4)   100 (11308/11308)   62(7033/11308)
KI270744.1  0 (0/0) 0 (0/0) 0(0/0)
KI270750.1  0 (0/0) 0 (0/0) 0(0/0)
MT  100 (7/7)   100 (12241/12241)   18(2147/12241)
X   100 (3611/3624) 100 (10755922/10757378) 59(6342125/10755922)
Y   98 (189/193)    100 (617316/617680) 48(295900/617316)

ALL 99 (98156/98820)    100 (280179532/280254627)   59(166320128/280179532)
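
For anyone who wants to recheck these figures, here is a minimal Python sketch that parses rows in the layout shown above and recomputes the last column. It assumes each data row is a chromosome name followed by three `pct (num/den)` fields, as in the pasted log. The function names and the 90% threshold (taken from the issue title) are illustrative and not part of the SIFT4G tooling.

```python
import re

# Assumed row layout (from the pasted log): name, then three "pct (num/den)" fields.
ROW = re.compile(
    r"^(\S+)\s+"
    r"\d+\s*\((\d+)/(\d+)\)\s+"   # genes with SIFT scores
    r"\d+\s*\((\d+)/(\d+)\)\s+"   # positions with SIFT scores
    r"\d+\s*\((\d+)/(\d+)\)\s*$"  # positions with confident scores
)


def parse_check_genes(text):
    """Return (chrom, confident, scored, percent) for every data row."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue  # header or blank line
        chrom = match.group(1)
        confident = int(match.group(6))
        scored = int(match.group(7))
        percent = 100.0 * confident / scored if scored else 0.0
        rows.append((chrom, confident, scored, percent))
    return rows


def below_threshold(rows, threshold=90.0):
    """Rows with at least one scored position and a percentage under threshold."""
    return [row for row in rows if row[2] > 0 and row[3] < threshold]


# A few rows copied from the log above.
sample = """\
Chr Genes with SIFT Scores  Pos with SIFT scores    Pos with Confident Scores
1   99 (9132/9191)  100 (27011926/27017955) 60(16255845/27011926)
MT  100 (7/7)   100 (12241/12241)   18(2147/12241)
KI270721.1  100 (1/1)   100 (1010/1010) 100(1010/1010)
GL000205.2  0 (0/0) 0 (0/0) 0(0/0)
"""

rows = parse_check_genes(sample)
for chrom, confident, scored, percent in below_threshold(rows):
    print(f"{chrom}\t{percent:.1f}%\t({confident}/{scored})")
```

Contigs with no scored positions (the `0(0/0)` rows) are skipped rather than flagged, since there is nothing to evaluate for them.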

grep ">" all_prot.fasta | wc -l ##returns 98732
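 
The same count can be reproduced in Python if needed. This is only a sketch: the function name is hypothetical, and it counts lines that begin with `>`, whereas `grep ">"` also matches a `>` appearing anywhere in a line, so the two numbers could differ if any header text itself contains that character.

```python
import sys


def count_fasta_records(path):
    """Count FASTA records by counting lines that start with '>'."""
    count = 0
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.startswith(">"):
                count += 1
    return count


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python count_fasta.py all_prot.fasta
    print(count_fasta_records(sys.argv[1]))
```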

UniRef90 was used as the protein database for the database creation.

Could you confirm whether the human database was created correctly?

Secondly, I'm unable to load the SIFT4G databases to cross-check my results. Are there any issues with the website?

Thank you very much for your advice and time. :)