Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
GNU General Public License v3.0
'Pos with Confident Scores' is low in "CHECK_GENES.LOG" #65
your organism's proteome doesn't have many homologues
Point 2 can have several causes:
Were liberal settings used when predicting proteins in this genome? If so, a large fraction of the predicted proteins may not be real and won't have homologues, so SIFT can't make predictions for them.
Is your organism weird and distinct from other organisms that have been sequenced? If so, there won't be many homologous proteins in your protein database, and SIFT won't be able to make predictions.
Hi Pauline,

This is Pengfei. I constructed a new database and then checked it. In "CHECK_GENES.LOG" I got "ALL 100 (37304/37410) 100 (148558782/148683715) 15(22342363/148558782)". You said, "Your database is done if the percentages are high for the last 3 different columns." But 'Pos with Confident Scores' is low:

| Chr | Genes with SIFT Scores | Pos with SIFT scores | Pos with Confident Scores |
|---|---|---|---|
| 1 | 99 (1536/1544) | 100 (6106447/6111160) | 17(1010023/6106447) |
| 10 | 100 (1583/1586) | 100 (6491552/6493442) | 12(793428/6491552) |
| 11 | 100 (1656/1661) | 100 (6918920/6922883) | 14(994782/6918920) |
| 12 | 100 (621/623) | 100 (2720527/2721496) | 11(308259/2720527) |
| 13 | 100 (1279/1284) | 100 (4639379/4642032) | 13(597783/4639379) |
| 14 | 100 (786/787) | 100 (3394346/3395503) | 15(516357/3394346) |
| 15 | 100 (1533/1537) | 100 (5208466/5212123) | 17(868543/5208466) |
| 16 | 100 (1186/1188) | 100 (5177871/5178717) | 19(1002244/5177871) |
| 17 | 100 (1078/1081) | 100 (4690593/4693044) | 10(450682/4690593) |
| 18 | 100 (1952/1958) | 100 (6976052/6981044) | 25(1732116/6976052) |
| 19 | 100 (2076/2076) | 100 (8342833/8342833) | 14(1182843/8342833) |
| 2 | 100 (1593/1598) | 100 (7236744/7242000) | 9(664736/7236744) |
| 20 | 100 (569/570) | 100 (2484254/2485544) | 12(307858/2484254) |
| 21 | 100 (892/896) | 99 (3738173/3766981) | 15(556627/3738173) |
| 22 | 100 (1027/1032) | 100 (4520633/4525211) | 10(456486/4520633) |
| 23 | 100 (1152/1153) | 100 (3866972/3868343) | 18(709417/3866972) |
| 24 | 100 (554/554) | 100 (2515278/2515278) | 10(251265/2515278) |
| 25 | 100 (1182/1183) | 100 (4439937/4440745) | 17(755517/4439937) |
| 26 | 100 (716/719) | 100 (2842268/2844173) | 10(293528/2842268) |
| 27 | 100 (421/422) | 100 (1802775/1803409) | 17(301302/1802775) |
| 28 | 99 (528/531) | 100 (2425027/2426437) | 12(286890/2425027) |
| 29 | 100 (1043/1048) | 100 (3824056/3834534) | 17(644781/3824056) |
| 3 | 100 (2121/2127) | 100 (7798555/7805777) | 17(1320641/7798555) |
| 4 | 100 (1288/1292) | 100 (5537040/5541068) | 13(716239/5537040) |
| 5 | 100 (2124/2127) | 100 (8413279/8414984) | 18(1488746/8413279) |
| 6 | 100 (1027/1027) | 100 (4114550/4114550) | 13(552589/4114550) |
| 7 | 100 (2062/2065) | 100 (7961434/7965089) | 17(1380150/7961434) |
| 8 | 99 (1186/1195) | 100 (4956491/4963221) | 12(595398/4956491) |
| 9 | 100 (881/882) | 100 (3810505/3811393) | 11(436621/3810505) |
| MT | 0 (0/8) | 0 (0/13964) | 0(0/0) |
| NKLS02000056.1 | 100 (2/2) | 100 (1997/1997) | 97(1939/1997) |
| NKLS02000065.1 | 100 (1/1) | 100 (483/483) | 0(0/483) |
| NKLS02000090.1 | 100 (1/1) | 100 (3012/3012) | 0(0/3012) |
| NKLS02000101.1 | 100 (1/1) | 100 (2267/2267) | 100(2267/2267) |
| NKLS02000125.1 | 0 (0/0) | 0 (0/0) | 0(0/0) |
| NKLS02000137.1 | 0 (0/0) | 0 (0/0) | 0(0/0) |
| NKLS02000210.1 | 0 (0/0) | 0 (0/0) | 0(0/0) |
| NKLS02000218.1 | 100 (1/1) | 100 (857/857) | 0(0/857) |
| ... | ... | ... | ... |
| ALL | 100 (37304/37410) | 100 (148558782/148683715) | 15(22342363/148558782) |
Then I found your reply in this issue, https://github.com/pauline-ng/SIFT4G_Create_Genomic_DB/issues/43#issuecomment-762525915, where you say, "Your numbers look good to me. 'Genes with SIFT scores' and 'Pos with SIFT Scores' are > 95%." So can the database I constructed be used? I look forward to your reply. Thanks very much!
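For anyone who wants to recompute these percentages from the raw counts, here is a minimal Python sketch. It assumes each row of "CHECK_GENES.LOG" has the layout seen in the excerpt above (chromosome name, then three `pct (n/total)` groups). The names `parse_row` and `summarize` are hypothetical helpers and are not part of the SIFT4G tooling. The 95% threshold is taken from the reply quoted above and is applied only to the first two columns.

```python
import re

# Assumed row layout, inferred from the log excerpt in this thread:
#   <chr> <pct> (<n>/<total>) <pct> (<n>/<total>) <pct>(<n>/<total>)
ROW = re.compile(
    r"^(\S+)"
    r"\s+\d+\s*\((\d+)/(\d+)\)"   # Genes with SIFT Scores
    r"\s+\d+\s*\((\d+)/(\d+)\)"   # Pos with SIFT scores
    r"\s+\d+\s*\((\d+)/(\d+)\)$"  # Pos with Confident Scores
)


def pct(num, den):
    """Percentage num/den, or 0.0 when the denominator is zero."""
    return 100.0 * num / den if den else 0.0


def parse_row(line):
    """Parse one log row into a dict; return None for non-data lines."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    g_n, g_t, p_n, p_t, c_n, c_t = map(int, m.groups()[1:])
    return {
        "chr": m.group(1),
        "genes_total": g_t,
        "genes_pct": pct(g_n, g_t),
        "pos_pct": pct(p_n, p_t),
        "confident_pct": pct(c_n, c_t),
    }


def summarize(lines, threshold=95.0):
    """Return parsed rows plus names of chromosomes below the threshold.

    Only the first two columns are checked. Rows with no genes at all
    and the ALL summary row are not flagged.
    """
    rows = [r for r in map(parse_row, lines) if r is not None]
    low = [
        r["chr"]
        for r in rows
        if r["chr"] != "ALL"
        and r["genes_total"] > 0
        and (r["genes_pct"] < threshold or r["pos_pct"] < threshold)
    ]
    return rows, low


# Three rows copied from the log excerpt in this thread.
sample = [
    "ALL 100 (37304/37410) 100 (148558782/148683715) 15(22342363/148558782)",
    "MT 0 (0/8) 0 (0/13964) 0(0/0)",
    "18 100 (1952/1958) 100 (6976052/6981044) 25(1732116/6976052)",
]
rows, low = summarize(sample)
for r in rows:
    print(f"{r['chr']}\t{r['genes_pct']:.1f}\t{r['pos_pct']:.1f}\t{r['confident_pct']:.1f}")
print("below threshold:", low)
```

To run it on a full log, pass the lines of the file instead of `sample`, for example `summarize(open("CHECK_GENES.LOG"))`. Header lines and any rows that do not match the assumed layout are skipped.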