'Pos with Confident Scores' is low in "CHECK_GENES.LOG"

wpf95 commented 2 years ago

Hi Pauline, this is Pengfei. I constructed a new database and then checked it. In "CHECK_GENES.LOG" I got "ALL 100 (37304/37410) 100 (148558782/148683715) 15(22342363/148558782)". You said, "Your database is done if the percentages are high for the last 3 different columns." But 'Pos with Confident Scores' is low.

Chr             Genes with SIFT Scores  Pos with SIFT scores       Pos with Confident Scores
1               99 (1536/1544)          100 (6106447/6111160)      17(1010023/6106447)
10              100 (1583/1586)         100 (6491552/6493442)      12(793428/6491552)
11              100 (1656/1661)         100 (6918920/6922883)      14(994782/6918920)
12              100 (621/623)           100 (2720527/2721496)      11(308259/2720527)
13              100 (1279/1284)         100 (4639379/4642032)      13(597783/4639379)
14              100 (786/787)           100 (3394346/3395503)      15(516357/3394346)
15              100 (1533/1537)         100 (5208466/5212123)      17(868543/5208466)
16              100 (1186/1188)         100 (5177871/5178717)      19(1002244/5177871)
17              100 (1078/1081)         100 (4690593/4693044)      10(450682/4690593)
18              100 (1952/1958)         100 (6976052/6981044)      25(1732116/6976052)
19              100 (2076/2076)         100 (8342833/8342833)      14(1182843/8342833)
2               100 (1593/1598)         100 (7236744/7242000)      9(664736/7236744)
20              100 (569/570)           100 (2484254/2485544)      12(307858/2484254)
21              100 (892/896)           99 (3738173/3766981)       15(556627/3738173)
22              100 (1027/1032)         100 (4520633/4525211)      10(456486/4520633)
23              100 (1152/1153)         100 (3866972/3868343)      18(709417/3866972)
24              100 (554/554)           100 (2515278/2515278)      10(251265/2515278)
25              100 (1182/1183)         100 (4439937/4440745)      17(755517/4439937)
26              100 (716/719)           100 (2842268/2844173)      10(293528/2842268)
27              100 (421/422)           100 (1802775/1803409)      17(301302/1802775)
28              99 (528/531)            100 (2425027/2426437)      12(286890/2425027)
29              100 (1043/1048)         100 (3824056/3834534)      17(644781/3824056)
3               100 (2121/2127)         100 (7798555/7805777)      17(1320641/7798555)
4               100 (1288/1292)         100 (5537040/5541068)      13(716239/5537040)
5               100 (2124/2127)         100 (8413279/8414984)      18(1488746/8413279)
6               100 (1027/1027)         100 (4114550/4114550)      13(552589/4114550)
7               100 (2062/2065)         100 (7961434/7965089)      17(1380150/7961434)
8               99 (1186/1195)          100 (4956491/4963221)      12(595398/4956491)
9               100 (881/882)           100 (3810505/3811393)      11(436621/3810505)
MT              0 (0/8)                 0 (0/13964)                0(0/0)
NKLS02000056.1  100 (2/2)               100 (1997/1997)            97(1939/1997)
NKLS02000065.1  100 (1/1)               100 (483/483)              0(0/483)
NKLS02000090.1  100 (1/1)               100 (3012/3012)            0(0/3012)
NKLS02000101.1  100 (1/1)               100 (2267/2267)            100(2267/2267)
NKLS02000125.1  0 (0/0)                 0 (0/0)                    0(0/0)
NKLS02000137.1  0 (0/0)                 0 (0/0)                    0(0/0)
NKLS02000210.1  0 (0/0)                 0 (0/0)                    0(0/0)
NKLS02000218.1  100 (1/1)               100 (857/857)              0(0/857)
. . .
ALL             100 (37304/37410)       100 (148558782/148683715)  15(22342363/148558782)

Then I found your reply in this issue, https://github.com/pauline-ng/SIFT4G_Create_Genomic_DB/issues/43#issuecomment-762525915, where you say, "your numbers look good to me. "Genes with SIFT scores" and "Pos with SIFT Scores" are > 95%." So can the database I constructed be used? I look forward to your reply. Thanks very much!

pauline-ng commented 2 years ago

Hi Peng Fei,

The log tells me:

1. Your database is being constructed correctly.
2. Your organism's proteome doesn't have many homologues.

Point 2 can have several causes:

- When predicting proteins in this genome, were liberal settings used? If so, a large fraction of the predicted proteins may not be real and won't have homologues, so SIFT can't make predictions for them.
- Is your organism unusual and distinct from other organisms that have been sequenced? If so, there won't be many homologous proteins in your protein database, and SIFT won't be able to make predictions.
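
For anyone who wants to check these columns without reading the log by eye, here is a minimal Python sketch. It assumes each row looks like the lines pasted in this thread (a chromosome name followed by three "percent (count/total)" fields). The `parse_row` helper and the `ROW` pattern are made up for illustration and are not part of SIFT4G_Create_Genomic_DB; the 95% threshold is the one quoted from the earlier issue.

```python
import re

# Assumed row layout, inferred only from the log excerpt in this thread:
#   <chr> <pct> (<n>/<d>) <pct> (<n>/<d>) <pct>(<n>/<d>)
ROW = re.compile(
    r"^(\S+)"                        # chromosome or scaffold name
    r"\s+\d+\s*\((\d+)/(\d+)\)"      # Genes with SIFT Scores
    r"\s+\d+\s*\((\d+)/(\d+)\)"      # Pos with SIFT scores
    r"\s+\d+\s*\((\d+)/(\d+)\)\s*$"  # Pos with Confident Scores
)


def pct(num, den):
    """Percentage, treating an empty denominator (e.g. 0/0) as 0."""
    return 100.0 * num / den if den else 0.0


def parse_row(line):
    """Hypothetical helper: return the three percentages for one log row,
    or None if the line does not match the assumed layout (e.g. the header)."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    g_n, g_d, p_n, p_d, c_n, c_d = (int(x) for x in m.groups()[1:])
    return {
        "chr": m.group(1),
        "genes_pct": pct(g_n, g_d),
        "pos_pct": pct(p_n, p_d),
        "confident_pct": pct(c_n, c_d),
    }


row = parse_row(
    "ALL 100 (37304/37410) 100 (148558782/148683715) 15(22342363/148558782)"
)
# The first two columns are the ones compared against 95% in the earlier issue.
built_ok = row["genes_pct"] > 95 and row["pos_pct"] > 95
print(row["chr"], built_ok, round(row["confident_pct"], 1))
```

With the ALL line from this issue, the first two columns clear 95%, while the third stays low, which matches the reading above: the build is fine, but many positions lack enough homologous sequences for a confident score.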

wpf95 commented 2 years ago

Thanks very much! I will go ahead and use it.

pauline-ng / SIFT4G_Create_Genomic_DB

'Pos with Confident Scores' is low in "CHECK_GENES.LOG" #65