pauline-ng / SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
GNU General Public License v3.0

'Pos with Confident Scores' is low in "CHECK_GENES.LOG" #65

Closed wpf95 closed 2 years ago

wpf95 commented 2 years ago

Hi Pauline, this is Pengfei. I constructed a new database and then checked it. In "CHECK_GENES.LOG" I got the summary line "ALL 100 (37304/37410) 100 (148558782/148683715) 15(22342363/148558782)". You said, "Your database is done if the percentages are high for the last 3 different columns." But 'Pos with Confident Scores' is low:

| Chr | Genes with SIFT Scores | Pos with SIFT scores | Pos with Confident Scores |
|---|---|---|---|
| 1 | 99 (1536/1544) | 100 (6106447/6111160) | 17(1010023/6106447) |
| 10 | 100 (1583/1586) | 100 (6491552/6493442) | 12(793428/6491552) |
| 11 | 100 (1656/1661) | 100 (6918920/6922883) | 14(994782/6918920) |
| 12 | 100 (621/623) | 100 (2720527/2721496) | 11(308259/2720527) |
| 13 | 100 (1279/1284) | 100 (4639379/4642032) | 13(597783/4639379) |
| 14 | 100 (786/787) | 100 (3394346/3395503) | 15(516357/3394346) |
| 15 | 100 (1533/1537) | 100 (5208466/5212123) | 17(868543/5208466) |
| 16 | 100 (1186/1188) | 100 (5177871/5178717) | 19(1002244/5177871) |
| 17 | 100 (1078/1081) | 100 (4690593/4693044) | 10(450682/4690593) |
| 18 | 100 (1952/1958) | 100 (6976052/6981044) | 25(1732116/6976052) |
| 19 | 100 (2076/2076) | 100 (8342833/8342833) | 14(1182843/8342833) |
| 2 | 100 (1593/1598) | 100 (7236744/7242000) | 9(664736/7236744) |
| 20 | 100 (569/570) | 100 (2484254/2485544) | 12(307858/2484254) |
| 21 | 100 (892/896) | 99 (3738173/3766981) | 15(556627/3738173) |
| 22 | 100 (1027/1032) | 100 (4520633/4525211) | 10(456486/4520633) |
| 23 | 100 (1152/1153) | 100 (3866972/3868343) | 18(709417/3866972) |
| 24 | 100 (554/554) | 100 (2515278/2515278) | 10(251265/2515278) |
| 25 | 100 (1182/1183) | 100 (4439937/4440745) | 17(755517/4439937) |
| 26 | 100 (716/719) | 100 (2842268/2844173) | 10(293528/2842268) |
| 27 | 100 (421/422) | 100 (1802775/1803409) | 17(301302/1802775) |
| 28 | 99 (528/531) | 100 (2425027/2426437) | 12(286890/2425027) |
| 29 | 100 (1043/1048) | 100 (3824056/3834534) | 17(644781/3824056) |
| 3 | 100 (2121/2127) | 100 (7798555/7805777) | 17(1320641/7798555) |
| 4 | 100 (1288/1292) | 100 (5537040/5541068) | 13(716239/5537040) |
| 5 | 100 (2124/2127) | 100 (8413279/8414984) | 18(1488746/8413279) |
| 6 | 100 (1027/1027) | 100 (4114550/4114550) | 13(552589/4114550) |
| 7 | 100 (2062/2065) | 100 (7961434/7965089) | 17(1380150/7961434) |
| 8 | 99 (1186/1195) | 100 (4956491/4963221) | 12(595398/4956491) |
| 9 | 100 (881/882) | 100 (3810505/3811393) | 11(436621/3810505) |
| MT | 0 (0/8) | 0 (0/13964) | 0(0/0) |
| NKLS02000056.1 | 100 (2/2) | 100 (1997/1997) | 97(1939/1997) |
| NKLS02000065.1 | 100 (1/1) | 100 (483/483) | 0(0/483) |
| NKLS02000090.1 | 100 (1/1) | 100 (3012/3012) | 0(0/3012) |
| NKLS02000101.1 | 100 (1/1) | 100 (2267/2267) | 100(2267/2267) |
| NKLS02000125.1 | 0 (0/0) | 0 (0/0) | 0(0/0) |
| NKLS02000137.1 | 0 (0/0) | 0 (0/0) | 0(0/0) |
| NKLS02000210.1 | 0 (0/0) | 0 (0/0) | 0(0/0) |
| NKLS02000218.1 | 100 (1/1) | 100 (857/857) | 0(0/857) |
| . . . | | | |
| ALL | 100 (37304/37410) | 100 (148558782/148683715) | 15(22342363/148558782) |

Then I found your reply in this issue, https://github.com/pauline-ng/SIFT4G_Create_Genomic_DB/issues/43#issuecomment-762525915, where you say: "your numbers look good to me. "Genes with SIFT scores" and "Pos with SIFT Scores" are > 95%." So can the database I constructed be used? I look forward to your reply. Thanks very much!
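For anyone running the same check, the comparison above can be sketched in a few lines of Python. This is not part of SIFT4G_Create_Genomic_DB; `parse_row`, `pct`, and `coverage_ok` are hypothetical helper names, and the row layout is assumed to match the whitespace-separated lines pasted in this issue (the real log file may use tabs or differ in detail). The 95% threshold is taken from the reply quoted above and applies only to the first two columns.

```python
import re

# Assumed row layout, based on the lines pasted in this issue:
#   <chrom> <pct> (<n>/<d>) <pct> (<n>/<d>) <pct>(<n>/<d>)
ROW_RE = re.compile(
    r"^(\S+)\s+"
    r"\d+\s*\((\d+)/(\d+)\)\s+"
    r"\d+\s*\((\d+)/(\d+)\)\s+"
    r"\d+\s*\((\d+)/(\d+)\)\s*$"
)


def pct(num, den):
    """Percentage num/den, defined as 0.0 when the denominator is 0."""
    return 100.0 * num / den if den else 0.0


def parse_row(line):
    """Parse one data row; return None for headers or unrecognized lines."""
    m = ROW_RE.match(line.strip())
    if m is None:
        return None
    g_n, g_d, p_n, p_d, c_n, c_d = map(int, m.groups()[1:])
    return {
        "chrom": m.group(1),
        "genes": pct(g_n, g_d),      # Genes with SIFT Scores
        "pos": pct(p_n, p_d),        # Pos with SIFT scores
        "confident": pct(c_n, c_d),  # Pos with Confident Scores
    }


def coverage_ok(row, threshold=95.0):
    """True if the first two columns exceed the threshold (95% per the quoted reply)."""
    return row["genes"] > threshold and row["pos"] > threshold


summary = parse_row(
    "ALL 100 (37304/37410) 100 (148558782/148683715) 15(22342363/148558782)"
)
print(summary["chrom"], coverage_ok(summary), round(summary["confident"], 1))
```

Recomputing the percentages from the raw counts avoids relying on the rounded integers printed in the log. The "confident" column is reported but deliberately left out of `coverage_ok`, since the quoted reply only names the first two columns.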

pauline-ng commented 2 years ago

Hi Peng Fei,

The log tells me:

  1. your database is being constructed correctly
  2. your organism's proteome doesn't have many homologues

Point 2 can have several reasons:

wpf95 commented 2 years ago

Thanks very much! I will go ahead and use it.