pauline-ng / SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
GNU General Public License v3.0

How to evaluate the SIFT database? #43

Closed yywyaoyaowu closed 3 years ago

yywyaoyaowu commented 3 years ago

Hi Pauline, this is Yaoyao. When I built a database for a new genome, I got a warning, so I checked the database. My first question is: does the warning affect the database construction?

  1. How can I evaluate whether the database is well constructed?

  2. This is the warning I got. BioPerl, which is needed to run DB::Fasta, is quite important for SIFT database construction, but I don't see anything wrong. Is this OK?

     "Possible precedence issue with control flow operator at /home/yw2326/software/miniconda3/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
     done making single records template
     making noncoding records file"

  3. I checked the database as suggested under "Check the database" on the website. I used `zcat chr01.gz | grep CDS | awk '($11!="NA"){print}' | wc -l` to check for sites with no SIFT score, and found that 4.3% of CDS sites had no SIFT score. Then I looked at "CHECK_GENES.LOG" and got "ALL 98 (43865/44746) 99 (117728116/118341687) 87(102554181/117728116)". You said, "Your database is done if the percentages are high for the last 3 different columns." I suppose 87% to 99% is quite high, right? What numbers indicate a good database?

     | Chr | Genes with SIFT Scores | Pos with SIFT Scores | Pos with Confident Scores |
     |-------|------------------|--------------------------|-------------------------|
     | chr01 | 98 (5715/5853) | 99 (15569236/15662914) | 87 (13555622/15569236) |
     | chr02 | 98 (4245/4324) | 100 (11980016/12034306) | 89 (10612715/11980016) |
     | chr03 | 98 (4342/4416) | 99 (11824558/11885076) | 89 (10505875/11824558) |
     | chr04 | 98 (3980/4062) | 99 (10687933/10741726) | 86 (9207919/10687933) |
     | chr05 | 99 (2917/2958) | 100 (7889554/7918062) | 87 (6862649/7889554) |
     | chr06 | 98 (3887/3955) | 100 (9922134/9968730) | 86 (8492265/9922134) |
     | chr07 | 97 (3059/3142) | 99 (8457188/8515533) | 88 (7443496/8457188) |
     | chr08 | 98 (3188/3269) | 99 (8486670/8544174) | 88 (7441767/8486670) |
     | chr09 | 98 (3194/3255) | 100 (8552574/8591423) | 88 (7505256/8552574) |
     | chr10 | 98 (3157/3211) | 100 (7853302/7890876) | 87 (6813450/7853302) |
     | chr11 | 98 (3029/3085) | 100 (8336085/8374210) | 85 (7111593/8336085) |
     | chr12 | 98 (3152/3216) | 99 (8168866/8214657) | 86 (7001574/8168866) |
     | ALL | 98 (43865/44746) | 99 (117728116/118341687) | 87 (102554181/117728116) |

I look forward to your reply. Thanks very much!
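For readers repeating the per-chromosome check: the awk filter quoted above keeps rows whose column 11 is not "NA", so the pipeline as written counts CDS rows that do have a value in that column. A minimal Python sketch that reports both counts in one pass might look like the following. It assumes, as the quoted command does, that the file is whitespace-delimited and that column 11 holds the SIFT score; the function name `count_cds_scores` is hypothetical and not part of SIFT4G.

```python
import gzip


def count_cds_scores(path, score_col=11, missing="NA"):
    """Count CDS rows with and without a value in the score column.

    Mirrors `zcat FILE | grep CDS | awk '($11!="NA"){print}' | wc -l`,
    but also counts the rows that the awk filter would drop.
    Assumption: column 11 is the SIFT score, as in the quoted command.
    """
    with_score = 0
    without_score = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Same substring match that `grep CDS` performs.
            if "CDS" not in line:
                continue
            fields = line.split()  # whitespace split, like awk's default
            # awk treats a missing field as an empty string, which is != "NA".
            value = fields[score_col - 1] if len(fields) >= score_col else ""
            if value != missing:
                with_score += 1
            else:
                without_score += 1
    return with_score, without_score


# Example (path is illustrative):
# scored, unscored = count_cds_scores("chr01.gz")
# print(100.0 * unscored / (scored + unscored))  # percent of CDS rows without a score
```

Reporting both counts makes it straightforward to derive the fraction of CDS rows without a score, rather than comparing two separate `wc -l` runs.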

pauline-ng commented 3 years ago

Hi Yaoyao,

  1. This is OK; the warning appears because the scripts were written a long time ago.
  2. Your numbers look good to me. "Genes with SIFT Scores" and "Pos with SIFT Scores" are > 95%.

Your database looks good to me. I'd recommend going ahead and using it.
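The check described in this reply can be sketched in Python. The row format below is inferred from the summary lines pasted in this thread, and the 95% threshold on the first two columns comes from the reply above; the names `parse_row` and `passes` are hypothetical helpers, not part of the SIFT4G scripts. The percentages are recomputed from the raw counts rather than read from the rounded values.

```python
import re

# One summary row as pasted in this thread, e.g.
# "chr01 98 (5715/5853) 99 (15569236/15662914) 87(13555622/15569236)"
ROW = re.compile(
    r"^(?P<chrom>\S+)\s+"
    r"\d+\s*\((?P<g_num>\d+)/(?P<g_den>\d+)\)\s+"
    r"\d+\s*\((?P<p_num>\d+)/(?P<p_den>\d+)\)\s+"
    r"\d+\s*\((?P<c_num>\d+)/(?P<c_den>\d+)\)\s*$"
)


def parse_row(line):
    """Return recomputed percentages for one row, or None if it does not match."""
    match = ROW.match(line.strip())
    if match is None:
        return None

    def pct(num, den):
        return 100.0 * int(match[num]) / int(match[den])

    return {
        "chrom": match["chrom"],
        "genes_pct": pct("g_num", "g_den"),      # Genes with SIFT Scores
        "pos_pct": pct("p_num", "p_den"),        # Pos with SIFT Scores
        "confident_pct": pct("c_num", "c_den"),  # Pos with Confident Scores
    }


def passes(row, threshold=95.0):
    """Apply the > 95% rule of thumb to the first two columns only."""
    return row["genes_pct"] > threshold and row["pos_pct"] > threshold


# Example:
# row = parse_row("ALL 98 (43865/44746) 99 (117728116/118341687) 87(102554181/117728116)")
# print(passes(row), round(row["confident_pct"], 1))
```

The sketch deliberately leaves "Pos with Confident Scores" out of `passes`, since no threshold for that column is stated here; it is still returned so it can be inspected separately.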

yywyaoyaowu commented 3 years ago

Thanks very much!

wpf95 commented 2 years ago

> Hi Yaoyao,
>
>   1. This is OK; the warning appears because the scripts were written a long time ago.
>   2. Your numbers look good to me. "Genes with SIFT Scores" and "Pos with SIFT Scores" are > 95%.
>
> Your database looks good to me. I'd recommend going ahead and using it.

Hi pauline-ng, my "Pos with Confident Scores" is only 17%, although "Genes with SIFT Scores" and "Pos with SIFT Scores" look good, so I am not sure whether the database can be used:

27 100 (421/422) 100 (1802775/1803409) 17(301302/1802775)

I look forward to your reply. Thanks very much!