pauline-ng / SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
GNU General Public License v3.0

Using human test files: 'Pos with Confident Scores' is low in "CHECK_GENES.LOG" #66

Closed Melanie-Wilkinson closed 2 years ago

Melanie-Wilkinson commented 2 years ago

Hi Pauline,

What does it mean when the 3rd column of CHECK_GENES.LOG is low for the human test?

Chr   Genes with SIFT Scores   Pos with SIFT scores     Pos with Confident Scores
21    99 (810/822)             100 (2261496/2263334)    68 (1542519/2261496)
MT    100 (7/7)                100 (12241/12241)        18 (2147/12241)
ALL   99 (817/829)             100 (2273737/2275575)    68 (1544666/2273737)

There were no errors in the run (below) but no singleRecords_with_scores were created.

entered mkdir ./test_files/homo_sapiens_small/GRCh38.83
converting gene format to use-able input
done converting gene format
making single records file
done making single records template
making noncoding records file
done making noncoding records
make the fasta sequences
done making the fasta sequences
start siftsharp, getting the alignments
/g/data/ht96/Mel_UQ/sift2/sift4g/bin/sift4g -d /g/data/ht96/Mel_UQ/sift2/SIFT_database/uniref90.fasta -q ./test_files/homo_sapiens_small/all_prot.fasta --subst ./test_files/homo_sapiens_small/subst --out ./test_files/homo_sapiens_small/SIFT_predictions --sub-results
done getting all the scores
populating databases
checking the databases
zipping up ./test_files/homo_sapiens_small/chr-src/*
All done!
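For readers who want to check the numbers, the percentages in the CHECK_GENES.LOG rows quoted above can be recomputed from the counts in parentheses. The following is only a minimal sketch: the rows are pasted in from this thread rather than read from a file, the helper names are made up for illustration, and the 50% cutoff is an arbitrary assumption, not a threshold defined by SIFT.

```python
import re

# Rows copied from the CHECK_GENES.LOG output quoted in this thread.
ROWS = """\
21 99 (810/822) 100 (2261496/2263334) 68(1542519/2261496)
MT 100 (7/7) 100 (12241/12241) 18(2147/12241)
ALL 99 (817/829) 100 (2273737/2275575) 68(1544666/2273737)
"""

# Matches a "(numerator/denominator)" count pair.
FRACTION = re.compile(r"\((\d+)/(\d+)\)")


def confident_fraction(row):
    """Return (chromosome, percent) using the last count pair on the row,
    which is the 'Pos with Confident Scores' column."""
    chrom = row.split()[0]
    num, den = map(int, FRACTION.findall(row)[-1])
    return chrom, 100.0 * num / den


def low_confidence(rows, threshold=50.0):
    """List chromosomes whose confident-position percentage is below
    `threshold` (an illustrative cutoff, not a SIFT-defined value)."""
    results = [confident_fraction(r) for r in rows.splitlines() if r.strip()]
    return [chrom for chrom, pct in results if pct < threshold]


if __name__ == "__main__":
    for line in ROWS.splitlines():
        chrom, pct = confident_fraction(line)
        print(f"{chrom}: {pct:.1f}% of scored positions are confident")
    print("below threshold:", low_confidence(ROWS))
```

The denominator of the last column is the number of positions that received a SIFT score, so the figure describes how many scored positions are confident, not how many positions exist in total.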

pauline-ng commented 2 years ago

Hi Melanie,

A position with low confidence means that the set of homologous sequences the SIFT prediction is based on is not diverse enough. You may see a lot of 'damaging' predictions for proteins on these chromosomes; that is due to insufficient diversity in the sequence alignment.

Maybe try running sift4g with swissprot-trembl as your protein database and see if that improves the number?
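As a rough illustration of that suggestion, the only argument that would change in the sift4g call shown in the log above is the `-d` protein database. The sketch below just assembles the argument list and does not run anything; the flags are taken from that log, while the function name and the database path are hypothetical placeholders. The pipeline normally issues this command itself, so this only shows which piece differs.

```python
import shlex


def sift4g_command(sift4g_bin, protein_db, workdir):
    """Assemble the sift4g argument list seen in the run log,
    with the protein database (-d) as the only thing swapped out."""
    return [
        sift4g_bin,
        "-d", protein_db,
        "-q", f"{workdir}/all_prot.fasta",
        "--subst", f"{workdir}/subst",
        "--out", f"{workdir}/SIFT_predictions",
        "--sub-results",
    ]


if __name__ == "__main__":
    cmd = sift4g_command(
        "sift4g",
        "/path/to/swissprot_trembl.fasta",  # hypothetical path, not from the thread
        "./test_files/homo_sapiens_small",
    )
    # Print a shell-quoted version for inspection only.
    print(shlex.join(cmd))
```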

Melanie-Wilkinson commented 2 years ago

Thank you for confirming that that is what is expected for the human test using uniref90.fasta.

I wasn't getting .regions files for either the test or mango, but after loading the python2 module everything seems fixed.

For mango I'm now getting mostly >88 for 'Pos with Confident Scores', so uniref90.fasta seems appropriate for mango.

pauline-ng commented 2 years ago

Great! Glad it worked out.