pauline-ng / SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
GNU General Public License v3.0

Questions regarding the results #56

Closed yaada100 closed 2 years ago

yaada100 commented 2 years ago

Hello Pauline,

I have a couple of questions regarding the protein alignment results. If I have understood correctly, the queries (in the SIFT_prediction folder) used in the files are derived from the gtf file, right? But they differ from the sequences described by the gtf file. It looks as if a sequence has been inserted.

  1. Can you explain, or link me to an explanation of, what change is made and how these sequences are created or chosen as the query?
  2. Also, if the only change made is the longest increasing subsequence algorithm, why do the start and end seem not to change?
  3. Lastly, this query sequence is aligned locally and pairwise (Smith-Waterman, SW) with the collected sequences, and SIFT scores are calculated. But is the sequence found in the *.aligned.fasta files the whole sequence? Because the whole sequence is aligned, it looks more like a global alignment. For example, in the Homo sapiens files:

Found in ENST00000628202.aligned.fasta: MAAIPALDPEAEPSMDVILVGSSELSSSVSPGTGRDLIAYEVKANQRNIEDICICCGSLQVHTQHPLFEGGICAPCKDKFLDALFLYDDDGYQSYCSICCSGETLLICGNPDCTRCYCFECVDSLVGPGTSGKVHAMSNWVCYLCLPSSRSGLLQRRRKWRSQLKAFYDRESENPLEMFETVPVWRRQPVRVLSLFEDIKKELTSLGFLESGSDPGQLKHVVDVTDTVRKDVEEWGPFDLVYGATPPLGHTCDRPPSWYLFQFHRLLQYARPKPGSPRPFFWMFVDNLVLNKEDLDVASRFLEMEPVTIPDVHGGSLQNAVRVWSNIPAIRSRHWALVSEEELSLLAQNKQSSKLAAKWPTKLVKNCFLPLREYFKYFSTELTSSL length of sequences: 386

Found in Homo_sapiens.GRCh38.pep.all.fa:

ENSP00000486001.1 pep:known chromosome:GRCh38:21:44246352:44261890:-1 gene:ENSG00000142182.8 transcript:ENST00000628202.2 gene_biotype:protein_coding transcript_biotype:protein_coding

MAAIPALDPEAEPSMDVILVGSSELSSSVSPGTGRDLIAYEVKANQRNIEDICICCGSLQMAAIPALDPEAEPSMDVILVGSSELSSSVSPGTGRDLIAYEVKANQRNIEDICICCGSLQVHTQHPLFEGGICAPCKDKFLDALFLYDDDGYQSYCSICCSGETLLICGNPDCTRCYCFECVDSLVGPGTSGKVHAMSNWVCYLCLPSSRSGLLQRRRKWRSQLKAFYDRESENPLEMFETVPVWRRQPVRVLSLFEDIKKELTSLGFLESGSDPGQLKHVVDVTDTVRKDVEEWGPFDLVYGATPPLGHTCDRPPSWYLFQFHRLLQYARPKPGSPRPFFWMFVDNLVLNKEDLDVASRFLEMEPVTIPDVHGGSLQNAVRVWSNIPAIRSRHWALVSEEELSLLAQNKQSSKLAAKWPTKLVKNCFLPLREYFKYFSTELTSSL length of sequences: 446

Insert : MAAIPALDPEAEPSMDVILVGSSELSSSVSPGTGRDLIAYEVKANQRNIEDICICCGSLQ
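A comparison like the one above can be checked mechanically. The following is a minimal sketch, not part of SIFT4G, that tests whether a longer protein differs from a shorter one by a single contiguous insertion and, if so, reports its position and content. The function name `find_single_insertion` and the toy sequences are hypothetical.

```python
def find_single_insertion(short_seq, long_seq):
    """Return (position, inserted_segment) if long_seq equals short_seq
    plus one contiguous insertion, otherwise None."""
    if len(long_seq) < len(short_seq):
        return None
    # Length of the shared prefix.
    prefix = 0
    while prefix < len(short_seq) and short_seq[prefix] == long_seq[prefix]:
        prefix += 1
    # Length of the shared suffix, capped so it cannot overlap the prefix.
    max_suffix = len(short_seq) - prefix
    suffix = 0
    while suffix < max_suffix and short_seq[-1 - suffix] == long_seq[-1 - suffix]:
        suffix += 1
    # Prefix and suffix must cover the whole short sequence.
    if prefix + suffix != len(short_seq):
        return None
    return prefix, long_seq[prefix:len(long_seq) - suffix]


# Toy example: the first four residues are duplicated in the longer sequence.
result = find_single_insertion("MKTAYIAK", "MKTAMKTAYIAK")
print(result)
```

For a duplicated leading segment, as in the sequences above, the reported position is the end of the first copy, because both placements of the insertion give the same longer sequence.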

pauline-ng commented 2 years ago

Hi Yaada,

Our code takes the gtf file, parses the coding regions described in the gtf file (along with frame and orientation), retrieves the DNA sequence, and translates the DNA sequence into the protein.
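The steps described above can be sketched roughly as follows. This is an illustrative simplification, not the actual SIFT4G_Create_Genomic_DB code: the function names `parse_cds` and `translate_transcript` are hypothetical, the genome is assumed to be already loaded into a dict of chromosome name to sequence, only the frame of the first CDS segment is applied, and the standard genetic code is used for every chromosome (which would not be appropriate for mitochondrial DNA).

```python
import re
from collections import defaultdict
from itertools import product

# Standard genetic code, built in TCAG codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def parse_cds(gtf_lines):
    """Group CDS features by transcript_id as (chrom, start, end, strand, frame)."""
    cds = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            frame = int(fields[7]) if fields[7].isdigit() else 0
            cds[match.group(1)].append(
                (fields[0], int(fields[3]), int(fields[4]), fields[6], frame)
            )
    return cds


def translate_transcript(segments, genome):
    """Join CDS segments in transcript order and translate up to the first stop."""
    strand = segments[0][3]
    ordered = sorted(segments, key=lambda s: s[1], reverse=(strand == "-"))
    parts = []
    for chrom, start, end, _, _ in ordered:
        piece = genome[chrom][start - 1:end].upper()  # GTF is 1-based, inclusive
        if strand == "-":
            piece = piece.translate(COMPLEMENT)[::-1]  # reverse complement
        parts.append(piece)
    dna = "".join(parts)[ordered[0][4]:]  # skip bases before the first full codon
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X for codons with N etc.
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Toy example: two CDS segments on the plus strand separated by an intron.
toy_genome = {"chr1": "ATGAAACCCCCTTTGGGTAA"}
toy_gtf = [
    'chr1\ttest\tCDS\t1\t6\t.\t+\t0\tgene_id "g1"; transcript_id "t1";',
    'chr1\ttest\tCDS\t12\t20\t.\t+\t0\tgene_id "g1"; transcript_id "t1";',
]
toy_cds = parse_cds(toy_gtf)
print(translate_transcript(toy_cds["t1"], toy_genome))
```

Because the protein is rebuilt from the gtf coordinates and the genomic DNA, any difference between the gtf and a separately downloaded peptide file shows up directly in the translated query.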

If Homo_sapiens.GRCh38.pep.all.fa doesn't match what our code is parsing from the gtf file, then perhaps the gtf file isn't the same version as Homo_sapiens.GRCh38.pep.all.fa.
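One way to test for such a version mismatch is to compare transcript versions between the two files. The sketch below assumes Ensembl-style conventions: peptide FASTA headers that contain `transcript:<id>.<version>` (as in the header quoted earlier in this thread) and gtf attributes named `transcript_id` and `transcript_version`. The function names are hypothetical, and a gtf without a `transcript_version` attribute would simply yield no entries.

```python
import re

PEP_RE = re.compile(r"transcript:([A-Za-z0-9_]+)\.(\d+)")
GTF_ID_RE = re.compile(r'transcript_id "([^"]+)"')
GTF_VER_RE = re.compile(r'transcript_version "(\d+)"')


def pep_versions(fasta_lines):
    """Map transcript ID -> version from peptide FASTA header lines."""
    versions = {}
    for line in fasta_lines:
        if line.startswith(">"):
            match = PEP_RE.search(line)
            if match:
                versions[match.group(1)] = match.group(2)
    return versions


def gtf_versions(gtf_lines):
    """Map transcript ID -> version from gtf attribute columns."""
    versions = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        tid = GTF_ID_RE.search(line)
        ver = GTF_VER_RE.search(line)
        if tid and ver:
            versions[tid.group(1)] = ver.group(1)
    return versions


def version_mismatches(gtf, pep):
    """Return {transcript_id: (gtf_version, pep_version)} where the two differ."""
    return {
        tid: (gtf[tid], pep[tid])
        for tid in gtf.keys() & pep.keys()
        if gtf[tid] != pep[tid]
    }


# Toy example; the gtf line and its version number are made up for illustration.
pep = pep_versions([">ENSP00000486001.1 pep:known transcript:ENST00000628202.2"])
gtf = gtf_versions(
    ['21\thavana\tCDS\t1\t9\t.\t-\t0\ttranscript_id "ENST00000628202"; transcript_version "1";']
)
print(version_mismatches(gtf, pep))
```

Any transcript reported here was annotated differently in the two releases, so its translated query may legitimately differ from the peptide file.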

yaada100 commented 2 years ago

Hello,

First of all, thank you for your answer. But if I understand correctly, there are 1504 different protein-coding transcript IDs in the gtf file and only 826 aligned.fasta files (in SIFT_predictions) for the Homo sapiens genome. So 676 transcript IDs have been disregarded.

So I am confused about some aspects:

  1. By what criteria are the transcript IDs chosen?
  2. There are multiple entries for the same transcript ID, with different regions. How is the region that will be translated and aligned chosen?
  3. Is the SIFT score calculated from the amino acid sequence alone, or is it computed from the nucleotide sequence?
  4. Lastly, two "query genomes" have been provided, Homo_sapiens.GRCh38.dna.chromosome.21.fa and Homo_sapiens.GRCh38.dna.chromosome.MT.fa. Is one of those just the mitochondrial DNA, and how does the script work with two queries? Are the results significant for just one file ("query") or for both, and if both, how?

Thank you in advance for your help.

pauline-ng commented 2 years ago

Hi Yaada,

If you're interested in GRCh38, you can use our pre-computed predictions located here

The aligned.fasta files are intermediate files -- please use the final database that's generated.