yaada100 closed this issue 2 years ago
Hi Yaada,
Our code takes the gtf file, parses the coding regions described in the gtf file (along with frame and orientation), retrieves the DNA sequence, and translates the DNA sequence into the protein.
If Homo_sapiens.GRCh38.pep.all.fa doesn't match what our code parses from the gtf file, then perhaps the gtf file isn't the same version as Homo_sapiens.GRCh38.pep.all.fa.
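The steps described above (collect the CDS segments, respect orientation and frame, fetch the DNA, translate) can be sketched roughly as follows. This is a minimal illustration, not the actual SIFT code: the function names, the toy coordinates, and the use of the standard genetic code are assumptions made for the example.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate_cds(chrom_seq, cds_segments, strand, frame=0):
    """Translate CDS segments given as 1-based inclusive (start, end) pairs.

    chrom_seq    -- chromosome sequence on the forward strand (hypothetical input)
    cds_segments -- CDS coordinates as they appear in GTF columns 4 and 5
    strand       -- "+" or "-" (GTF column 7)
    frame        -- bases to skip before the first complete codon (GTF column 8
                    of the first CDS in transcript order)
    """
    # Join the segments in genomic order, then flip for minus-strand transcripts.
    dna = "".join(chrom_seq[start - 1:end] for start, end in sorted(cds_segments))
    if strand == "-":
        dna = reverse_complement(dna)
    dna = dna[frame:]

    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # "X" for codons with N etc.
        if aa == "*":  # stop codon ends the protein
            break
        protein.append(aa)
    return "".join(protein)


# Toy example: two exons separated by a 5-base intron.
toy_chrom = "ATGGCCGTAAGAAATAG"
print(translate_cds(toy_chrom, [(1, 6), (12, 17)], "+"))
```

If the gtf file and the peptide FASTA come from different releases, the CDS coordinates fed into a routine like this can differ from the ones used to build the FASTA entry, which would produce a different protein.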
Hello,
First of all, thank you for your answer. But if I understand correctly, there are 1504 distinct protein-coding transcript IDs in the gtf file, and only 826 aligned.fasta files (in SIFT_predictions) for the Homo sapiens genome. So 676 transcript IDs have been disregarded.
So I am confused about some aspects:
Thank you in advance for your help.
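One way to see exactly which transcripts have no aligned.fasta file is to compare the two sets of IDs directly. The sketch below assumes that the files are named `<transcript_id>.aligned.fasta` (as in ENST00000628202.aligned.fasta) and that coding transcripts are those with at least one `CDS` row in the gtf file. The function names are made up for the illustration.

```python
import re
from pathlib import Path

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')
SUFFIX = ".aligned.fasta"


def coding_transcripts(gtf_lines):
    """Collect transcript IDs that have at least one CDS feature."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            ids.add(match.group(1))
    return ids


def aligned_transcripts(file_names):
    """Derive transcript IDs from names like ENST00000628202.aligned.fasta."""
    return {
        Path(name).name[: -len(SUFFIX)]
        for name in file_names
        if name.endswith(SUFFIX)
    }


def missing_alignments(gtf_lines, file_names):
    """Transcript IDs present in the gtf file but without an aligned.fasta file."""
    return sorted(coding_transcripts(gtf_lines) - aligned_transcripts(file_names))
```

With real data this could be called as `missing_alignments(open("my.gtf"), os.listdir("SIFT_predictions"))` (paths are placeholders). Taking the set difference gives the list of missing IDs and their count directly, rather than subtracting the two totals by hand.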
Hi Yaada,
If you're interested in GRCh38, you can use our pre-computed predictions located here
The aligned.fasta files are intermediate files -- please use the final database that's generated.
Hello Pauline,
I have a couple of questions regarding the protein alignment results. If I have understood correctly, the query sequences used in the files (in the SIFT_prediction folder) are derived from the gtf file, right? But they differ from the sequences found in the gtf file. It looks as if a sequence has been inserted.
Found in ENST00000628202.aligned.fasta:
MAAIPALDPEAEPSMDVILVGSSELSSSVSPGTGRDLIAYEVKANQRNIEDICICCGSLQVHTQHPLFEGGICAPCKDKFLDALFLYDDDGYQSYCSICCSGETLLICGNPDCTRCYCFECVDSLVGPGTSGKVHAMSNWVCYLCLPSSRSGLLQRRRKWRSQLKAFYDRESENPLEMFETVPVWRRQPVRVLSLFEDIKKELTSLGFLESGSDPGQLKHVVDVTDTVRKDVEEWGPFDLVYGATPPLGHTCDRPPSWYLFQFHRLLQYARPKPGSPRPFFWMFVDNLVLNKEDLDVASRFLEMEPVTIPDVHGGSLQNAVRVWSNIPAIRSRHWALVSEEELSLLAQNKQSSKLAAKWPTKLVKNCFLPLREYFKYFSTELTSSL
Length of sequence: 386
Found in Homo_sapiens.GRCh38.pep.all.fa:
ENSP00000486001.1 pep:known chromosome:GRCh38:21:44246352:44261890:-1 gene:ENSG00000142182.8 transcript:ENST00000628202.2 gene_biotype:protein_coding transcript_biotype:protein_coding
MAAIPALDPEAEPSMDVILVGSSELSSSVSPGTGRDLIAYEVKANQRNIEDICICCGSLQMAAIPALDPEAEPSMDVILVGSSELSSSVSPGTGRDLIAYEVKANQRNIEDICICCGSLQVHTQHPLFEGGICAPCKDKFLDALFLYDDDGYQSYCSICCSGETLLICGNPDCTRCYCFECVDSLVGPGTSGKVHAMSNWVCYLCLPSSRSGLLQRRRKWRSQLKAFYDRESENPLEMFETVPVWRRQPVRVLSLFEDIKKELTSLGFLESGSDPGQLKHVVDVTDTVRKDVEEWGPFDLVYGATPPLGHTCDRPPSWYLFQFHRLLQYARPKPGSPRPFFWMFVDNLVLNKEDLDVASRFLEMEPVTIPDVHGGSLQNAVRVWSNIPAIRSRHWALVSEEELSLLAQNKQSSKLAAKWPTKLVKNCFLPLREYFKYFSTELTSSL
Length of sequence: 446
Insert : MAAIPALDPEAEPSMDVILVGSSELSSSVSPGTGRDLIAYEVKANQRNIEDICICCGSLQ
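A comparison like the one above can be reproduced programmatically. The following is a minimal sketch, assuming the longer sequence differs from the shorter one by exactly one contiguous insertion. The function name is hypothetical, and the short strings in the example are toy stand-ins for the two full sequences.

```python
def find_single_insertion(shorter, longer):
    """Return (position, inserted) if `longer` equals `shorter` with one
    contiguous segment inserted, otherwise None.

    `position` is a 0-based index into `shorter`. When the inserted segment
    repeats its neighbouring residues, the right-most placement is reported.
    """
    extra = len(longer) - len(shorter)
    if extra <= 0:
        return None

    # Length of the common prefix of both sequences.
    prefix = 0
    while prefix < len(shorter) and shorter[prefix] == longer[prefix]:
        prefix += 1

    # After skipping `extra` residues in `longer`, the tails must agree.
    if shorter[prefix:] == longer[prefix + extra:]:
        return prefix, longer[prefix:prefix + extra]
    return None


# Toy stand-ins: the first five residues appear twice in the longer sequence.
from_aligned_fasta = "MAAIPVHTQ"
from_pep_fasta = "MAAIPMAAIPVHTQ"

result = find_single_insertion(from_aligned_fasta, from_pep_fasta)
if result is not None:
    position, inserted = result
    print(position, inserted)
    # A tandem repeat of the start shows up as the insert matching the prefix.
    print(from_aligned_fasta.startswith(inserted))
```

In the sequences quoted above, the 60-residue insert is identical to the first 60 residues of the aligned.fasta sequence, so the peptide entry contains the start of the protein twice in a row.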