tkzeng / Pangolin

Pangolin is a deep-learning method for predicting splice site strengths.
GNU General Public License v3.0

Prediction of 5'ss usage and 3'ss usage #6

Open Witiy opened 2 years ago

Witiy commented 2 years ago

Hi! I have a question: is the prediction performance the same for 5'ss usage and 3'ss usage? I think 3'ss usage is more difficult to predict. If it is OK, could you please share your test set? I would really appreciate it!

tkzeng commented 2 years ago

Hi,

Pangolin was trained to predict splice sites without distinguishing between 5' and 3' sites, so that information is not actually present in the test set. If you would like to test this, I would recommend downloading the dataset from the SpliceAI paper, https://basespace.illumina.com/s/otSPW8hnhaZR -> SpliceAI train code -> GTEx -> gtex_dataset.txt, which has the positions of 5' and 3' sites separately for each gene. Then you can get the sequence surrounding each site like:

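import pyfastx

# chromosome and position are taken from a row of gtex_dataset.txt
fasta = pyfastx.Fasta("GRCh37.primary_assembly.genome.fa.gz")
# 10,000-nt window around the splice site
seq = fasta[chromosome][position-5001:position+4999].seq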

and get scores for each splice site following the code in scripts/custom_usage.py.
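One thing to watch with the slice above: for a site within about 5 kb of a chromosome start, position-5001 becomes negative, and a negative slice start in plain Python counts from the end of the sequence instead of raising an error. Below is a minimal sketch of a padded window helper that works on an ordinary Python string. The function name window_around, its parameters, and the choice to pad with "N" are all hypothetical and not part of Pangolin or pyfastx; the default offsets simply mirror the slice above.

```python
def window_around(seq: str, position: int, left: int = 5001,
                  right: int = 4999, pad: str = "N") -> str:
    """Return seq[position-left : position+right], padded to a fixed length.

    Hypothetical helper (not part of Pangolin or pyfastx). Wherever the
    window runs past either end of `seq`, it is filled with `pad` so the
    result always has length left + right.
    """
    if left < 0 or right < 0:
        raise ValueError("left and right must be non-negative")
    if not 0 <= position <= len(seq):
        raise ValueError("position must lie within the sequence")

    start = position - left   # may be negative near the sequence start
    end = position + right    # may exceed len(seq) near the sequence end

    left_pad = pad * max(0, -start)
    right_pad = pad * max(0, end - len(seq))
    core = seq[max(0, start):min(len(seq), end)]
    return left_pad + core + right_pad


# Small demonstration on a toy sequence
toy = "ACGTACGTAC"
print(window_around(toy, 5, left=3, right=2))  # GTACG
print(window_around(toy, 1, left=3, right=2))  # NNACG
print(window_around(toy, 9, left=3, right=2))  # GTACN
```

With pyfastx, the chromosome string could be obtained as fasta[chromosome].seq and passed in as seq. Whether "N" padding is the right treatment for sites near chromosome ends depends on how the scoring script handles it, so check that before relying on scores from padded windows.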