sheynkman-lab / Long-Read-Proteogenomics

A workflow for enhanced protein isoform detection through integration of long-read RNA-seq and mass spectrometry-based proteomics.
MIT License

No full splice matches in protein classification #160

Closed tabeariepe closed 1 year ago

tabeariepe commented 2 years ago

Hi,

I ran your pipeline on my dataset, and everything works fine up to the sqanti protein step. However, when I classify the proteins, no full splice matches (FSMs) are found, even though they are present in the sqanti transcript output. Most transcript FSMs are classified as NNC with a novel C-terminus. I tried to figure out where this comes from and saw that, for most transcript FSMs, pr_cterm_diff is -3 in the sqanti protein output. Do you know what could cause this classification problem? Could it be a problem with my input data (I start from a previously generated sqanti3 file)?

bj8th commented 2 years ago

We have not encountered this issue before, but it may be due to either an error in your input file or a mismatch between the reference genomes used. If you can provide a small test file, I can look into the issue.

tabeariepe commented 2 years ago

I checked the reference genomes that I used and did not notice any mismatches. I uploaded small test files of my sqanti output here: https://filesender.surf.nl/?s=download&token=242f4bd5-15e8-4919-8ffa-4a83a75484e7 It would be great if you could have a look at them.

As a reference, I used the GENCODE v39 primary assembly and the corresponding pc_translation.fa file.

tabeariepe commented 2 years ago

Hi, I figured out what causes the classification problem. It seems that the stop codon is not included in the reference coding sequence, while it is included in the pacbio coding sequence. Therefore, I get an offset of -3 (the length of one stop codon) in pr_cterm_diff for most FSMs.
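
For anyone hitting the same thing, here is a minimal sketch of how a stop codon on only one side produces the 3-nt offset, and how trimming it from both sequences removes the offset. This is a toy illustration, not the pipeline's code: `strip_stop_codon` and `toy_cterm_diff` are hypothetical helpers, the sequences are made up, and the sign convention (reference length minus query length) is an assumption chosen to reproduce the -3 described above.

```python
# Standard stop codons (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def strip_stop_codon(cds: str) -> str:
    """Return the CDS without its trailing stop codon, if it ends with one."""
    cds = cds.upper()
    if len(cds) >= 3 and cds[-3:] in STOP_CODONS:
        return cds[:-3]
    return cds


def toy_cterm_diff(ref_cds: str, query_cds: str) -> int:
    """Hypothetical stand-in for a C-terminus offset, in nucleotides.

    Assumes both sequences share the same start, so the difference in
    length equals the difference at the 3' end (reference minus query).
    """
    return len(ref_cds) - len(query_cds)


# Made-up example: identical coding sequences, but only the query
# (e.g. the long-read CDS) carries the stop codon.
ref_cds = "ATGGCCAAA"        # reference CDS, stop codon excluded
query_cds = "ATGGCCAAATAA"   # query CDS, stop codon included

raw_diff = toy_cterm_diff(ref_cds, query_cds)
fixed_diff = toy_cterm_diff(strip_stop_codon(ref_cds),
                            strip_stop_codon(query_cds))

print(raw_diff, fixed_diff)
```

Trimming both sides (rather than only the query) keeps the comparison symmetric, so it does not matter which of the two inputs happens to include the stop codon.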