pedronachtigall / CodAn

CDS prediction in transcripts

Pep file for partial model #6

Open jwasmuth opened 4 years ago

jwasmuth commented 4 years ago

I was excited to read about the performance of CodAn in your paper, so I am trying it out. I worked through the tutorial and noticed that a Pep file was generated when using the FULL model, but not when using the PARTIAL model. I could not find an option to generate it. Is there a reason for this? Thanks, James

pedronachtigall commented 4 years ago

Hi @jwasmuth ,

Unfortunately, we only implemented a translation function for FULL CDSs, because there the translation frame and the resulting peptide sequence can be identified with confidence. We decided not to implement one for PARTIAL transcripts because, for some of them (i.e., those with no start codon and no stop codon), it may be hard to decide the best translation frame based on our pipeline. For that reason, we decided to leave this decision to the user in their downstream analysis.

Tip: you can write a script that translates every CDS starting with a start codon (ATG) in frame 1, and translates all other PARTIAL transcripts in frames 1, 2, and 3 on the PLUS strand. Then check whether the CDS ends with a stop codon (TAA, TGA, or TAG); if it does, you know that the translated sequence should end with an "*" (or another symbol, depending on the tool you are using). If the CDS does not end with a stop codon, iterate through all frames of that CDS and pick the peptide sequence with no internal stop codon. But keep in mind that all three frames of a PARTIAL CDS may lack an internal stop codon; that is where the hard decision has to be made: which frame should be considered?

PS1: You can use transeq, the Biopython module, or any other tool to perform the translation.

PS2: When I have time, I will write this script and make it available at https://github.com/pedronachtigall/CodAn/tree/master/scripts/

Best regards, Pedro

jwasmuth commented 4 years ago

Thanks @pedronachtigall for the explanation. I had assumed that, for the partial model, CodAn would generate an ORF that was in frame 1. My mistake. Your 'tip' makes sense to me. If the ORF is long enough, I would be surprised if the 'incorrect' frames did not have internal stop codons. Something we'll look into.

A long, long time ago, I wrote prot4EST to correct errors in EST-based transcriptomes (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-5-187). It performed well but is much too slow for the current RNA-Seq transcriptomes and I am excited by the benchmarking in your CodAn paper.

Best wishes James

pedronachtigall commented 4 years ago

Hi @jwasmuth ,

You are welcome, and thanks for contacting me. You are right: if an ORF is long enough, the wrong frames will very likely contain internal stop codons. Shorter ORFs, however, may have no internal stop codon in any frame.
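To put a rough number on this, here is a back-of-the-envelope sketch. It assumes, unrealistically, that codons in a wrong frame are uniformly random and independent, so each codon is a stop with probability 3/64; real sequences have biased composition, so treat the values as an order-of-magnitude guide only. The function name `p_no_stop` is hypothetical.

```python
def p_no_stop(n_codons: int) -> float:
    """Chance that n random codons contain no stop (61 of 64 are sense codons)."""
    return (61 / 64) ** n_codons


if __name__ == "__main__":
    # Compare short and long ORFs (lengths in codons).
    for n in (10, 30, 100, 300):
        print(f"{n:>4} codons: P(no stop in a wrong frame) = {p_no_stop(n):.4f}")
```

Under this simple model the probability decays geometrically with ORF length, which matches the intuition that only short partial ORFs are likely to remain ambiguous.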

Furthermore, I wrote the script I mentioned, and it can be downloaded here (TranslatePartial.py). It basically follows the "tip" from my previous answer, with an additional check for the case where the partial CDS starts with an "ATG" that is not the "correct" start codon. I ran a few tests, and it returns satisfactory results. I hope it helps with your analysis.

Thank you for your excitement and for using CodAn in your analysis. Contact me if you have any trouble or feedback.

Best regards, Pedro

margaretc-ho commented 1 year ago

Hi Pedro @pedronachtigall ,

Thanks for CodAn; it's a great ORF prediction program and it has been very useful to us. We recently observed some odd behavior from TranslatePartial.py when running it on an ORF_sequences.fa file output by CodAn: it fails to translate some sequences, so they are absent from the resulting ORF_sequences.pep file. I have rerun it several times and am still missing about 60 sequences that are in ORF_sequences.fa but not in the resulting pep file. I can email the files to you if you would like to take a look. Please let me know what you think. Thank you

M
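For anyone trying to pin down a similar discrepancy, the missing records can be listed with a small Python sketch like the one below. It assumes that the pep file reuses the same record IDs (the first whitespace-delimited token after ">") as the FASTA file; if TranslatePartial.py alters the headers, the comparison would need adjusting. The helper names `fasta_ids` and `missing_ids` are hypothetical.

```python
def fasta_ids(lines):
    """Collect record IDs: the first token after '>' on each header line."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }


def missing_ids(fa_lines, pep_lines):
    """IDs present in the nucleotide FASTA but absent from the pep file."""
    return sorted(fasta_ids(fa_lines) - fasta_ids(pep_lines))


if __name__ == "__main__":
    # Tiny in-memory example; for real files, pass open file handles instead:
    #   with open("ORF_sequences.fa") as fa, open("ORF_sequences.pep") as pep:
    #       print(missing_ids(fa, pep))
    fa = [">seq1 partial", "GCCGCC", ">seq2", "ATGGCCTAA"]
    pep = [">seq2", "MA*"]
    print(missing_ids(fa, pep))
```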