tderrien / FEELnc

FEELnc : FlExible Extraction of LncRNA
GNU General Public License v3.0

mRNA training set #12

Closed jdmontenegro closed 7 years ago

jdmontenegro commented 7 years ago

Dear Sir/Madam,

I am very interested in using this tool on a non-model organism whose transcriptome was recently assembled. I understand that to do this I need to supply a training set of reliable protein-coding genes, hence my question: what would be the best approach to getting a training set? My first idea was to align the transcripts to a curated database like UniProt/Swiss-Prot and keep transcripts with >=75% identity over 90% of the transcript length. Then I remembered that this would exclude protein-coding transcripts that are truncated due to the assembly protocol, library prep or other reasons, so I figured I could use the predictions made by BUSCO as a training set. What do you think? Would this be a sensible approach? Could you suggest some other approach?
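
To make the first idea concrete, here is a minimal sketch of the identity/coverage filter. It assumes the alignment hits were exported as tab-separated rows with the columns query ID, subject ID, percent identity, query start, query end and query length (a custom column layout chosen for this example, not a default of any aligner). The function name and the sample rows are hypothetical.

```python
def select_training_ids(rows, min_identity=75.0, min_coverage=0.90):
    """Return the IDs of transcripts with at least one hit passing both thresholds.

    Each row is assumed to be: qseqid, sseqid, pident, qstart, qend, qlen
    (tab-separated). This column layout is an assumption of the sketch.
    """
    keep = set()
    for line in rows:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, _sseqid, pident, qstart, qend, qlen = line.split("\t")
        # abs() handles hits on the reverse strand, where qstart > qend
        aligned = abs(int(qend) - int(qstart)) + 1
        coverage = aligned / int(qlen)
        if float(pident) >= min_identity and coverage >= min_coverage:
            keep.add(qseqid)
    return keep


# Hypothetical hits, for illustration only
example_rows = [
    "t1\tP1\t80.0\t1\t950\t1000",   # passes: 80% identity, 95% coverage
    "t2\tP2\t80.0\t1\t500\t1000",   # fails: only 50% coverage
    "t3\tP3\t60.0\t1\t1000\t1000",  # fails: identity below 75%
    "t4\tP4\t75.0\t900\t1\t1000",   # passes: reverse-strand hit, 90% coverage
]
selected = select_training_ids(example_rows)
print(sorted(selected))
```

As noted above, a coverage threshold like this drops truncated but genuinely coding transcripts, which is the main weakness of the approach.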

Best regards,

Juan Montenegro

tderrien commented 7 years ago

Dear Juan,

Thank you for your interest in our tool and sorry for the delay in my reply! Actually, this really depends on the "non-model" organism you are working on, since a closely related species may already have an accurate annotation of protein-coding genes (which in that case could be used as the mRNA training set for FEELnc). Otherwise, if you know that protein sequences of your species of interest are included in known databases such as UniProt/Swiss-Prot, it could be worth extracting them and using them as the mRNA training set. Hope this helps. All the best,

Thomas

jdmontenegro commented 7 years ago

Hi Thomas, thanks for your reply. In the end I used BUSCO to identify orthologs of universal single-copy genes. The BUSCO output included a list of 950 full-length protein-coding transcripts that I used for training. FEELnc worked really well afterwards, although the specificity/sensitivity curves converged at quite a low value, so I used a higher protein-coding potential cutoff to improve specificity at the cost of sensitivity.
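
For anyone wondering how raising the cutoff trades sensitivity for specificity, here is a small illustrative sketch. It is not FEELnc's own code: the scores and labels are made up, and it assumes that mRNA is the positive class and that a transcript is called coding when its coding-potential score is at or above the cutoff.

```python
def sens_spec(scores, is_coding, cutoff):
    """Sensitivity and specificity for calling 'coding' when score >= cutoff.

    Assumption of this sketch: coding transcripts (mRNAs) are the positive class.
    """
    tp = fn = tn = fp = 0
    for score, coding in zip(scores, is_coding):
        called_coding = score >= cutoff
        if coding and called_coding:
            tp += 1
        elif coding:
            fn += 1
        elif called_coding:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Made-up coding-potential scores: four known mRNAs, then four known non-coding
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.5, 0.3, 0.1]
is_coding = [True, True, True, True, False, False, False, False]

low = sens_spec(scores, is_coding, cutoff=0.5)
high = sens_spec(scores, is_coding, cutoff=0.75)
print(low)   # (0.75, 0.5)
print(high)  # (0.5, 1.0)
```

With the higher cutoff, fewer non-coding transcripts are wrongly called coding (specificity goes up), but more true mRNAs fall below the cutoff (sensitivity goes down).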

Cheers,