tderrien / FEELnc

FEELnc : FlExible Extraction of LncRNA
GNU General Public License v3.0

Many protein coding transcripts in the reference annotation were classified as LncRNA ? #17

Closed yxlong032 closed 6 years ago

yxlong032 commented 6 years ago

Hi, I am using FEELnc to profile the lncRNA-ome of porcine tissues. The full set of commands was as follows:

=========================================================

1)FEELnc_filter.pl -p 1 -i ./merged.gtf -a ./Sus_scrofa.Sscrofa11.1.90.gtf -b transcript_biotype=protein_coding > candidate_lncRNA.gtf

2)FEELnc_codpot.pl -i candidate_lncRNA.gtf -a ./Sus_scrofa.Sscrofa11.1.90.gtf -b transcript_biotype=protein_coding -g ./Sus_11.1.fa --mode=shuffle

3)FEELnc_classifier.pl -i ./feelnc_codpot_out/candidate_lncRNA.gtf.lncRNA.gtf -a ./Sus_scrofa.Sscrofa11.1.90.gtf > lncRNA_classes.txt

==========================================================

I found that some transcript loci in the merged.gtf file that perfectly match protein-coding transcript models in the reference annotation remained in the candidate_lncRNA.gtf file. Their class_code in merged.gtf was "=". After the codpot module, these transcript loci were also present in the candidate_lncRNA.gtf.lncRNA.gtf file. How does this happen? What are these transcript loci: lncRNA or protein-coding mRNA?

vwucher commented 6 years ago

Hi,

Concerning the filtering step, normally all transcripts in 'merged.gtf' that overlap transcripts flagged 'transcript_biotype=protein_coding' should have been removed. Have you checked in your reference annotation 'Sus_scrofa.Sscrofa11.1.90.gtf' whether the protein-coding RNAs are annotated exactly with the 'transcript_biotype=protein_coding' flag? Sometimes the flag differs depending on who made the annotation. Another way is to extract the protein-coding transcripts from your reference annotation into a new file and then run the 'filter' with the same command, using that file as the reference and without the '-b transcript_biotype=protein_coding' option. It will then check the overlap between your transcripts in the 'merged.gtf' file and all transcripts in the 'new' reference annotation.
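A minimal sketch of that check, assuming a tab-separated GTF whose ninth column holds `key "value";` attributes as in Ensembl releases. The function names `count_attribute_values` and `keep_by_attribute` are hypothetical helpers written for this illustration and are not part of FEELnc.

```python
import re
from collections import Counter

# Matches GTF attributes written as: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Turn the ninth GTF column into a dict of attribute -> value."""
    return dict(ATTR_RE.findall(field))


def count_attribute_values(gtf_lines, key="transcript_biotype", feature="transcript"):
    """Count the values of one attribute over all lines of a given feature type."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header or comment line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature:
            continue
        counts[parse_attributes(cols[8]).get(key, "<missing>")] += 1
    return counts


def keep_by_attribute(gtf_lines, key, value):
    """Yield every non-comment line whose attribute `key` equals `value`."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 9 and parse_attributes(cols[8]).get(key) == value:
            yield line
```

Calling `count_attribute_values(open("Sus_scrofa.Sscrofa11.1.90.gtf"))` would show which biotype labels the annotation really uses and whether any transcripts lack the attribute. Writing the lines returned by `keep_by_attribute(..., "transcript_biotype", "protein_coding")` to a new file gives a protein-coding-only reference for the second approach.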

For the 'coding potential' module, the benchmark of this tool did not give 'perfect' results, and neither did the other tools. So using either FEELnc or another tool to predict the biotype of transcripts can lead to false predictions. But here, because you use '--mode=shuffle', I would advise using it together with the option '--spethres=CODspe,NONspe', where CODspe and NONspe are the desired prediction specificities for the coding and non-coding transcripts respectively. This option allows you to get more stringent predictions for the mRNAs and the lncRNAs and to get a new class, the TUCPs, i.e. the transcripts with a coding potential (the FEELnc score) between those of the mRNAs and the lncRNAs. To choose the specificity thresholds, you can check the '{INPUT}_RF_TGROC.png' plot made from the 10-fold cross-validation on the learning data.
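To illustrate how two thresholds give three classes, here is a small sketch. It assumes that a higher score means more coding potential and that the two specificity values have already been translated into two score cutoffs. The function `classify`, its parameter names, and the handling of scores exactly on a cutoff are assumptions made for the example, not FEELnc's actual code.

```python
def classify(score, coding_cutoff, noncoding_cutoff):
    """Assign a transcript to mRNA, lncRNA or TUCP from its coding-potential score.

    Assumption: higher score = more coding. Scores between the two
    cutoffs are left unresolved and reported as TUCP.
    """
    if noncoding_cutoff > coding_cutoff:
        raise ValueError("noncoding_cutoff must not exceed coding_cutoff")
    if score >= coding_cutoff:
        return "mRNA"
    if score <= noncoding_cutoff:
        return "lncRNA"
    return "TUCP"
```

With a single cutoff (both values equal) every transcript is forced into mRNA or lncRNA. Moving the two cutoffs apart trades coverage for stringency, since ambiguous transcripts fall into the TUCP class instead of being called lncRNA.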

Moreover, I saw that you still got the error for the 'filter' module in another issue thread (I think it was you). But here you do have the output of the 'filter' module. Did you finally manage to solve that issue?

Thanks, Valentin Wucher