tderrien / FEELnc

FEELnc : FlExible Extraction of LncRNA
GNU General Public License v3.0

lncRNA gene counts #50

Closed JasminJ1 closed 2 years ago

JasminJ1 commented 2 years ago

Hi,

I am trying to obtain gene counts for lncRNAs so that I can build a lncRNA signature that differentiates between patients with different conditions. I want to use htseq-count to count lncRNA reads in my samples. htseq-count requires a GTF file, but it is unclear to me which one I should provide to count lncRNA reads specifically. Should I combine the full GENCODE annotation file with the FEELnc lncRNAs found by the codpot module, run htseq-count on the combined file, and then select the lncRNAs to get their counts? Or should I use only the lncRNAs discovered by FEELnc?

Thanks in advance!

tderrien commented 2 years ago

Hi,

Thank you for using FEELnc. You could (and should) use the combined file: Gencode + novel lncRNAs (+ novel mRNAs).

Best,

Thomas
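A minimal sketch of the "count on the combined annotation, then select the lncRNAs" approach, for illustration only. It assumes htseq-count was already run on the combined GTF and wrote its usual tab-separated output (feature ID, then count, with summary rows starting with `__`), that the known lncRNAs carry a GENCODE-style `gene_type "lncRNA"` attribute (older releases use other biotype labels, so adjust the set), and that the FEELnc novel lncRNA GTF has `gene_id` attributes. The function names are hypothetical and not part of FEELnc or HTSeq.

```python
import re

# Matches key "value" pairs in the GTF attribute column.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def gtf_attributes(line):
    """Return the quoted attributes of one GTF line as a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return {}
    return dict(ATTR_RE.findall(fields[8]))


def gene_ids(gtf_lines, biotypes=None):
    """Collect gene_id values from GTF lines.

    If `biotypes` is given, keep only genes whose gene_type attribute
    is in that set (assumption: GENCODE-style gene_type attribute).
    """
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        attrs = gtf_attributes(line)
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        if biotypes is None or attrs.get("gene_type") in biotypes:
            ids.add(gene_id)
    return ids


def select_counts(count_lines, wanted):
    """Keep htseq-count rows whose feature ID is in `wanted`."""
    counts = {}
    for line in count_lines:
        line = line.rstrip("\n")
        # Skip blank lines and summary rows such as __no_feature.
        if not line or line.startswith("__"):
            continue
        fields = line.split("\t")
        if fields[0] in wanted:
            counts[fields[0]] = int(fields[-1])
    return counts
```

In use, the wanted set would be the union of `gene_ids(gencode_lines, {"lncRNA"})` and `gene_ids(feelnc_novel_lncrna_lines)`, passed to `select_counts` together with the lines of each sample's count file.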

JasminJ1 commented 2 years ago

Hi @tderrien,

I have another question. To be more confident that the novel lncRNAs and the novel noORF transcripts are indeed lncRNAs, I used gffread to create FASTA files from the novel lncRNA GTF and the novel noORF GTF output by FEELnc codpot. I merged these and used the result as input for blastx to see whether any transcripts aligned to protein sequences, and then removed the protein-aligning sequences from the FASTA file. However, I need a GTF file as input for htseq-count. Would it be correct to use the transcript IDs from the blastx filtering to remove the protein-aligning transcripts from the GTF files output by FEELnc codpot? Or do I have to somehow create new GTF files from the filtered FASTA file?

Many thanks!

tderrien commented 2 years ago

Hi @JasminJ1 ,

Yes, you could remove the transcripts matching protein-coding databases from the .gtf based on their transcript_id. But keep in mind that you may lose true lncRNAs simply because they have a (good?) match with protein-coding databases. It is always a trade-off between specificity and sensitivity.

Cheers

Thomas
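A minimal sketch of filtering the GTF by transcript_id, for illustration only. It assumes blastx was run with tabular output (`-outfmt 6`), where the first column is the query ID, and that those query IDs are the same transcript IDs that appear in the `transcript_id` attribute of the FEELnc GTF (as is the case when gffread names the FASTA records by transcript ID). The function names are hypothetical. Any e-value or identity cutoff, which is where the specificity/sensitivity trade-off is set, would have to be applied to the hit list beforehand.

```python
import re

# Captures the value of the transcript_id attribute on a GTF line.
TRANSCRIPT_ID_RE = re.compile(r'transcript_id "([^"]+)"')


def query_ids_from_blast_tabular(blast_lines):
    """Return the set of query IDs (column 1) that have at least one hit."""
    ids = set()
    for line in blast_lines:
        line = line.strip()
        if line and not line.startswith("#"):
            ids.add(line.split("\t")[0])
    return ids


def drop_transcripts(gtf_lines, drop_ids):
    """Return the GTF lines whose transcript_id is not in `drop_ids`.

    Lines without a transcript_id attribute (for example header
    comments) are kept unchanged.
    """
    kept = []
    for line in gtf_lines:
        match = TRANSCRIPT_ID_RE.search(line)
        if match and match.group(1) in drop_ids:
            continue
        kept.append(line)
    return kept
```

The filtered lines can be written back to a new .gtf file and merged with the reference annotation before running htseq-count. No new GTF has to be built from the FASTA file, because the coordinates are already in the original GTF.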