tderrien / FEELnc

FEELnc : FlExible Extraction of LncRNA
GNU General Public License v3.0

FEELnc_classifier.pl issue when determining antisense transcripts with GTF-file versions #44

Closed cc-prolix closed 3 years ago

cc-prolix commented 3 years ago

I am trying to classify antisense transcripts using FEELnc_classifier.pl with a GENCODE GTF file as the reference annotation. When I use a transcript-level GENCODE reference (containing only transcript and exon entries), I get wildly different results from those obtained with the complete reference file (which also includes gene entries and others). How is FEELnc_classifier.pl supposed to be used, and which input files are recommended? Thank you!

vwucher commented 3 years ago

Hi,

Thanks for using FEELnc. Could you send us a small example of what is happening, so we can check it?

Thanks in advance, Valentin

cc-prolix commented 3 years ago

Hey, I am only using FEELnc_classifier.pl, as follows:

FEELnc_classifier.pl -i some_lncRNA_candidate.gtf -a GENCODE_annotation_chr38.gtf > candidate_lncRNA_classes.txt

The GENCODE reference annotation GTF file contains protein coding genes:

seqnames  start  end    width  strand  source  type        score  ...
chr1      65419  71585  6167   +       HAVANA  gene        ...
chr1      65419  71585  6167   +       HAVANA  transcript  ...
chr1      65419  65433  15     +       HAVANA  exon        ...
...       ...    ...    ...    ...     ...     ...         ...

The candidate GTF file contains lncRNA information in the following form:

seqnames  start  end  width  strand  source  type        score  ...
chr1      xxx    yyy  zzz    +       DB      transcript  ...
chr1      aaa    bbb  ccc    +       DB      exon        ...
chr1      qqq    www  eee    +       DB      exon        ...
...       ...    ...  ...    ...     ...     ...         ...

I created a subset of the classifier output containing only the unique gene IDs of best hits with an "antisense" direction and a "genic" type. This results in x unique gene IDs.

When repeating the process with the transcript-level GENCODE reference file (without gene entries)...

seqnames  start  end    width  strand  source  type        score  ...
chr1      65419  71585  6167   +       HAVANA  transcript  ...
chr1      65419  65433  15     +       HAVANA  exon        ...
...       ...    ...    ...    ...     ...     ...         ...

...the resulting subset contains far fewer (y) unique gene IDs. I expected to get the same number of unique gene IDs. Could you tell me why this happens and how FEELnc_classifier.pl is supposed to be used in this situation? Thank you for your help!
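
For readers who want to reproduce the two reference files described above, here is a minimal sketch of deriving a transcript-level annotation from a complete GTF. It relies only on the standard GTF layout, in which the feature type is the third tab-separated column. The function name `to_transcript_level` and the toy attribute values (`toyGene`, `toyTx`) are hypothetical and are not part of FEELnc or of the poster's actual workflow.

```python
def to_transcript_level(gtf_lines, keep=("transcript", "exon")):
    """Yield only the GTF lines whose feature type (column 3) is in `keep`.

    Comment lines starting with '#' are passed through unchanged.
    """
    for line in gtf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # GTF columns: seqname, source, feature, start, end, score, strand, frame, attributes
        if len(fields) >= 3 and fields[2] in keep:
            yield line


# Toy input reusing the coordinates quoted in the post; attributes are made up.
toy_gtf = [
    'chr1\tHAVANA\tgene\t65419\t71585\t.\t+\t.\tgene_id "toyGene";\n',
    'chr1\tHAVANA\ttranscript\t65419\t71585\t.\t+\t.\tgene_id "toyGene"; transcript_id "toyTx";\n',
    'chr1\tHAVANA\texon\t65419\t65433\t.\t+\t.\tgene_id "toyGene"; transcript_id "toyTx";\n',
]

transcript_level = list(to_transcript_level(toy_gtf))
print("".join(transcript_level), end="")
```

Because the filter drops lines without touching the transcript and exon records, both files describe the same set of transcripts, which is the precondition for comparing the two classifier runs.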

vwucher commented 3 years ago

Hi,

Normally, you should get the same number of genes, because the overlap is computed per transcript. So if you didn't filter the transcripts, i.e. you have the same list of transcripts, you should get the same results. That is why I asked whether you could send a minimal example that reproduces the behaviour you found, so we can work out why it happens.

Thanks, Valentin

vwucher commented 3 years ago

Hi again,

I ran the classification on the toy example from the git repository, using the annotation with and without the "gene" lines, and I get the same result. The only exception is the "isBest" column (the first one). Did you filter by this column, taking only the best ones? If so, the difference may come from that. Otherwise, send us a toy example and we will look at it.

Thanks, Valentin
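
The check suggested above can be sketched as follows: count the unique antisense genic lncRNA gene IDs in each classifier output, once with and once without the "isBest" filter. This is only an illustration. It assumes the output is tab-separated with a header row and that the columns are named `isBest`, `lncRNA_gene`, `direction` and `type`; those names, the value `"1"` for a best hit, and the toy rows are all assumptions and should be checked against a real output file. The toy data is invented to show the counting logic and does not reproduce actual FEELnc behaviour.

```python
import csv
import io


def antisense_genic_genes(tsv_text, best_only):
    """Return the set of lncRNA gene IDs with an antisense, genic interaction.

    Column names are assumed, not taken from the FEELnc documentation.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    genes = set()
    for row in reader:
        if row["direction"] != "antisense" or row["type"] != "genic":
            continue
        if best_only and row["isBest"] != "1":
            continue
        genes.add(row["lncRNA_gene"])
    return genes


header = "isBest\tlncRNA_gene\tpartnerRNA_gene\tdirection\ttype\n"

# Hypothetical run 1: same interactions, one set of isBest flags.
run_with_gene_lines = header + (
    "1\tLNC1\tGENEA\tantisense\tgenic\n"
    "0\tLNC1\tGENEB\tsense\tintergenic\n"
    "1\tLNC2\tGENEC\tantisense\tgenic\n"
)
# Hypothetical run 2: identical interactions, different isBest flags for LNC1.
run_without_gene_lines = header + (
    "0\tLNC1\tGENEA\tantisense\tgenic\n"
    "1\tLNC1\tGENEB\tsense\tintergenic\n"
    "1\tLNC2\tGENEC\tantisense\tgenic\n"
)

for best_only in (False, True):
    a = antisense_genic_genes(run_with_gene_lines, best_only)
    b = antisense_genic_genes(run_without_gene_lines, best_only)
    print(f"best_only={best_only}: {len(a)} vs {len(b)} unique gene IDs")
```

If the two counts agree without the filter and diverge only with it, the difference comes from the "isBest" flag and not from the set of reported interactions.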

cc-prolix commented 3 years ago

Hey, I did a bit of testing, and it seems the behaviour was caused by my filtering on the "isBest" column, as you suggested. Thank you very much for your help!

vwucher commented 3 years ago

Hi,

You are welcome. Glad that you found the issue!

Don't hesitate to ask if you need anything else, Valentin