tderrien / FEELnc

FEELnc : FlExible Extraction of LncRNA
GNU General Public License v3.0

Misclassification of partner RNA genes #38

Closed kritikakarri closed 4 years ago

kritikakarri commented 5 years ago

Hello, I was using the FEELnc classification filter on my list of lncRNAs with the RefSeq annotation for the mouse genome. I am using the following format for my GTF file:

4 . exon 61545496 61545496 . + . gene_id "ncRNA_as_c4_3298"; transcript_id "ncRNA_as_c4_3298_7"; exon_number "1";
4 . exon 61622060 61622094 . + . gene_id "ncRNA_as_c4_3298"; transcript_id "ncRNA_as_c4_3298_4"; exon_number "1";
4 . exon 61622060 61622094 . + . gene_id "ncRNA_as_c4_3298"; transcript_id "ncRNA_as_c4_3298_8"; exon_number "1";
4 . exon 61622060 61622094 . + . gene_id "ncRNA_as_c4_3298"; transcript_id "ncRNA_as_c4_3298_1"; exon_number "1";
4 . exon 61622060 61622094 . + . gene_id "ncRNA_as_c4_3298"; transcript_id "ncRNA_as_c4_3298_2"; exon_number "1";

The reference GTF file uses a similar format.

The issue is this: the FEELnc classification correctly finds the antisense gene partners Mup3 and Mup20 (NR_149826; NM_001012323), but in addition it predicts two other genes that do not overlap the lncRNA and are not even on the same chromosome, yet it classifies them as intronic with a wrong distance of zero. NR_002445 is on chromosome 16, but I still get a distance of zero and it is predicted as intronic. Another such instance is a lncRNA at chr5:136,150,706-136,166,245 that is actually antisense to Por. FEELnc outputs that correctly, but it also gives LOC102631757 as the best predicted result (with a distance of zero and predicted as intronic). LOC102631757 is located approx. 7 kb away on chr5 and is also found on another chromosome.

I have no idea why this would be happening. Does it have something to do with the format of my reference file? I ask because the test files provided for the human dataset don't give the chromosome number but rather the assembly version.
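One quick way to rule out a naming mismatch between the two files is to compare the values in their first column. The following is only a minimal sketch, assuming both files are plain-text GTF with whitespace-separated columns; the function names and the "chr4" reference line are invented for illustration and are not part of FEELnc.

```python
def seqnames(gtf_lines):
    """Collect the distinct values of column 1 (chromosome / sequence name)."""
    names = set()
    for line in gtf_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        names.add(line.split(None, 1)[0])
    return names


def compare_seqnames(lnc_lines, ref_lines):
    """Return (names only in the lncRNA file, names only in the reference file)."""
    lnc, ref = seqnames(lnc_lines), seqnames(ref_lines)
    return sorted(lnc - ref), sorted(ref - lnc)


if __name__ == "__main__":
    # In-memory example; with real data, pass open file handles instead.
    lnc = ['4 . exon 61622060 61622094 . + . gene_id "ncRNA_as_c4_3298"; '
           'transcript_id "ncRNA_as_c4_3298_4"; exon_number "1";']
    # Hypothetical reference record, invented to show a "4" vs "chr4" mismatch.
    ref = ['chr4 . exon 100 200 . + . gene_id "geneA"; transcript_id "txA";']
    print(compare_seqnames(lnc, ref))
```

If both returned lists are empty, the two files use the same set of sequence names, and the cause is more likely to lie elsewhere.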

vwucher commented 5 years ago

Hello,

Indeed, this is weird. Can you send us the lines from your GTF for all the genes that you cite in your message, to help us find the issue? Please also send us the command line you used to run the analysis. To extract the lines, you could use grep, which would also let you check whether there is an issue with other genes sharing the same name, or something like that. As for the reference file in the test directory, its first field is the chromosome number, but for dog. So the first field should be the chromosome, in your files as in ours :).
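For anyone who prefers a script to grep, the extraction could look roughly like the sketch below. It assumes the identifiers of interest are stored in the `gene_id` attribute (the regular expression would need adapting if they are in `transcript_id` or another attribute). The helper names and the example records are hypothetical.

```python
import re
from collections import defaultdict

# Matches the gene_id attribute of a GTF record, e.g. gene_id "Por";
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def find_gene_records(gtf_lines, gene_ids):
    """Return {gene_id: [matching lines]} using exact gene_id matches."""
    wanted = set(gene_ids)
    hits = defaultdict(list)
    for line in gtf_lines:
        match = GENE_ID_RE.search(line)
        if match and match.group(1) in wanted:
            hits[match.group(1)].append(line.rstrip("\n"))
    return dict(hits)


def chromosomes_per_gene(hits):
    """Map each gene_id to the sorted list of chromosomes it appears on."""
    return {gene: sorted({rec.split(None, 1)[0] for rec in records})
            for gene, records in hits.items()}


if __name__ == "__main__":
    # Invented records: "geneX" deliberately appears on two chromosomes.
    example = [
        '5 . exon 100 200 . + . gene_id "geneX"; transcript_id "t1";',
        '7 . exon 300 400 . - . gene_id "geneX"; transcript_id "t2";',
        '5 . exon 500 600 . + . gene_id "geneY"; transcript_id "t3";',
    ]
    hits = find_gene_records(example, ["geneX"])
    for gene, chroms in chromosomes_per_gene(hits).items():
        print(gene, chroms, len(hits[gene]), "record(s)")
```

Matching on the exact attribute value avoids the partial matches a plain substring search can produce, and a gene reported on more than one chromosome would be worth a closer look.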

Thanks, Valentin