tderrien / FEELnc

FEELnc : FlExible Extraction of LncRNA
GNU General Public License v3.0

FEELnc_codpot - Illegal division by zero at RandomForest.pm line 481 #37

Closed hxk298 closed 5 years ago

hxk298 commented 5 years ago

Hello,

I am trying to run FEELnc_codpot, and I am getting two types of errors. The first error occurs after parsing the genome annotation file. I get the following message:

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet

I think this message comes up because of short sequences that BioPerl can't identify as DNA, RNA, or protein. However, I'm not sure what I need to edit to set the alphabet to DNA manually. In case this information is relevant, the genome annotation file was converted from gff to gtf using "gffread -T", and I edited the seqname to match the chromosome names of the genome sequence file. However, I noticed that the resulting output file is missing all attribute information other than gene_id, transcript_id, and, for some entries, gene_name. So exon_number and all other attributes were lost.

The second error occurs after parsing the candidate lncRNA file made from FEELnc_filter. I get the following message stating that there was an illegal division by zero at line 481. I thought this error was due to transcripts with length of 0, so I tried deleting such transcripts from the filtered gtf, but this problem still persists.

Run random Forest on '/tmp//5646_candidate_lncRNA.test_rna.fa'
Illegal division by zero at /media/jolly/WD/FEELnc/lib/RandomForest.pm line 481, line 12196.

Do you have any suggestions on how to resolve these issues?

Thank you, HK

vwucher commented 5 years ago

Hi,

Sorry for the late reply. Can you send us the files that throw these errors? Preferably a minimal example, if possible. Concerning the short sequences, what are the sizes of your sequences? Normally there shouldn't be any sequence shorter than 200 nt after the filter. You can also run FEELnc_codpot.pl with the '--keeptmp' option to keep the temporary files and check whether some sequences have a length of 0.
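A minimal sketch of that check, in Python, for anyone following along. It is not part of FEELnc, and the function names fasta_lengths and empty_records are made up for this illustration. It assumes the kept temporary files are plain FASTA and reports the records that have a header but no sequence letters.

```python
from typing import Dict, Iterable, List, Optional


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {record id: sequence length} for FASTA-formatted lines."""
    lengths: Dict[str, int] = {}
    current: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # The record id is the first whitespace-delimited token of the header.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def empty_records(lines: Iterable[str]) -> List[str]:
    """Return the ids of records whose sequence length is 0."""
    return [name for name, n in fasta_lengths(lines).items() if n == 0]
```

Both functions accept any iterable of lines, so an open file handle on one of the files left behind by '--keeptmp' can be passed directly to empty_records.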

Thanks, Valentin

hxk298 commented 5 years ago

Hello,

After using FEELnc_filter, the transcript lengths seem to vary from 0 to 999. I also used the '--keeptmp' option for FEELnc_codpot, and I think the warnings may be due to the genome annotation file. The original gff genome annotation that I downloaded does not contain any 'exon' entries in its third column, but exon numbers are given in column 9. So, when I used gffread to convert the gff to gtf, some 'CDS' features somehow got converted into 'exon'. Therefore, in my annotation, I have some genes that only have CDS features and some that have both exon and CDS features.

I looked at the tmp files and saw that after parsing the annotation file, 'coding_orf.fa' and 'coding_rna.fa' are generated. The transcripts with both exon and CDS features gave sequences in both files. However, the transcripts with only CDS features result in a sequence in the ORF file, but not in the RNA file. I think I'm getting the "Got a sequence without letters" warning due to the lack of sequences in the RNA file, and "Illegal division by zero" might come from these transcripts having a sequence length of 0.
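A rough way to list the affected transcripts is sketched here in Python. This is an illustration only, not something FEELnc provides; the function names are hypothetical, and it assumes a tab-separated GTF whose ninth column carries a transcript_id "..." attribute, as in gffread -T output.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set

# Assumes GTF-style attributes such as: transcript_id "t1";
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def features_by_transcript(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each transcript id to the set of feature types (column 3) seen for it."""
    features: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        match = TRANSCRIPT_ID.search(cols[8])
        if match:
            features[match.group(1)].add(cols[2])
    return features


def cds_only_transcripts(lines: Iterable[str]) -> List[str]:
    """Return transcripts that have CDS features but no exon features."""
    feats = features_by_transcript(lines)
    return sorted(t for t, f in feats.items() if "CDS" in f and "exon" not in f)
```

Any transcript returned by cds_only_transcripts would be expected to appear in the ORF file but not in the RNA file.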

I sent you some sample files by email. Please let me know if you want any further information or additional files.

Thank you, HK

hxk298 commented 5 years ago

Update on this issue:

The transcript length issue was my mistake. I was getting the length of each exon instead of the entire transcript. So, when the actual transcript lengths were calculated, they ranged from 198 to tens of thousands.
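For reference, the corrected calculation can be sketched in Python as below. The function name transcript_lengths is hypothetical, and the sketch assumes a tab-separated GTF with transcript_id "..." attributes and non-overlapping exons within each transcript. The transcript length is the sum of its exon lengths, with GTF coordinates treated as 1-based and inclusive.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable

# Assumes GTF-style attributes such as: transcript_id "t1";
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcript_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Sum exon lengths per transcript (assumes exons do not overlap)."""
    lengths: Dict[str, int] = defaultdict(int)
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "exon":
            continue  # only exon records contribute to transcript length
        match = TRANSCRIPT_ID.search(cols[8])
        if match:
            start, end = int(cols[3]), int(cols[4])
            # GTF coordinates are 1-based and inclusive, hence the +1.
            lengths[match.group(1)] += end - start + 1
    return dict(lengths)
```

Taking the minimum of the returned values gives the shortest transcript, which is the number to compare against the filter's length cutoff, rather than the shortest exon.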

The FEELnc_codpot issue was due to the annotation. I contacted the institution that manages the genome annotation, and I was told that the annotation in their database is not up to date. So, I downloaded the most current version of the annotation, which includes the exon data, from a recommended link. With the new annotation, FEELnc_codpot worked and produced gtf files for mRNA, lncRNA, and noORF. So, I think the issue was due to the missing "exon" features in the annotation, which caused a problem when calculating the ORF percentage of each transcript.
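To illustrate why a missing RNA sequence would produce this kind of error, here is a small Python sketch. It is purely hypothetical: the code at RandomForest.pm line 481 is not shown in this thread, and orf_coverage is an invented name. It only assumes that the ORF percentage is the ORF length divided by the transcript length.

```python
from typing import Optional


def orf_coverage(orf_length: int, rna_length: int) -> Optional[float]:
    """Fraction of the transcript covered by its ORF, or None if undefined."""
    if rna_length <= 0:
        # A transcript with no exon features yields an empty RNA sequence;
        # dividing by its length is what would raise a division-by-zero error.
        return None
    return orf_length / rna_length
```

Under that assumption, a transcript with CDS features but no exons has an ORF length above zero and an RNA length of zero, so an unguarded division fails.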

Thank you so much for all the help!

Best, HK