tderrien / FEELnc

FEELnc : FlExible Extraction of LncRNA
GNU General Public License v3.0

ERROR: line with too many characters #51

pgoikoetxea closed 2 years ago

pgoikoetxea commented 2 years ago

Dear Dr. Derrien: I find your pipeline very attractive for several reasons. I'm trying to run it on transcriptome data from a gymnosperm megagenome which is only partially sequenced. I successfully ran the first module in the pipeline (FILTER), but an error emerged when running the CODPOT module. The error is described in the title, and I paste the first few lines here, although I can send the complete error output if you wish:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Each line of the file must be less than 65,536 characters. Line 3510498 is 351472 chars.
STACK: Error::throw
STACK: Bio::Root::Root::throw /home/pablo/Apps/Miniconda3/envs/feelnc/lib/site_perl/5.26.2/Bio/Root/Root.pm:447

By logic, the error must originate from my genome file (FASTA), from which I have extracted several lines around the relevant one with head and tail. I would like to know whether this can be fixed, and how. It strikes me, though, that the previous sequence has 322377 bp but did not trigger the error. Thank you very much, Pablo

tderrien commented 2 years ago

Dear @pgoikoetxea

Thank you for using FEELnc!

Actually, it could be that the genome .fasta file is not correctly formatted: it seems to have one (big) line per sequence.
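To confirm this before changing anything, one could list the lines that reach the limit quoted in the error message. The following is only a minimal sketch: the function name `find_long_lines` is made up for illustration and is not part of FEELnc or BioPerl, and it measures each line without its line terminator, which may differ slightly from how the indexer counts.

```python
from typing import Iterable, Iterator, Tuple

# Limit quoted in the error message ("must be less than 65,536 characters").
MAX_LINE_CHARS = 65_536


def find_long_lines(
    lines: Iterable[str], limit: int = MAX_LINE_CHARS
) -> Iterator[Tuple[int, int]]:
    """Yield (line_number, length) for each line that is not shorter than `limit`.

    Hypothetical helper: lengths exclude the trailing newline, which is an
    assumption about how the limit is applied.
    """
    for number, line in enumerate(lines, start=1):
        length = len(line.rstrip("\r\n"))
        if length >= limit:
            yield number, length
```

It could be used as `with open("genome.fa") as handle: print(list(find_long_lines(handle)))`, where `genome.fa` stands for your own genome file.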

The best option may be to reformat your .fasta file before running feelnc_codpot.pl, for example with fold:

cat genome.fa
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEGLVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHKIPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTLMGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQIATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

fold -b -w 70 genome.fa
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEGLVSVKVSDDF
TIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHKIPQFASRKQLSDAILKEAEE
KIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTLMGQFYVMDDKKTVEQVIAEKEKEFGGKIKI
VEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQIATIGENLVVR
RFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
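Note that fold wraps every line longer than the given width, so a header line (starting with ">") longer than 70 characters would be split as well. If your headers are long, a small script that wraps only the sequence lines may be safer. This is a rough sketch rather than part of FEELnc: `wrap_fasta` is a hypothetical name, and it holds one whole sequence in memory at a time.

```python
from typing import Iterable, Iterator, List


def wrap_fasta(lines: Iterable[str], width: int = 70) -> Iterator[str]:
    """Yield FASTA lines with each sequence wrapped at `width` characters.

    Header lines (starting with '>') pass through unchanged. The sequence
    lines of a record are joined first, so input that is already wrapped
    at another width is re-wrapped consistently. Blank lines are dropped.
    """
    chunks: List[str] = []  # sequence pieces of the current record

    def flush() -> Iterator[str]:
        # Emit the buffered sequence in slices of `width`, then reset.
        sequence = "".join(chunks)
        chunks.clear()
        for start in range(0, len(sequence), width):
            yield sequence[start:start + width]

    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            yield from flush()  # finish the previous record first
            yield line
        elif line:
            chunks.append(line)
    yield from flush()  # last record in the file
```

To write a reformatted copy, one could open `genome.fa` for reading and a new file for writing, then write each yielded line followed by a newline. Both file names are placeholders for your own paths.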

Hope this helps.

Best regards,

Thomas

pgoikoetxea commented 2 years ago

Thank you very much for your fast answer. And YES, my fasta file, downloaded from treegenes.db.org, is formatted as in your first example. Thank you very much for the code to format the lines. Best wishes, Pablo