tderrien / FEELnc

FEELnc : FlExible Extraction of LncRNA
GNU General Public License v3.0

Error when running FEELnc_codpot #10

Closed kevin199011 closed 6 years ago

kevin199011 commented 7 years ago

When I was testing the installation with the test dataset, running FEELnc_codpot.pl -i candidate_lncRNA.gtf -a annotation_chr38.gtf -b transcript_biotype=protein_coding -g genome_chr38.fa --mode=shuffle

I have this error:

Undefined subroutine &Bio::DB::IndexedBase::_strip_crnl called at /usr/local/share/perl5/Bio/DB/Fasta.pm line 295

Could you please let me know what's the possible reason for this?

Thank you!

Kevin

tderrien commented 7 years ago

Hello Kevin,

I suspect a problem with the genome index file. Could you try removing it (e.g. rm genome_chr38.fa.index) and then relaunching the FEELnc_codpot.pl module test?

Best,

Thomas

krawza commented 7 years ago

I get the same error. Has anyone solved this problem yet? I tried deleting the index file and relaunching, but that didn't fix it.

Thank you, Kraw

tderrien commented 7 years ago

Hello,

Thank you for the information. Could you indicate which versions of Perl and bioperl you are using? Best, Thomas

PS: as mentioned in the README, FEELnc has been tested with Perl 5.18.2 and BioPerl 1.6.924.

krawza commented 7 years ago

My Perl version, from perl -version, is: This is perl 5, version 22, subversion 1 (v5.22.1) built for x86_64-linux-gnu-thread-multi (with 58 registered patches, see perl -V for more detail)

For the BioPerl version, I'm not sure how to check it, but I think it's the newest version.

However, I'm very new to bioinformatics and Unix, so I'm sorry if I'm giving the wrong information.

Thank you, Kraw

tderrien commented 7 years ago

Hi Kraw, for the BioPerl version, you can use:
perl -MBio::Root::Version -e 'print $Bio::Root::Version::VERSION,"\n"'
Thanks, Thomas

wangyahuiSinobiocore commented 7 years ago

Hi Kraw, when I run FEELnc_codpot.pl:
FEELnc_codpot.pl -i candiate_lncRNA.gtf -a ../../human-ref/Homo_sapiens.GRCh38.88.gtf -g ../../human-ref/Homo_sapiens.GRCh38.dna.toplevel.fa -b transcript_biotype=protein_coding -b transcript_status=KNOWN --mode=shuffle
I get this error:

Run random Forest on '/tmp//23071_candiate_lncRNA.gtf.test_rna.fa'

  1. Compute the size of each sequence and ORF
  2. Compute the kmer ratio for each kmer and put the output file name in a list
  3. Compute the kmer score for each kmer size on learning and test ORF
     Scoring ORF file: model of log ratio score file '/tmp//23071_candiate_lncRNA.gtf.kmerScoreValues_size12.tmp' is empty... exiting

Could you please let me know what's the possible reason for this?

Thank you!

vwucher commented 7 years ago

Hello,

To better understand this error, can you tell us how many transcripts are in the candiate_lncRNA.gtf file and the approximate size of these transcripts? Also, can you run FEELnc_codpot.pl with the additional option '--kmer="1,2,3,6,9"' and tell us if it works? It will remove the 12-mer from the prediction step.
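A minimal sketch of how those two numbers could be collected, assuming a standard 9-column, tab-separated GTF with a transcript_id attribute on every exon line. The function names are hypothetical helpers, not part of FEELnc, and the transcript size is taken here as the sum of exon lengths.

```shell
# Count distinct transcript_id values found on exon lines of a GTF.
count_transcripts() {
    awk -F'\t' '$3 == "exon" && match($9, /transcript_id "[^"]+"/) {
        seen[substr($9, RSTART, RLENGTH)] = 1
    }
    END { n = 0; for (t in seen) n++; print n }' "$1"
}

# Print min, mean and max transcript length (sum of exon lengths per transcript).
transcript_length_summary() {
    awk -F'\t' '$3 == "exon" && match($9, /transcript_id "[^"]+"/) {
        len[substr($9, RSTART, RLENGTH)] += $5 - $4 + 1
    }
    END {
        n = 0
        for (t in len) {
            if (n == 0 || len[t] < min) min = len[t]
            if (n == 0 || len[t] > max) max = len[t]
            sum += len[t]; n++
        }
        if (n > 0) printf "min=%d mean=%.1f max=%d\n", min, sum / n, max
    }' "$1"
}

# Usage:
#   count_transcripts candiate_lncRNA.gtf
#   transcript_length_summary candiate_lncRNA.gtf
```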

Thanks, Valentin

wangyahuiSinobiocore commented 7 years ago

Hello, I have 32108 transcripts in candiate_lncRNA.gtf. The filter command was:

nohup FEELnc_filter.pl -i merge.gtf -a ../../human-ref/Homo_sapiens.GRCh38.88.gtf -s 200 -b transcript_biotype=protein_coding --monoex 0 -p 40 -o FEE-filter.log 1>candiate_lncRNA.gtf 2>lncRNA-candiate.log&

We have run FEELnc_codpot.pl with the additional option '--kmer="1,2,3,6,9"':
FEELnc_codpot.pl -i candiate_lncRNA.gtf -a ../../human-ref/Homo_sapiens.GRCh38.88.gtf -g ../../human-ref/Homo_sapiens.GRCh38.dna.toplevel.fa -b transcript_biotype=protein_coding -b transcript_status=KNOWN --mode=shuffle -k 1,2,3,6,9
but get this error:

Extract ORFs/cDNAs for candidates RNAs from a GTF file
Parsing file 'candiate_lncRNA.gtf'...
Parse input file: [----------------------------------------------------------------------------------------------------]
Your input GTF file 'candiate_lncRNA.gtf' contains 32108 transcripts
Extracting ORFs/cDNAs 32106/32108...
Extracted '32106' ORF/cDNAs sequences on '32108'.
Run random Forest on '/tmp//27279_candiate_lncRNA.gtf.test_rna.fa'

  1. Compute the size of each sequence and ORF
     Get size: input file '/tmp//27279_candiate_lncRNA.gtf.noncoding_orf.fa.forRandomForest.fa' is empty... exiting

Thanks.
2017-05-11

wangyahui


vwucher commented 7 years ago

Ok, thanks for the update.

Again, can you check whether removing the option "-b transcript_status=KNOWN" makes it work?

And can you also check the files below and tell us, for each one, whether it is empty or not?

Thanks, Valentin

wangyahuiSinobiocore commented 7 years ago

I'm glad to tell you the good news: I have used -k 1,2,3,6,9 and "-b transcript_status=KNOWN" again. I get this result:

Summary file:

-With_cutoff: 0.358
-Nb_lncRNAs: 26928
-Nb_mRNAs: 5178

So wonderful. I have another question. The input file of FEELnc_filter.pl is the GTF from stringtie merge. Should I delete the known transcripts from the result of the new assembly?

Thanks.

2017-05-11

wangyahui


vwucher commented 7 years ago

Ok, that's good news :).

Did you get this result with the -k 1,2,3,6,9 option and without -b transcript_status=KNOWN or with the two options?

For your question: do you want to know whether you should remove already annotated transcripts? The filter will only keep transcripts from the input GTF that do not overlap transcripts from the reference annotation you give (Homo_sapiens.GRCh38.88.gtf in your case). If you want to be more stringent, it may be better to remove all StringTie transcripts that overlap anything in this reference annotation, not only protein_coding but also non-coding RNAs such as ribosomal RNA or microRNA. But if you only want to remove the new transcripts that overlap protein_coding, the way you have done it is the right way.

If I didn't understand the question correctly, tell me.

Valentin

f-lamanna commented 7 years ago

Hello,

I get the following error when I run FEELnc_codpot.pl with GTF files:

FEELnc_codpot.pl -i lamprey.lncrna.merged.filter.gtf -a ../../PMZ_v3.1_final/PMZ_v3.1_genes.gtf -g ../../../FINAL_Lamprey_assembly_072816/Lamprey_final_assembly_07_15_2016.fasta --mode=shuffle

Your input GTF file 'PMZ_v3.1_genes.gtf' contains *20950* transcripts
Extracting ORFs/cDNAs 490/20950...

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Failed validation of sequence '[unidentified sequence]'. Invalid characters were: >_
STACK: Error::throw
STACK: Bio::Root::Root::throw /home/hd/hd_hd/hd_cc141/local/lib/perl5/5.24.0/Bio/Root/Root.pm:447
STACK: Bio::PrimarySeq::validate_seq /home/hd/hd_hd/hd_cc141/local/lib/perl5/5.24.0/Bio/PrimarySeq.pm:338
STACK: Bio::PrimarySeq::_set_seq_by_ref /home/hd/hd_hd/hd_cc141/local/lib/perl5/5.24.0/Bio/PrimarySeq.pm:287
STACK: Bio::PrimarySeq::seq /home/hd/hd_hd/hd_cc141/local/lib/perl5/5.24.0/Bio/PrimarySeq.pm:272
STACK: Bio::PrimarySeq::new /home/hd/hd_hd/hd_cc141/local/lib/perl5/5.24.0/Bio/PrimarySeq.pm:229
STACK: Bio::Seq::new /home/hd/hd_hd/hd_cc141/local/lib/perl5/5.24.0/Bio/Seq.pm:496
STACK: Orf::orfSeq2orfOb /home/hd/hd_hd/hd_cc141/Programs/FEELnc/lib/Orf.pm:290
STACK: ExtractCdnaOrf::CreateORFcDNAFromGTF /home/hd/hd_hd/hd_cc141/Programs/FEELnc/lib/ExtractCdnaOrf.pm:342
STACK: /home/hd/hd_hd/hd_cc141/Programs/FEELnc/scripts/FEELnc_codpot.pl:298

The error does not occur if I use FASTA instead of GTF files. I guess it is something related to the genome FASTA file, even though I cannot spot anything wrong with it.

Thank you, Francesco.

vwucher commented 7 years ago

Hello,

Yeah, it seems that one of your files, either the annotation PMZ_v3.1_genes.gtf or the genome Lamprey_final_assembly_07_15_2016.fasta, is badly formatted or has some unexpected characters in it.

Have you checked these two files? You can try dividing your annotation into several files to find the error, instead of checking the whole file.
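One way to divide the annotation, as a rough sketch: write one GTF per sequence name (column 1) and run codpot on each piece until the failing one shows up. The helper name is hypothetical, and the sketch assumes a tab-separated GTF whose sequence names are safe to use as file names. Output files are appended to, so start from an empty directory.

```shell
# Write one GTF per value of column 1 into an output directory.
split_gtf_by_seqname() {
    gtf=$1
    outdir=$2
    mkdir -p "$outdir"
    awk -F'\t' -v dir="$outdir" '
        /^#/ { next }                      # skip header/comment lines
        NF != 9 {                          # report lines that are not 9-column GTF
            print "line " NR ": expected 9 tab-separated fields, got " NF > "/dev/stderr"
            next
        }
        { out = dir "/" $1 ".gtf"; print >> out; close(out) }
    ' "$gtf"
}

# Usage (same options as in the command above, one piece at a time):
#   split_gtf_by_seqname PMZ_v3.1_genes.gtf pieces
#   for f in pieces/*.gtf; do
#       FEELnc_codpot.pl -i lamprey.lncrna.merged.filter.gtf -a "$f" \
#           -g Lamprey_final_assembly_07_15_2016.fasta --mode=shuffle || echo "failed on $f"
#   done
```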

Thanks, Valentin

yongjieliu commented 7 years ago

Hello, I get the following error when I run this command:
FEELnc_codpot.pl -i input.gtf -a ref.coding.gtf -g ref.fa --mode=shuffle

The lncRNA training file is not set.
Get ORFs/cDNAs for lncRNAs by shuffling mRNA sequences
Input error: Invalid input file, expecting nucleotide sequence line on line 28194

Then I checked my ref.fa file, and I see nothing wrong with it:
cat ref.fa | tail -n +28190 | head -n 6

CCGAATATTGTTGTTTTCATTGCTTCAGGTGGAGTAAAGGCTTCAGAAGGTGGAGGGGTG
GGGTGTGGGAGAGAGCCTGACCTGTTTCAGGTGAAATCCGCTGCTGAAAAAGTGGGTTTG
TGTCTCACTGAGGTGAGTTCGGAACTCCGGGATTCTCCGGAGCTGCTTCCACAAAGAGGG
GAGTTTAGGAATTTGTTCATGAAGCGTAATTGGAAGATAAATGAGTGATTGAGTTTTCCT
GAAAGCTGCTGTTCATAAAGCGATATTGGGGTCCTCAGACTGCTGAATCCACAGTCCTGC
TGCTCAAAGCTATTCATAGCTGAACGCCAGAAGAAACCCTGGAAGCTGTGAAAAAGGGAA

tderrien commented 7 years ago

Hello yongjieliu, It seems that BioPerl doesn't like your ref.fa file... Could you open the file with "vi" and go to line 28194 to look for possible hidden characters? Best,
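Hidden characters such as carriage returns are easy to miss in an editor, so here is a small sketch that makes them visible from the command line. The two function names are hypothetical helpers; the sketch only relies on sed, od and grep.

```shell
# Print one line of a file byte by byte (od -c shows \r, \t and other
# non-printing bytes explicitly).
show_line_bytes() {
    sed -n "${2}p" "$1" | od -c
}

# Succeed (exit status 0) if the given line contains any byte outside
# printable ASCII, e.g. a carriage return, a tab or a non-ASCII character.
line_has_hidden_chars() {
    sed -n "${2}p" "$1" | LC_ALL=C grep -q '[^ -~]'
}

# Usage:
#   show_line_bytes ref.fa 28194
#   line_has_hidden_chars ref.fa 28194 && echo "hidden characters on line 28194"
```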

Thomas

yongjieliu commented 7 years ago

Hello tderrien,

I double-checked my ref.fa and found no hidden characters. It looks just like a normal FASTA file.

Best,

yjliu

tderrien commented 7 years ago

Hello yjliu,

Looking at the BioPerl error message, I suspect that one of the sequence identifiers in your "ref.fa" file does not meet BioPerl's requirements. It seems that one sequence ID starts with >_, which BioPerl does not accept. If the ref.fa file is not too big, you can run grep '>_' ref.fa to look through the sequence IDs and identify any that start with ">_".
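A slightly broader check than the grep above, as a sketch only: list every sequence ID, and report sequence lines that contain anything other than IUPAC nucleotide letters. The function names are hypothetical, and the set of accepted letters is an assumption about what a nucleotide FASTA should contain, not BioPerl's exact validation rule.

```shell
# List sequence identifiers (first word of each header line, without ">").
fasta_ids() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}

# Report sequence lines containing anything other than IUPAC nucleotide
# codes, prefixed by their line number. A trailing carriage return or a
# stray ">" or "_" inside a sequence line will be reported.
fasta_bad_lines() {
    awk '!/^>/ && /[^ACGTUNRYKMSWBDHVacgtunrykmswbdhv]/ { print NR ": " $0 }' "$1"
}

# Usage:
#   fasta_ids ref.fa | head
#   fasta_bad_lines ref.fa
```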

Let me know if this helps. Best,

Thomas

yongjieliu commented 7 years ago

Hello tderrien,

Thank you for your reply. Running grep '>_' ref.fa outputs nothing.

I also tried running this command: FEELnc_codpot.pl -i candidate_lncRNA.fa -a known_mRNA.fa --mode=shuffle and I get another error. It seems that I can only input GTF files rather than FASTA files.

Best,

yjliu

vwucher commented 7 years ago

Hi,

Sorry for the late reply... Can you please send us the error you get with the command: FEELnc_codpot.pl -i candidate_lncRNA.fa -a known_mRNA.fa --mode=shuffle Moreover, is your BioPerl version the right one? The problem may come from that.

Sorry again for the late reply, Valentin Wucher

Tichaboni commented 6 years ago

Hi @tderrien @yxlong032, I get the following error when running FEELnc_codpot.pl:
Your input GTF file 'Gallus_gallusannotation.gtf' contains 16354 transcripts
Undefined subroutine &Bio::DB::IndexedBase::_strip_crnl called at /usr/share/perl5/Bio/DB/Fasta.pm line 295.
The same question is asked above, but the suggestion provided doesn't work for me.

Tichaboni

vwucher commented 6 years ago

Hi Tichaboni,

It seems to me that the BioPerl libraries are either not installed or not reachable from Perl's library path. Can you check that BioPerl is installed and that it is in Perl's library path?

Thanks, Valentin Wucher

Tichaboni commented 6 years ago

@vwucher @tderrien @yxlong032 BioPerl is installed and is on the PATH (/home/new/perl5/bin). I updated BioPerl and was able to run codpot, but it threw another error:

The number of complete ORF found with computeORF mode is 0 transcripts... That's not enough to train the program

The suggested solution is to remove the index file and the chr prefix in the annotation and genome files and rerun codpot. I did that but still get the error. Would you please help me get past this error?

tderrien commented 6 years ago

Hello Tichaboni,

Could you indicate which versions of Perl and Bioperl are installed on your system? Best,

Thomas

Tichaboni commented 6 years ago

perl -MBio::Root::Version -e 'print $Bio::Root::Version::VERSION,"\n"'
1.006924
This is perl 5, version 18, subversion 2 (v5.18.2) built for darwin-thread-multi-2level

vwucher commented 6 years ago

Hi,

FEELnc has been tested with BioPerl version 1.6.924 ('Bioperl : tested with version BioPerl-1.6.924;'). So maybe the error comes from a different BioPerl version. Can you try to update BioPerl?

Thanks, Valentin

Tichaboni commented 6 years ago

Hi, That is the same version of bioperl I have installed.

vwucher commented 6 years ago

Sorry, I misread the version...

When you run the 'codpot' module, do you use a FASTA file, or a GTF file with a genome file? And can you try running 'codpot' with these two options: '--learnorftype=4 --testorftype=4' and tell us if you still have this error? These options make the 'codpot' module use the complete transcript sequence if it doesn't find any start/stop codon. Moreover, how many transcripts do you have in your input files? This error means that the program didn't find any start or stop codons in any of your sequences, which is unlikely to happen if you have a large number of transcripts.

Thanks, Valentin

Tichaboni commented 6 years ago

Hi, I am using the GTF file generated by the 'classifier' module as the input to the 'codpot' module. The annotation file is .gtf and the genome file is .fa. I have tried running 'codpot' with the options '--learnorftype=4 --testorftype=4' and it still throws the same error. I have over 170000 sequences in the input file. This is the command I am using:
FEELnc_codpot.pl -i FLcandidate_lncRNA.gtf -a Gallus_gallusannotation.gtf -b transcript_biotype=protein_coding -g galGal5genome.noCHR.fa --mode=shuffle --learnorftype=4 --testorftype=4

tderrien commented 6 years ago

Hi,

Could you please copy/paste the entire output following the FEELnc_codpot command line? And also the head of the 3 input files, to see whether the issue could be related to a formatting problem in the entries? Thank you very much

Tichaboni commented 6 years ago

The first part of the command produces:
Your input GTF file 'Gallus_gallusannotation.gtf' contains 16354 transcripts
Then, after running some more:
Extracting ORFs/cDNAs 15302/16354...
Extracted15302 ORFs/cDNAs sequences on 16354.
> The lncRNA training file is not set. get ORF/cDNAs for lncRNA by shuffling mRNA sequences.
It then does Extracting (40K) and Parsing, and gives the following output (it is very long, so I show only the last part):

ExtractFromFeature::feature2seq: your seq Gene_1079_1_CDS_137312242..>137312412:1-96 returns an empty string!...Check 'chr' prefix between your annotation and genome files or remove your genome index file ('galGal5genome.noCHR.fa.index')...
ExtractFromFeature::feature2seq: your seq Gene_7209_14_CDS_join(<13743127..13743168,13744149..13744370,13746294..13746415,13747736..13747878,13752967..13753142,13754332..13754423,13755591..13755697,13756501..13756649,13759102..13759305,13759460..>13759468):1-6 returns an empty string!...Check 'chr' prefix between your annotation and genome files or remove your genome index file ('galGal5genome.noCHR.fa.index')...
ExtractFromFeature::feature2seq: your seq Gene_7629_17_CDS_complement(8625721..>8625867):1-147 returns an empty string!...Check 'chr' prefix between your annotation and genome files or remove your genome index file ('galGal5genome.noCHR.fa.index')...
ExtractFromFeature::feature2seq: your seq Gene_4748_6_CDS_complement(join(11090774..11090853,11092246..11092413,11093107..>11093149)):1-78 returns an empty string!...Check 'chr' prefix between your annotation and genome files or remove your genome index file ('galGal5genome.noCHR.fa.index')...
ExtractFromFeature::feature2seq: your seq Gene_7479_16_CDS_complement(482862..>483050):1-239 returns an empty string!...Check 'chr' prefix between your annotation and genome files or remove your genome index file ('galGal5genome.noCHR.fa.index')...
The number of complete ORF found with computeORF mode is 0 transcripts...
That's not enough to train the program

head FLchcandidate_lncRNA.gtf

Gene_1178_1_CDS_167118885..>167119040 CLCBIO exon 1 149 0 +.";ne_id "Gene 1178"; transcript_id "Gene 1178.1
Gene_1178_1_CDS_167118885..>167119040 CLCBIO exon 1 149 0 +.";ne_id "Gene 1178"; transcript_id "Gene 1178.1
Gene_1178_1_CDS_167118885..>167119040 CLCBIO exon 1 149 0 +.";ne_id "Gene 1178"; transcript_id "Gene 1178.1
Gene_1178_1_CDS_167118885..>167119040 CLCBIO exon 1 105 0 +.";ne_id "Gene 1178"; transcript_id "Gene 1178.1
Gene_1178_1_CDS_167118885..>167119040 CLCBIO exon 1 156 0 +.";ne_id "Gene 1178"; transcript_id "Gene 1178.1
Gene_1178_1_CDS_167118885..>167119040 CLCBIO exon 150 211 0 +.";ne_id "Gene 1178"; transcript_id "Gene 1178.1
Gene_1178_1_CDS_167118885..>167119040 CLCBIO exon 150 211 0 +.";ne_id "Gene 1178"; transcript_id "Gene 1178.1
Gene_1178_1_CDS_167118885..>167119040 CLCBIO exon 212 373 0 +.";ne_id "Gene 1178"; transcript_id "Gene 1178.1
Gene_1178_1_CDS_167118885..>167119040 CLCBIO exon 212 373 0 +.";ne_id "Gene 1178"; transcript_id "Gene 1178.1
Gene_1178_1_CDS_167118885..>167119040 CLCBIO exon 377 451 0 +.";ne_id "Gene 1178"; transcript_id "Gene 1178.1

head Gallus_gallusannotation.gtf

#!genome-build Galgal4
#!genome-version Galgal4
#!genome-date 2011-11
#!genome-build-accession NCBI:GCA_000002315.2
#!genebuild-last-updated 2013-12
1 ensembl gene 1735 16308 . + . gene_id "ENSGALG00000009771"; gene_version "4"; gene_source "ensembl"; gene_biotype "protein_coding";
1 ensembl transcript 1735 16308 . + . gene_id "ENSGALG00000009771"; gene_version "4"; transcript_id "ENSGALT00000015891"; transcript_version "4"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding";
1 ensembl exon 1735 2449 . + . gene_id "ENSGALG00000009771"; gene_version "4"; transcript_id "ENSGALT00000015891"; transcript_version "4"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSGALE00000301221"; exon_version "1";
1 ensembl CDS 2379 2449 . + 0 gene_id "ENSGALG00000009771"; gene_version "4"; transcript_id "ENSGALT00000015891"; transcript_version "4"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSGALP00000015874"; protein_version "4";
1 ensembl start_codon 2379 2381 . + 0 gene_id "ENSGALG00000009771"; gene_version "4"; transcript_id "ENSGALT00000015891"; transcript_version "4"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding";

head galGal5genome.noCHR.fa

1
GCTCACCCCGGCTCCTCCTCCCACGCGTCATTTTAGTACCACTCCGTGGG
GGGAGTTGCCAGACGCTCCAACCTACAAAGAGCGCCTCGGACCCGCCGCA
GCTTCCGCCACGGTCCCCCGTGAGGTCCTCGCTAAGAAGAGCGCTCCGCC
GGCACCGCTGTGCCCTCAGCCCGGCCCttctctgcccctctccccgcACA
CGATGGGCGCAGTCCCGCACCTCCACGGTgctaagggagaaagagagcgG
CGGAGCCCTTCCCCGCCCCGAGAGGCGACGGCGCGCGAAGGAGACGAAGA
ACGGCAGTCTCAGAGAGGTAATGccggggggcagcgggaaagggccgagc
ggggaaaggagcccgagcccacgggaggggccccgggcGGTCCGCGCTGA
GGGCGGCGGTGCCGTCCCGCTGTGCCGCGAGCTGAGGCGGAGGGAAGAGA

tderrien commented 6 years ago

The main issue seems to be related to the fact that the chr. names in your "FLchcandidate.gtf" file do not match the chr. names in the genome file. How did you get the "FLchcandidate.gtf" file i.e using mapping-based (e.g cufflinks or stringtie) or read-assembly strategy (e.g Trinity)?
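The mismatch can be checked directly. The following is a sketch, assuming a tab-separated GTF and a FASTA whose sequence name is the first word after ">"; the function names are hypothetical helpers, not FEELnc commands. It lists the names used in column 1 of the GTF that do not exist in the genome file.

```shell
# Distinct sequence names in column 1 of a GTF (comment lines skipped).
gtf_seqnames() {
    awk -F'\t' '!/^#/ && NF >= 9 { print $1 }' "$1" | LC_ALL=C sort -u
}

# Distinct sequence names in a FASTA (first word of each header line).
fasta_seqnames() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1" | LC_ALL=C sort -u
}

# Names used in the GTF ($1) that are absent from the genome FASTA ($2).
missing_seqnames() {
    gtf_tmp=$(mktemp) && fa_tmp=$(mktemp) || return 1
    gtf_seqnames "$1" > "$gtf_tmp"
    fasta_seqnames "$2" > "$fa_tmp"
    LC_ALL=C comm -23 "$gtf_tmp" "$fa_tmp"
    rm -f "$gtf_tmp" "$fa_tmp"
}

# Usage: any output means codpot cannot extract those sequences.
#   missing_seqnames FLchcandidate_lncRNA.gtf galGal5genome.noCHR.fa | head
```

With the head output shown above, the first column of the candidate GTF holds values like Gene_1178_1_CDS_167118885..>167119040 while the genome uses names like 1, so every candidate name would be reported as missing.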

Tichaboni commented 6 years ago

I used CLC Genomics Workbench to generate "FLchcandidate.gtf" file. It uses Large gap mapper in its transcript discovery module.

tderrien commented 6 years ago

Hi,

OK, I didn't know this CLC Workbench solution, but it seems that the output .gtf file is somewhat misformatted. Could you check whether it returns a true GTF file and whether the chr. names match between this file and the genome reference file? An alternative (free) solution would be to map reads and reconstruct transcripts (and thus a correct GTF) using tophat2+cufflinks or Hisat+stringtie. HTH

Thomas

Tichaboni commented 6 years ago

Hi, @tderrien @vwucher

I went back and generated a new GTF file using Hisat2+cufflinks. This worked for 'codpot' but generated an error about kmer. I included the option -k 1,2,3,6,9 as suggested above and it worked like a charm. Thanks for the help. I now have a new issue with 'classifier', which I have opened as a separate issue.

vwucher commented 6 years ago

Hi,

Thanks for letting us know :).

zetamui commented 6 years ago

@Tichaboni @vwucher I also had the same error:

Extract ORFs/cDNAs for mRNAs from a GTF file
Parsing file 'annotation_chr38.gtf'...
Parse input file: [----------------------------------------------------------------------------------------------------]
Your input GTF file 'annotation_chr38.gtf' contains 254 transcripts
Undefined subroutine &Bio::DB::IndexedBase::_strip_crnl called at /usr/local/share/perl/5.22.1/Bio/DB/Fasta.pm line 295.

After reading @vwucher's comment about Perl's library path, I checked echo $PERL5LIB and it is ${FEELNCPATH}/lib.

I then checked all the Perl library paths with perl -e "print qq(@INC)" and got:

/usr/local/share/perl/5.22.1 /home/zeta/Utilities/FEELnc/lib/ /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.22.1 /usr/local/share/perl/5.22.1 /usr/lib/x86_64-linux-gnu/perl5/5.22 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.22 /usr/share/perl/5.22 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base .

So I checked if the module Bio::DB::IndexedBase can be accessed by perldoc -l Bio::DB::IndexedBase and as expected it returned:

/home/zeta/Utilities/FEELnc/lib/Bio/DB/IndexedBase.pm

That prompted me to check DB/IndexedBase.pm in the FEELnc directory for the subroutine _strip_crnl, and it couldn't be found. I went to the official BioPerl website to check the source of their IndexedBase.pm, and it does have _strip_crnl. I then found out that the shared BioPerl on our system actually has the correct IndexedBase.pm, so I exported the $PERL5LIB variable with the system path placed before the one provided by FEELnc:

export PERL5LIB=/usr/local/share/perl/5.22.1:${FEELNCPATH}/lib/

and it worked!! Now, I can finally continue with finding lncRNAs.......
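The same check can be scripted. This is a sketch with hypothetical helper names: the first function asks Perl which file it actually loads for a module, following the @INC order, and the second looks for a subroutine definition in that file.

```shell
# Print the file Perl loads for a module, following the @INC order.
# Fails with a Perl error if the module cannot be loaded at all.
module_path() {
    perl -e 'my $m = shift; (my $f = "$m.pm") =~ s{::}{/}g; require $f; print "$INC{$f}\n"' "$1"
}

# Succeed (exit status 0) if the given .pm file defines the named subroutine.
defines_sub() {
    grep -Eq "^[[:space:]]*sub[[:space:]]+$2([^A-Za-z0-9_]|\$)" "$1"
}

# Usage:
#   p=$(module_path Bio::DB::IndexedBase) && echo "loaded from: $p"
#   defines_sub "$p" _strip_crnl || echo "this IndexedBase.pm has no _strip_crnl"
```

If the reported path is inside ${FEELNCPATH}/lib and the subroutine is missing, the bundled copy is being picked up ahead of the system one, which is the situation described above.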

PS: I realized that not everyone has this problem; somehow they can use the global Perl library even if their echo $PERL5LIB is only ${FEELNCPATH}/lib. I am not a programmer and not familiar with Linux, so I'm not sure why.

Hope this helps, and hope the library included in the FEELnc/lib is updated in the git cloned version ASAP! It is a good tool especially for us who don't quite have a good non-coding set to use for training, thanks for making the tool.

tderrien commented 6 years ago

Thanks @zetamui for pointing this out! Actually, the _strip_crnl issue is related to BioPerl versions >=1.7, whereas the Bio::DB::IndexedBase bundled in FEELnc is from 1.6. This created conflicts when people used BioPerl >=1.7.

Yesterday I updated the README install instructions to avoid conflicts between the two versions:

export PERL5LIB=$PERL5LIB:${FEELNCPATH}/lib/ #order is important to avoid &Bio::DB::IndexedBase::_strip_crnl error with bioperl >=v1.7

A better way would be to remove the Bio::DB::IndexedBase from FEELnc. Note also that a Bioperl patch is available in order to avoid this _strip_crnl issue. Best,

Thomas

zetamui commented 6 years ago

Oh @tderrien Thank you very much for the explanation! Best, Zeta

brainfo commented 6 years ago

Same problem here, and it cannot be fixed by the methods above. Any advice or conclusion on how to solve this: Undefined subroutine &Bio::DB::IndexedBase::_strip_crnl called at /usr/local/share/perl/5.26.0/Bio/DB/Fasta.pm line 295. I've run cpanm Bio::DB::IndexedBase. Thanks!

brainfo commented 6 years ago

And how can we install a specific version of BioPerl? Thanks!

ginac commented 5 years ago

I came to this page because I had the error 'MSG: Failed validation of sequence '1'. Invalid characters', which appears in this thread. There was no problem with my files or indices. Introducing line breaks into the FASTA file solved the problem. I realized this when I isolated the sequence on which it was failing and ran the program again on that sequence alone. Perl then told me: 'MSG: Each line of the file must be less than 65,536 characters. Line 2 is 721461 chars.'
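For anyone hitting the same limit, a sketch of how the check and the fix could look with awk. The function names are hypothetical, and the default width of 60 is an arbitrary choice well below the 65,536-character limit quoted in the message.

```shell
# Length of the longest line in a file.
max_line_length() {
    awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }' "$1"
}

# Split every sequence line into chunks of at most W characters (default 60),
# leaving header lines untouched. Lines already shorter than W are not re-joined.
wrap_fasta() {
    awk -v w="${2:-60}" '
        /^>/ { print; next }
        { for (i = 1; i <= length($0); i += w) print substr($0, i, w) }
    ' "$1"
}

# Usage:
#   max_line_length genome.fa            # anything >= 65536 triggers the error
#   wrap_fasta genome.fa > genome.wrapped.fa
```

After wrapping, any existing genome index file should presumably be removed so that it is rebuilt from the new file.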