tderrien / FEELnc

FEELnc : FlExible Extraction of LncRNA
GNU General Public License v3.0

FEELnc_codpot.pl stops due to NA values in _RF_learningData.txt #59

Closed laurafabre closed 1 year ago

laurafabre commented 1 year ago

Hi there, I'm facing a problem with 'FEELnc_codpot.pl'. I gave it 2 FASTA files (prot_coding.fasta and unknown_to_model.fasta), and I noticed that the generated '{sample}_RF_learningData.txt' file contains several NA values at the end of the last line. This may or may not be the cause of the error. This is the message I got:

Extract ORFs/cDNAs for mRNAs from a FASTA file
Extract ORF/cDNA from fasta file 'fus/lncRNA_prediction/feelnc/fus_prot_cod_genes_no_amb_nucl.fasta'..
Extracting ORFs/cDNAs 2640057/2651283...
Extracted '2640057' ORF/cDNAs sequences on '2651283'.
The lncRNA training file is not set. Get ORFs/cDNAs for lncRNAs by shuffling mRNA sequences
Extracting ORFs/cDNAs 43196/44016...
Extracted '43196' ORF/cDNAs sequences on '44016'.
Extract ORFs/cDNAs for candidates RNAs from a FASTA file
Extract ORF/cDNA from fasta file 'fus/lncRNA_prediction/feelnc/fus_unknown_and_antisense_transcripts_renamed_longer200_no_amb_nucl.fasta'..
Extracting ORFs/cDNAs 2957/2957...
Extracted '2957' ORF/cDNAs sequences on '2957'.
Run random Forest on '/tmp//1025736_fus_unknown_and_antisense_transcripts_renamed_longer200_no_amb_nucl.fasta.test_rna.fa'

  1. Compute the size of each sequence and ORF
  2. Compute the kmer ratio for each kmer and put the output file name in a list
  3. Compute the kmer score for each kmer size on learning and test ORF
  4. Merge the score and size files into one file for each type
  5. Make the model on learning sequences and apply it on test sequences

Warning messages:
1: package ‘ROCR’ was built under R version 4.0.5
2: package ‘randomForest’ was built under R version 4.0.5
Warning messages:
1: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : EOF within quoted string
2: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : number of items read is not a multiple of the number of columns
Running 10-fold cross-validation on learning
| | 0%
Error in randomForest.default(x = dat[-chunk[[n]], dat.featID], y = as.factor(dat[-chunk[[n]], : NA not permitted in predictors
Calls: randomForest -> randomForest.default
Execution halted
Parsing random forest output: random forest output file 'fus/lncRNA_prediction/feelnc/feelnc_codpot_out//fus_unknown_and_antisense_transcripts_renamed_longer200_no_amb_nucl.fasta_RF.txt' is empty... exiting

And this is the tail of the '{sample}_RF_learningData.txt' file:

FUN_014688-T1_CDS=435-953_loc:Fus7_91967493-1968731+_exons:1967493-1968731_segs:1-1239_product=hypothetical-perm3 0.500087 0.500149 0.487164 0.469400 0.460491 0.493747 0.200968523002421 1239 NA NA NA NA NA NA NA NA 0

I don't know what I should check or how I can fix it. Laura
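Since the learning-data file is too large to check by eye, one way to locate the affected rows is a small scan like the sketch below. It is only an illustration and not part of FEELnc: the function name `find_suspect_rows` is hypothetical, and it assumes the file is whitespace-delimited with sequence IDs that contain no spaces. It reports rows that contain an `NA` field, rows whose field count differs from the first line, and rows containing a quote character.

```python
import io

QUOTE_CHARS = ("'", '"')


def find_suspect_rows(handle, na_token="NA"):
    """Return (line_number, reasons) for rows that may break downstream parsing.

    Assumption: fields are separated by whitespace and the first non-blank
    line has the expected number of fields.
    """
    suspects = []
    expected_fields = None
    for line_number, line in enumerate(handle, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if expected_fields is None:
            expected_fields = len(fields)  # first non-blank line sets the width
        reasons = []
        if len(fields) != expected_fields:
            reasons.append("field count")
        if na_token in fields:
            reasons.append("NA value")
        if any(q in line for q in QUOTE_CHARS):
            reasons.append("quote character")
        if reasons:
            suspects.append((line_number, reasons))
    return suspects


if __name__ == "__main__":
    # Tiny in-memory example; replace with open("<your learning data file>")
    demo = io.StringIO("name\tk1\tk2\tlabel\nseqA\t0.5\t0.4\t1\nseqB\tNA\tNA\t0\n")
    for line_number, reasons in find_suspect_rows(demo):
        print(line_number, ", ".join(reasons))
```

The first reported line number is usually the most informative one, because a parsing problem earlier in a file can shift every row after it.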

laurafabre commented 1 year ago

Or maybe it is an issue with the library versions: R version 4.0.3 (2020-10-10) -- "Bunny-Wunnies Freak Out"

library(ROCR)
Warning message:
package ‘ROCR’ was built under R version 4.0.5
library(randomForest)
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
Warning message:
package ‘randomForest’ was built under R version 4.0.5

laurafabre commented 1 year ago

Merge the score and size files into one file for each type

  1. Make the model on learning sequences and apply it on test sequences

The threshold for the voting in random forest is not defined. Use 10-fold cross-validation to determine the best threshold.
Warning message:
package ‘gplots’ was built under R version 4.0.5
Warning messages:
1: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : EOF within quoted string
2: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : number of items read is not a multiple of the number of columns
Running 10-fold cross-validation on learning
| | 0%
Error in randomForest.default(x = dat[-chunk[[n]], dat.featID], y = as.factor(dat[-chunk[[n]], : NA not permitted in predictors
Calls: randomForest -> randomForest.default
Execution halted
Parsing random forest output: random forest output file './feelnc_codpot_out/fus_unknown_and_antisense_transcripts_renamed_longer200_no_amb_nucl.fasta_RF.txt' is empty... exiting

tderrien commented 1 year ago

Hi Laura,

Thanks for using FEELnc!

As you mentioned, it is hard to tell whether the issue comes from the ROCR version (and thus the FEELnc installation) or from the input files. To easily check the first possibility, could you try to install FEELnc using the dedicated conda env and relaunch it? If you still get the error, could you send us the exact command line you used for FEELnc_codpot.pl and (a part of) the input .fasta files?

Best,

Thomas

laurafabre commented 1 year ago

Hi Thomas, I was already using the dedicated conda env when I got the error. I tried many times with the FASTA files, and since I kept getting the same error I changed my approach: I used the GTF files instead, and that worked as expected. I don't know what could be wrong in a FASTA file to make the run fail. The file was too big to inspect manually; I deleted the ambiguous nucleotides, but the error was still there.

Thanks for your time! Laura
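The thread does not establish the root cause, but the R warning "EOF within quoted string" is what `scan` emits when a quote opened in the input is never closed. With its default settings, R's `read.table` treats both `'` and `"` as quote characters and `#` as a comment character. Whether FEELnc's R script reads its tables with those defaults is an assumption here, so the sketch below is only a quick way to test the hypothesis on a large FASTA file: it lists headers containing any of those characters. The name `risky_fasta_headers` is hypothetical and not part of FEELnc.

```python
import io

# Characters that R's read.table handles specially by default.
RISKY_CHARS = {"'", '"', "#"}


def risky_fasta_headers(handle, risky=RISKY_CHARS):
    """Return (line_number, header, found_chars) for FASTA headers
    that contain any character from `risky`."""
    hits = []
    for line_number, line in enumerate(handle, start=1):
        if not line.startswith(">"):
            continue  # sequence lines are ignored
        header = line[1:].rstrip("\r\n")
        found = sorted(c for c in risky if c in header)
        if found:
            hits.append((line_number, header, found))
    return hits


if __name__ == "__main__":
    # Tiny in-memory example; replace with open("<your FASTA file>")
    demo = io.StringIO(">gene1 product=ok\nACGT\n>gene2 3'-UTR\nACGT\n")
    for line_number, header, found in risky_fasta_headers(demo):
        print(line_number, header, "".join(found))
```

If such headers turn up, renaming them before running FEELnc_codpot.pl (for example, replacing the flagged characters with underscores) would be one way to test whether they are the trigger. That the GTF-based run worked is consistent with this idea but does not prove it.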