rajewsky-lab / mirdeep2

Discovering known and novel miRNAs from small RNA sequencing data
GNU General Public License v3.0

miRDeep2_core_algorithm.pl error #79

Closed (banedeus closed this issue 3 years ago)

banedeus commented 3 years ago

I ran the following command:

miRDeep2.pl reads_collapsed.fa GCA_018258275.1_ASM1825827v1_genomic.fa reads_collapsed_vs_genome.arf mature.fa annotations_16266764565788.fasta hairpin.fa

Everything was going fine until I saw an error. First it said:

Negative repeat count does nothing at /home/bane/anaconda3/bin/quantifier.pl line 1312, line 544.

Then

Use of uninitialized value in numeric le (<=) at /home/bane/anaconda3/bin/select_for_randfold.pl line 703, line 60157.

Then it continues like this:

Use of uninitialized value in numeric le (<=) at /home/bane/anaconda3/bin/miRDeep2_core_algorithm.pl line 1148, line 68767.
Use of uninitialized value in numeric le (<=) at /home/bane/anaconda3/bin/miRDeep2_core_algorithm.pl line 1174, line 68767.
Use of uninitialized value in numeric le (<=) at /home/bane/anaconda3/bin/miRDeep2_core_algorithm.pl line 1148, line 68767.

This goes on for almost 400,000 lines.

I could not upload report.log here (I guess because the file is 57 MB), but here is a link to it in case you want to check it. It will be stored there for 30 days from now.

https://easyupload.io/hxygao

What could be the issue, and how can I solve it?

mschilli87 commented 3 years ago

What OS is that on? How did you install miRDeep2? Can you successfully run the tutorial with your installation?

banedeus commented 3 years ago

I'm using Ubuntu 20. I installed through conda, and yes, I completed the tutorial successfully.

mschilli87 commented 3 years ago

First, be warned that the conda package is not officially supported or maintained by anyone related to miRDeep2. However, if the tutorial runs, the error is likely due to your input data. I'd recommend creating a minimal input dataset that can reproduce the error. Oftentimes, just doing so will give you an idea what might be going wrong. Otherwise, it will at least enable others to reproduce your issue and help you fix it. Try using just ten (random) reads or so and see if you get the error. If not, you can start with a larger chunk and narrow it down. Worst case, you'll end up with a binary search starting with half the data until you get it down to a handful of reads. Once we can look at your data and the exact command you are using, troubleshooting this should be much easier. :wink:
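A minimal sketch of the "ten (random) reads" step, in Python. It assumes a plain FASTQ file with exactly four lines per read (no wrapped sequences); the function names and the seed are made up for illustration and are not part of miRDeep2.

```python
import random


def parse_fastq(lines):
    """Group FASTQ lines into 4-line records (header, sequence, '+', qualities).

    Assumption: every read occupies exactly four lines.
    """
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ line count is not a multiple of 4")
    records = [tuple(lines[i:i + 4]) for i in range(0, len(lines), 4)]
    for header, _seq, plus, _qual in records:
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
    return records


def sample_reads(records, n, seed=0):
    """Pick n random records (reproducibly), keeping their original order."""
    rng = random.Random(seed)
    n = min(n, len(records))
    picked = sorted(rng.sample(range(len(records)), n))
    return [records[i] for i in picked]


def to_fastq(records):
    """Serialize records back into FASTQ text."""
    return "".join("\n".join(rec) + "\n" for rec in records)
```

To use it, read the lines of the original FASTQ, pass them through parse_fastq and sample_reads, and write the result of to_fastq to a new file that is then fed to the same mapper.pl and miRDeep2.pl commands. Fixing the seed makes the subset reproducible, so others can regenerate the same ten reads.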

banedeus commented 3 years ago

Okay, thank you so far. I will try it as soon as possible and post my workflow here.

Just out of curiosity, what might be the problem? What could be wrong with my data? Do you have any idea? And please explain it simply. I am a newbie in bioinformatics.

banedeus commented 3 years ago

Okay, so I tried with 15,000 lines of my FASTQ data and the result is the same. Here is what I did. First:

mapper.pl  SRA.fastq  -e -i -h -j -k AGATCGGAAGAG -l 18 -m -p index_2 -s reads_collapsed.fa -t reads_collapsed_vs_genome.arf -v -u

Output:

parsing fastq to fasta format
converting rna to dna alphabet
discarding sequences with non-canonical letters
clipping 3' adapters
discarding short reads
collapsing reads
mapping reads to genome index
trimming unmapped nts in the 3' ends
Log file for this run is in mapper_logs and called mapper.log_64510
Mapping statistics

#desc   total   mapped  unmapped  %mapped  %unmapped
total:  4214    1826    2388      43.332   56.668
seq:    4214    1826    2388      43.332   56.668

Second:

quantifier.pl -p hairpin.fa -m mature.fa -r reads_collapsed.fa index_2 -y 16_19

Output:

getting samples and corresponding read numbers

Converting input files
building bowtie index
mapping mature sequences against index
mapping read sequences against index
Mapping statistics

#desc   total   mapped  unmapped  %mapped  %unmapped
total:  4183    12      4171      0.287    99.713
seq:    4183    12      4171      0.287    99.713
analyzing data

0 mature mappings to precursors

Expressed miRNAs are written to expression_analyses/expression_analyses_1626760028/miRNA_expressed.csv
    not expressed miRNAs are written to expression_analyses/expression_analyses_1626760028/miRNA_not_expressed.csv

Creating miRBase.mrd file

Mapped READS readin - DONE 

make_html2.pl -q expression_analyses/expression_analyses_1626760028/miRBase.mrd -k mature.fa -y 1626760028  -o -i expression_analyses/expression_analyses_1626760028/mature.fa_mapped.arf  -l  -M miRNAs_expressed_all_samples_1626760028.csv  
miRNAs_expressed_all_samples_1626760028.csv file with miRNA expression values
parsing miRBase.mrd file finished
creating PDF files

Warning: 0 mature sequences mapped to any of your given precursor sequences

Note: I assume the tool could not find any mature sequences because my dataset is very small in this case.

Third:

miRDeep2.pl reads_collapsed.fa GCA_018258275.1_ASM1825827v1_genomic.fa reads_collapsed_vs_genome.arf mature.fa annotations_16266764565788.fasta hairpin.fa 2> report.log

Output:

#####################################
#                                   #
# miRDeep2.0.1.3                   #
#                                   #
# last change: 08/11/2019           #
#                                   #
#####################################

miRDeep2 started at 8:48:27

#Starting miRDeep2
#testing input files
#Quantitation of known miRNAs in data
#parsing genome mappings
#excising precursors
#preparing signature
#folding precursors
#computing randfold p-values
#running miRDeep core algorithm

Here is my report.txt

If you need any further information just let me know

report.txt


edit (@mschilli87): formatting

mschilli87 commented 3 years ago

I suggested

... I'd recommend creating a minimal input dataset that can reproduce the error. [...] Try using just ten (random) reads or so and see if you get the error. ...

Now you say

Okay, so I tried with 15,000 lines of my FASTQ data and the result is the same.

:confused:

The point is that the computer program fails on your data, so I want to be able to follow the entire algorithm in my brain to see where it fails. For this I need to actually see the input data. But please don't share 1,500 times more data than I asked for and have me narrow it down.


Just out of curiosity, what might be the problem? What could be wrong with my data? Do you have any idea? And please explain it simply. I am a newbie in bioinformatics.

Hard to tell without seeing the data. Many things could be wrong. Check past issues for examples. If even a small (much smaller than thousands of lines) input triggers the issue, chances are you specified a wrong flag or there is an issue with the format of your data (e.g. you telling miRDeep to expect a FASTA but using a FASTQ). Regardless, guessing is a waste of time. Please just narrow it down step by step until we can actually understand what is happening.
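For the FASTA-versus-FASTQ example, a quick sanity check could look like the sketch below. It only inspects the first non-empty line ('>' for FASTA, '@' for FASTQ) and says nothing about which flags a given miRDeep2 script expects; the function name is made up for illustration.

```python
def guess_sequence_format(text):
    """Guess 'fasta' or 'fastq' from the first non-empty line of a file's text.

    FASTA records start with '>', FASTQ records start with '@'.
    Anything else is reported as 'unknown'; empty input as 'empty'.
    """
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            return "fasta"
        if line.startswith("@"):
            return "fastq"
        return "unknown"
    return "empty"
```

Running this on the first few lines of each input file before starting a pipeline run is a cheap way to rule out one class of format mix-ups.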

banedeus commented 3 years ago

Okay, I am deeply sorry for what I did. Here are my FASTQ file and my report.log.

I ran the same commands as above on 10 random reads of my data, but this time nothing went wrong. So I will also include the 15,000 lines of data that gave an error last time.

I also renamed my data files from .fastq to .txt in order to upload them here.

data.txt

report_of_10.log

SRA.txt

mschilli87 commented 3 years ago

Just quoting myself again:

... Try using just ten (random) reads or so and see if you get the error. If not, you can start with a larger chunk and narrow it down. Worst case, you'll end up with a binary search starting with half the data until you get it down to a handful of reads. ...

:wink:

banedeus commented 3 years ago

I understand your point of view, but I have 45 million reads. I started with ten, and now I am going to try a thousand reads. I saw that I get an error at around 4 thousand reads. What I do not understand is whether the problem is related to the number of reads or to some "bad" reads in my data. Simply put, how am I going to detect the "problematic" reads in my data? I also have 4 datasets that I have to process just like this one. It will take forever.

mschilli87 commented 3 years ago

What do you expect? Somebody else doing this work for you? How long does running your script take with 4000 reads that reproduce the error? Split this file in half and try both halves. Do you get the error for none, both, or only one of the halves? If none, it appears to be an issue with the number of reads. I'd continue trying 3000 next, then 2500 or 3500 depending on whether you get the error with 3000 or not, and so on. log2(4000) < 12, so by binary search you should be able to get the exact number of reads you need to cause the issue by running miRDeep2 12 times. That's hardly going to take 'forever'. If you get the error with your 2000 reads set in step one, you can apply the same strategy to find 'the read' causing the issue. Once you do, check whether using just this one read causes the error. Then you can share the data here and maybe someone can help you.
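The bisection described above could be sketched in Python as follows. The `fails` argument is a hypothetical callback that writes the given subset of reads to a file, runs the pipeline on it, and returns True if the error shows up; it is not something miRDeep2 provides. The sketch assumes that an empty input does not fail and that once a prefix of the reads fails, every longer prefix fails too.

```python
def smallest_failing_prefix(reads, fails):
    """Return the smallest k such that fails(reads[:k]) is True, or None.

    Assumptions: fails([]) would be False, and failure is monotonic in the
    prefix length (either a read count threshold or a single bad read).
    """
    if not fails(reads):
        return None  # the full set does not reproduce the error
    lo, hi = 0, len(reads)  # invariant: reads[:lo] passes, reads[:hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(reads[:mid]):
            hi = mid
        else:
            lo = mid
    return hi
```

If the result is k, then the first k reads fail and the first k - 1 do not. Running the pipeline on reads[k - 1] alone then distinguishes the two cases from the comment above: if that single read fails on its own, it is the candidate "bad" read; if it passes, the problem looks related to the number of reads instead.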

mschilli87 commented 3 years ago

@banedeus: After formatting your earlier post, I noticed this comment of yours that was previously hiding in the log output:

Note: I assume the tool could not find any mature sequences because my dataset is very small in this case.

This is also something you could verify. Instead of assuming (i.e. guessing), why not simply add a positive control to your input? Simply use your data plus five or so reads mapping to miRNAs (e.g. from the tutorial data, which already worked for you).

Drmirdeep commented 3 years ago

1.) search in the issues section for 'Negative repeat'

This happened already a couple of times and got solved. The problem was caused by inconsistent files. A way to get proper input files is described here

https://drmirdeep.github.io/mirdeep2_tutorial.html

2.) Furthermore, it is unclear what index_2 is doing in this command

quantifier.pl -p hairpin.fa -m mature.fa -r reads_collapsed.fa index_2 -y 16_19

3.) If the tutorial runs fine, then it definitely has to do with your input files. Compare the format of the tutorial files with yours and you will find the issues.

4.) The files you uploaded are not useful. Further, your file names are completely uninformative, so we cannot guess what is inside them and what is not.

Please run a head -n20 on each of these files and post it here

reads_collapsed.fa
GCA_018258275.1_ASM1825827v1_genomic.fa
reads_collapsed_vs_genome.arf
mature.fa # mature_ref_miRNAs.fa ?
annotations_16266764565788.fasta # mature_other_miRNAs.fa ?
hairpin.fa # hairpin_ref_miRNAs ?
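One possible kind of inconsistency between input files can be checked mechanically; the sketch below is only an illustration and not necessarily the cause in this thread. It assumes that the .arf file is tab-separated with the read identifier in its first column, and that each identifier should also appear as a header in the collapsed reads FASTA. The function names and the toy identifiers used to try it out are made up.

```python
def fasta_ids(fasta_text):
    """Return the set of identifiers (first token after '>') in FASTA text."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def arf_read_ids(arf_text):
    """Return the set of read identifiers in ARF text.

    Assumption: tab-separated lines, read identifier in column 1.
    """
    ids = set()
    for line in arf_text.splitlines():
        if line.strip():
            ids.add(line.split("\t")[0])
    return ids


def reads_missing_from_fasta(arf_text, reads_fasta_text):
    """List read IDs that occur in the ARF but not in the reads FASTA."""
    return sorted(arf_read_ids(arf_text) - fasta_ids(reads_fasta_text))
```

An empty result only means this particular check passed. A non-empty result suggests the two files were not produced by the same mapper.pl run and should be regenerated together.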

Drmirdeep commented 3 years ago

Warning: 0 mature sequences mapped to any of your given precursor sequences

0 => zero => None of your mature sequences could be mapped to sequences in your precursor files
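A rough way to look into this warning independently of the pipeline is to test whether each mature sequence occurs as an exact substring of any precursor sequence. The log above shows quantifier.pl mapping with a bowtie index, so exact substring matching is a simplification and only a sketch. The function names and the toy sequences used to try it out are made up.

```python
def parse_fasta(text):
    """Parse FASTA text into {identifier: sequence} (first header token as ID)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def normalize(seq):
    """Uppercase and convert the RNA alphabet to DNA so U/T do not mismatch."""
    return seq.upper().replace("U", "T")


def matures_found_in_precursors(mature_text, hairpin_text):
    """Map each mature ID to True if its sequence is a substring of a precursor."""
    precursors = [normalize(s) for s in parse_fasta(hairpin_text).values()]
    found = {}
    for name, seq in parse_fasta(mature_text).items():
        mature = normalize(seq)
        found[name] = bool(mature) and any(mature in p for p in precursors)
    return found
```

If every value comes back False, the mature and hairpin files most likely do not belong together (for example, different species or mismatched identifiers), which would be consistent with the "0 mature mappings to precursors" line in the quantifier.pl output.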