rajewsky-lab / mirdeep2

Discovering known and novel miRNAs from small RNA sequencing data
GNU General Public License v3.0

miRDeep2.pl line 363 issue with ARF format #77

Closed ConYel closed 3 years ago

ConYel commented 3 years ago

Hello, and thank you for this very useful tool! I came across the following issue in the log file after running miRDeep2.pl:

#Starting miRDeep2
/opt/mirdeep2/bin/miRDeep2.pl my_data/collapsed_trimmed_fasta/all_samples_collapsed.fasta my_data/genome_data/genome_sm.fa my_data/collapsed_trimmed_fasta/all_samples_collapsed.arf none my_data/all_genome_data/RNAcentral/my_species_RNAcentral_miRNA.fasta none

miRDeep2 started at 14:6:27

mkdir mirdeep_runs/run_09_05_2021_t_14_06_27

readline() on closed filehandle IN at /opt/mirdeep2/bin/miRDeep2.pl line 363.
Use of uninitialized value in pattern match (m//) at /opt/mirdeep2/bin/miRDeep2.pl line 363.
Error: Mapping file my_data/collapsed_trimmed_fasta/all_samples_collapsed.arf is not in arf format

Each line of the mapping file must consist of the following fields
readID_wo_whitespaces  length  start  end read_sequence genomicID_wo_whitspaces length  start   end     genomic_sequence  strand  #mismatches editstring
The editstring is optional and must not be contained
The readID must end with _xNumber and is not allowed to contain whitespaces.
The genomeID is not allowed to contain whitespaces.
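Perl prints `readline() on closed filehandle` when a script reads from a handle whose `open` did not succeed, so one possible cause (not confirmed in this thread) is that the mapping file path could not be opened at all, rather than a malformed line. The Python sketch below is a hypothetical pre-flight check, not part of miRDeep2: the function name `unreadable_inputs` is made up for illustration, and skipping the literal `none` is an assumption based on how the command above uses it as a placeholder.

```python
import os


def unreadable_inputs(paths):
    """Return the paths that are not existing, readable regular files.

    The literal string "none" is skipped (assumption: it is only a
    placeholder for optional arguments, as in the command above).
    """
    return [
        p for p in paths
        if p != "none" and not (os.path.isfile(p) and os.access(p, os.R_OK))
    ]
```

Calling `unreadable_inputs([...])` with the same arguments that are passed to miRDeep2.pl returns an empty list when every file can be read. Any path it returns is worth comparing with the output path given to the mapping step.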

I followed the instructions, first running mapper.pl to get an ARF file from my collapsed fasta:

mapper.pl my_data/collapsed_trimmed_fasta/all_samples_collapsed.fasta -c -p my_data/all_genome_data/my_species_BOWTIE_index/genome.v1 -o 2 -l 13 -t my_data/collapsed_trimmed_fasta/all_samples_collapsed_mapped.arf -v

A minimal example of the file I got from mapper.pl:

head my_data/collapsed_trimmed_fasta/all_samples_collapsed_mapped.arf 
pli_4108859_x2572844    68  1   68  gtatggatcgaggtgtcgtgcacagtagagcagcatcttcgctgcgatctgctgaaactaatcctttc    Scaffold_2421   68  11467   11534   gtatggatcgaggtgtcgtgcacagtagagcagcatcttcgctgcgatctgctgaaactaatccgttc    +   1   mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmMmmm
pli_6681703_x2424514    68  1   68  tatggatcgaggtgtcgtgcacagtagagcagcatcttcgctgcgatctgctgaaactaatcctttcg    Scaffold_2421   68  11468   11535   tatggatcgaggtgtcgtgcacagtagagcagcatcttcgctgcgatctgctgaaactaatccgttcg    +   1   mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmMmmmm
pli_9106217_x681924 45  1   45  cgctgcgatctgctgaaactaatcctttcgatccgagggatttgt   Scaffold_2421   45  11506   11550   cgctgcgatctgctgaaactaatccgttcgatccgagggatttgt   +   1   mmmmmmmmmmmmmmmmmmmmmmmmmMmmmmmmmmmmmmmmmmmmm
pli_9788141_x535504 22  1   22  aggcaagatgttggcatagctg  Scaffold_1721   22  17920376    17920397    aggcaagatgttggcatagctg  -   0   mmmmmmmmmmmmmmmmmmmmmm
pli_10724122_x377557    40  1   40  cgatctgctgaaactaatcctttcgatccgagggatttgt    Scaffold_2421   40  11511   11550   cgatctgctgaaactaatccgttcgatccgagggatttgt    +   1   mmmmmmmmmmmmmmmmmmmmMmmmmmmmmmmmmmmmmmmm
pli_13143404_x291219    23  1   23  aggcaagatgttggcatagctgt Scaffold_1721   23  17920375    17920397    aggcaagatgttggcatagctgt -   0   mmmmmmmmmmmmmmmmmmmmmmm
pli_15148372_x183075    22  1   22  tattgcactcgtcccggccagt  Scaffold_2100   22  60553837    60553858    tattgcactcgtcccggccagt  +   0   mmmmmmmmmmmmmmmmmmmmmm
pli_15148372_x183075    21  1   21  tattgcactcgtcccggccag   Scaffold_2100   21  60551361    60551381    tattgcactcgtcccggcctg   +   1   mmmmmmmmmmmmmmmmmmmMm
pli_15148372_x183075    21  1   21  tattgcactcgtcccggccag   Scaffold_2100   21  60552881    60552901    tattgcactcgtcccggcctg   +   1   mmmmmmmmmmmmmmmmmmmMm
pli_12495405_x328387    22  1   22  tattgcactcgtcccggcctgc  Scaffold_2100   22  60552881    60552902    tattgcactcgtcccggcctgc  +   0   mmmmmmmmmmmmmmmmmmmmmm

Regarding the whole installation, I'm using a docker container made from the Dockerfile attached.

A similar issue seems to have been opened before: https://github.com/rajewsky-lab/mirdeep2/issues/74#issue-754866169 but there was no follow-up from that person, so I thought I'd open a new one in case it could be used for future reference.

Thank you for your time and effort.

Dockerfile.txt

EDIT:

I ran the pipeline again, but only with the "head" of the .arf file:

head my_data/collapsed_trimmed_fasta/all_samples_collapsed_mapped.arf > my_data/collapsed_trimmed_fasta/test.arf

and this time I got a different error:

#testing input files
started: 7:39:03
sanity_check_mature_ref.pl my_data/my_species_data/RNAcentral/close_related_species_RNAcentral_miRNA.fasta

ended: 7:39:03
total:0h:0m:0s

sanity_check_reads_ready_file.pl my_data/collapsed_trimmed_fasta/all_samples_collapsed.fasta

started: 7:39:03
Error: problem with my_data/collapsed_trimmed_fasta/all_samples_collapsed.fasta
Error in line 21: Either the sequence

GGTGTTGTAGGCTT

contains less than 17 characters or contains characters others than [acgtunACGTUN]

Please make sure that your file only comprises sequences that have at least 17 characters

containing letters [acgtunACGTUN]

I will remove sequences shorter than 17 and try again, but the first error message about the .ARF file seems a bit strange.
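A minimal Python sketch of that filtering step, assuming the rule quoted in the error message (at least 17 characters, only letters from `[acgtunACGTUN]`). The function name `filter_fasta` and the read IDs in the usage note are hypothetical. This is not how miRDeep2 itself performs the check.

```python
import re


def filter_fasta(lines, min_len=17):
    """Return (header, sequence) records that pass the length and alphabet check.

    Sequences split over several lines are joined before being tested.
    """
    ok = re.compile(r"[acgtunACGTUN]{%d,}" % min_len)
    records = []
    header, chunks = None, []
    # A trailing ">" acts as a sentinel so the last record is flushed too.
    for line in list(lines) + [">"]:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            seq = "".join(chunks)
            if header is not None and ok.fullmatch(seq):
                records.append((header, seq))
            header, chunks = line, []
        else:
            chunks.append(line)
    return records
```

For example, `filter_fasta([">seq_1_x5", "GGTGTTGTAGGCTT", ">seq_2_x3", "aggcaagatgttggcatagctg"])` keeps only the second record, because the first sequence has 14 characters. Any ARF file derived from the unfiltered reads would presumably need to be regenerated afterwards.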

ConYel commented 3 years ago

After I removed sequences shorter than 18 from the collapsed fasta file and reran the mapping to get the .ARF file, miRDeep2.pl ran without issues. Would it be possible to provide a dedicated error message for this particular case? The current message is a bit misleading:

Error: Mapping file my_data/collapsed_trimmed_fasta/all_samples_collapsed.arf is not in arf format

Maybe it could go somewhere alongside this message?

Each line of the mapping file must consist of the following fields
readID_wo_whitespaces length start end read_sequence genomicID_wo_whitspaces length start end genomic_sequence strand #mismatches editstring
The editstring is optional and must not be contained
The readID must end with _xNumber and is not allowed to contain whitespaces.
The genomeID is not allowed to contain whitespaces.

Thank you very much, again great tool!

Drmirdeep commented 3 years ago

Hi, the two errors don't seem to be related.

To debug, we would need to find the line that is not parsed properly. Maybe you could run this in a console

perl -ane 'if($_ !~ /^(\S+_x\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+([+-])\s+(\d+)\s*([mDIM]*)$/){print;}' YOUR_ARF_FILE

and post some of the lines here. (Special characters may need to be retyped, depending on your console.)
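For anyone who prefers Python, the same check could be sketched as follows. It assumes the field layout listed in the error message and treats the editstring as optional and possibly several characters long, as the sample lines in this thread suggest. The names `ARF_LINE` and `bad_arf_lines` are made up for illustration and do not come from miRDeep2.

```python
import re

# Thirteen whitespace-separated fields; the trailing editstring is optional.
ARF_LINE = re.compile(
    r"^(\S+_x\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+"
    r"(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+"
    r"([+-])\s+(\d+)\s*([mDIM]*)$"
)


def bad_arf_lines(lines):
    """Return (line_number, line) pairs that do not match the ARF layout."""
    bad = []
    for number, line in enumerate(lines, 1):
        text = line.rstrip("\n")
        if not ARF_LINE.match(text):
            bad.append((number, text))
    return bad
```

A possible way to use it is `with open(path) as handle: print(bad_arf_lines(handle))`, where `path` is the ARF file. If the file cannot be opened, `open` raises an error instead of reporting a format problem.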

Cheers

ConYel commented 3 years ago

Hello, unfortunately I'm not able to reproduce it. It always gives:

Error: problem with my_data/collapsed_trimmed_fasta/all_samples_collapsed.fasta
Error in line 21: Either the sequence

GGTGTTGTAGGCTT

contains less than 17 characters or contains characters others than [acgtunACGTUN]

Please make sure that your file only comprises sequences that have at least 17 characters

containing letters [acgtunACGTUN]

I don't know exactly where the issue with the ARF file was, since I generated it with mapper.pl. Anyway, I'll keep an eye out in case I get that particular error again.

Thank you!!