marsicoLab / PROmiRNA

miRNA promoter annotation based on deepCAGE data

Problem while running ./PROmiRNA for mouse miRNA TSS prediction: string_base.h:448 Assertion failed #1

Open YucanChen opened 3 years ago

YucanChen commented 3 years ago

Hello, I want to use PROmiRNA for mouse miRNA TSS prediction. While running the program, I ran into the problem shown below. I wonder whether it is caused by using the wrong -s file (gene start region GFF), as I have not found a detailed description of this file in the README. I tried "Mus_musculus.GRCm38.101.gff3" and "mus_musculus.GRCm38.Regulatory_Build.regulatory_features.20180516.gff"; both failed.

The error output is listed below:

$./PROmiRNA -g ../external_data/mm10.fa -c ../external_data/Mus_musculus.GRCm38.101.gtf -s ../external_data/Mus_musculus.GRCm38.101.gff -r ../external_data/mm10_repeats.bed -a ../external_data/mmu.gff3 -m ../external_data/mirna.txt -n ../external_data/mirna_context.txt -p ../external_data/TATA_box_jaspar.psem -w ../external_data/mm10.60way.phastCons.wig -i ../external_data/bed_files/ -t 16

Starting miRNA promoter prediction
Number of miRNAs for analysis: 1226
Number of overlaps between tags and miRNAs: ../external_data/bed_files/mm10_fair_new_CAGE_peaks_phase1and2.bed 9312
Unique regions TSS: 9312
/mnt/c/Users/Administrator/Documents/GitHub/PROmiRNA/seqan/include/seqan/sequence/string_base.h:448 Assertion failed : static_cast(pos) < static_cast(length(me)) was: 1450975284 >= 7 (Trying to access an element behind the last one!)

stack trace:
0 [0x7f6a56feefaf] ./PROmiRNA(+0x3bfaf)
1 [0x7f6a5700ce08] ./PROmiRNA(+0x59e08)
2 [0x7f6a5703a5ef] mergeAndFilter(std::vector<DatasetRecord, std::allocator >&, MatrixPair const&, MatrixSingle const&, std::vector<std::map<std::pair<seqan::String<char, seqan::Alloc >, GenomicLocation>, std::pair<unsigned int, bool>, std::less<std::pair<seqan::String<char, seqan::Alloc >, GenomicLocation> >, std::allocator<std::pair<std::pair<seqan::String<char, seqan::Alloc >, GenomicLocation> const, std::pair<unsigned int, bool> > > >, std::allocator<std::map<std::pair<seqan::String<char, seqan::Alloc >, GenomicLocation>, std::pair<unsigned int, bool>, std::less<std::pair<seqan::String<char, seqan::Alloc >, GenomicLocation> >, std::allocator<std::pair<std::pair<seqan::String<char, seqan::Alloc >, GenomicLocation> const, std::pair<unsigned int, bool> > > > > > const&, std::map<std::pair<seqan::String<char, seqan::Alloc >, GenomicLocation>, unsigned int, std::less<std::pair<seqan::String<char, seqan::Alloc >, GenomicLocation> >, std::allocator<std::pair<std::pair<seqan::String<char, seqan::Alloc >, GenomicLocation> const, unsigned int> > > const&, std::map<seqan::String<char, seqan::Alloc >, unsigned int, std::less<seqan::String<char, seqan::Alloc > >, std::allocator<std::pair<seqan::String<char, seqan::Alloc > const, unsigned int> > >&, seqan::String<char, seqan::Alloc > const&, seqan::String<char, seqan::Alloc > const&, std::map<seqan::String<char, seqan::Alloc >, seqan::String<char, seqan::Alloc >, std::less<seqan::String<char, seqan::Alloc > >, std::allocator<std::pair<seqan::String<char, seqan::Alloc > const, seqan::String<char, seqan::Alloc > > > > const&, std::vector<GenomicLocation, std::allocator >&, unsigned int) + 0xd9f
3 [0x7f6a56fe6400] main + 0x1dc0
4 [0x7f6a565b70b3] __libc_start_main + 0xf3
5 [0x7f6a56fe91f6] ./PROmiRNA(+0x361f6)

Aborted (core dumped)

sarahet commented 3 years ago

Hi @YucanChen

I apologize that we did not provide a better explanation; I will add this now. The gene starts file should mark the gene starts or promoters, e.g., the first exon or a pre-defined region around the TSS of the genes/transcripts you would like to consider for the analysis. These regions are excluded so that the promoter of another gene/transcript is not called by accident. If you have a file with this information and still get this error, could you please paste its first lines here? Then I will have a better chance of helping you.

Sara
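For readers following along, the "pre-defined region around the TSS" idea can be sketched as below. This is only an illustration under stated assumptions: it reads Ensembl-style GTF lines, uses only records whose feature column is "gene", and takes a hypothetical flank of 500 bp. The function name `tss_window` and the flank size are made up for this example, and the thread does not state the exact layout PROmiRNA expects for the -s file, so the coordinates would still need to be written out in whatever format the tool accepts.

```python
def tss_window(gtf_line, flank=500):
    """Return (chrom, start, end, strand) for a window around the TSS of a
    GTF 'gene' record, or None for comments and other feature types.

    Assumptions (not from the PROmiRNA docs): 1-based inclusive GTF
    coordinates, and a symmetric, hypothetical flank of 500 bp.
    """
    if gtf_line.startswith("#"):
        return None
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) < 9 or fields[2] != "gene":
        return None
    chrom, start, end, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]
    # The TSS is the start coordinate on the plus strand and the end
    # coordinate on the minus strand.
    tss = end if strand == "-" else start
    # Clamp at 1 so windows near the chromosome start stay valid.
    return chrom, max(1, tss - flank), tss + flank, strand


# Example with a made-up gene record (tab-separated, as in a GTF file).
example = '1\texample\tgene\t3073253\t3074322\t.\t+\t.\tgene_id "G1";'
print(tss_window(example))
```

A window like this would mark the region to exclude around each annotated gene start; the appropriate flank size is a judgment call for the analysis.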

YucanChen commented 3 years ago

Thank you @sarahet. I've obtained the TSS annotation data for Mus musculus (GRCm38.p1) from biomaRt; the file is in GFF3 format with a .gff suffix. However, the problem still occurred with the same error output, which seems to indicate reading past the end of an array. I wonder where the length of the array, "7", was defined? It does not seem likely that the problem is due to improperly formatted files, since most of them were downloaded from the official websites, and the .psem and repeat files came from the links in the paper. I can't figure out which files are problematic or whether something else is causing the issue. Could you help me with that?

sarahet commented 3 years ago

Could you please still paste the top 10 lines of the file you are using here, or send me a direct link to the file so that I can download it? Of course it could be a bug, but I need to see what the input actually looks like, regardless of what the file format should be.
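One way to collect those first lines, sketched here in Python with only the standard library: the helper name `head` is an assumption for this example, and it treats any path ending in .gz as gzip-compressed, since several of the inputs in this thread are distributed that way.

```python
import gzip
from itertools import islice


def head(path, n=10):
    """Return the first n lines of a text file, reading .gz files transparently."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # islice stops after n lines, so large files are not read in full.
        return [line.rstrip("\n") for line in islice(handle, n)]
```

For example, `print("\n".join(head("my_gene_starts.gff")))` (with a placeholder file name) prints the lines to paste into the issue.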

YucanChen commented 3 years ago

(1) These files were downloaded directly from the web:
http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/mm10.fa.gz
ftp://ftp.ensembl.org/pub/release-101/gtf/mus_musculus/Mus_musculus.GRCm38.101.gtf.gz
http://promirna.molgen.mpg.de/mm10_repeats.bed.gz
ftp://mirbase.org/pub/mirbase/22.1/genomes/mmu.gff3
ftp://mirbase.org/pub/mirbase/22.1/database_files/mirna.txt.gz
ftp://mirbase.org/pub/mirbase/22.1/database_files/mirna_context.txt.gz
(2) This file was converted to a wig file with the UCSC bigWigToWig tool:
http://hgdownload.cse.ucsc.edu/goldenPath/mm10/phastCons60way/mm10.60way.phastCons.bw
(3) This is the same file as the one you put on GitHub: TATA_box_jaspar.psem
(4) The TSS file, which I extracted from biomart in R and exported to GFF3 format using the rtracklayer package: TSS.mouse.GRCm38.zip

hufanglq commented 3 years ago

It seems to happen randomly. I had the same error, but I got past this step after trying several times. In the end, I got an error about not having enough CAGE data. I only supplied FANTOM5 CAGE peaks.

SianGol commented 3 years ago

@hufanglq Did you manage to overcome this issue? I have reached the same step and get this error:

Starting miRNA promoter prediction
Number of miRNAs for analysis: 1226
Number of overlaps between tags and miRNAs: ~/PRO_miRNA_input/CAGE/mm10_fair+new_CAGE_peaks_phase1and2.bed 9312
Unique regions TSS: 9312
Unable to convert '.' into unsigned int.
Number of TSS in dataset for EM algorithm: 0
Number of overlaps between tags and background regions: ~/PRO_miRNA_input/CAGE/mm10_fair+new_CAGE_peaks_phase1and2.bed 0
ERROR: No overlap found between CAGE tags and miRNAs/background. Rerun with more/other CAGE data.
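The "Unable to convert '.' into unsigned int" message suggests that some input column held a "." where a number was expected, although the output does not say which file or column. A rough diagnostic, assuming tab-separated input, is sketched below. The function name and the column choices are assumptions for illustration: 0-based columns 1, 2 and 4 correspond to start, end and score in a standard BED file, and other inputs would need different indices.

```python
def find_non_integer_fields(lines, int_columns):
    """Return (line_number, column_index, value) for every checked field
    that is not a plain unsigned integer.

    int_columns holds 0-based column indices; which columns a given tool
    parses as integers is an assumption the caller has to supply.
    """
    problems = []
    for line_number, line in enumerate(lines, start=1):
        # Skip blank lines, comments and UCSC-style header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        for col in int_columns:
            value = fields[col] if col < len(fields) else ""
            if not (value.isascii() and value.isdigit()):
                problems.append((line_number, col, value))
    return problems


# Made-up BED-like records: the first has "." in the score column.
records = [
    "chr1\t100\t200\tpeak1\t.\t+",
    "chr1\t300\t400\tpeak2\t12\t-",
]
print(find_non_integer_fields(records, [1, 2, 4]))
```

Running the same check over each input with `open(path)` in place of the list can narrow down which file contains the unexpected ".".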