maxrossi91 / moni

MONI: A Pangenomic Index for Finding MEMs
MIT License

sd_vector_builder: requested capacity is larger than vector size. #4

Closed brunelloandrea closed 1 month ago

brunelloandrea commented 1 year ago

Hello, unfortunately, after installing Moni with the .sh script, I am now facing another issue. When I try to execute the example code:

moni build -r data/SARS-CoV2/SARS-CoV2.1k.fa.gz -o sars-cov2 -f

I get the following error:

==== Command line:
 ./moni-0.2.0-Linux/bin/newscanNT.x ./SARS-CoV2.1k.fa.gz -w 10 -p 100 -f -s
Windows size: 10
Stop word modulus: 100
Total input symbols: 0
Found 1 distinct words
Parsing took: 0 wall clock seconds
Sum of lenghts of dictionary words: 11
Total number of words: 1
Writing plain dictionary and occ file
Dictionary construction took: 0 wall clock seconds
Generating remapped parse file
Remapping parse file took: 0 wall clock seconds
==== Elapsed time: 0 wall clock seconds
malloc_count ### exiting, total: 90960, peak: 54361, current: 4096
[INFO] 14:49:21 - Message: Building the sequence index
terminate called after throwing an instance of 'std::runtime_error'
  what(): sd_vector_builder: requested capacity is larger than vector size.

I have tried to look for it online, but unfortunately I cannot find anything.

maxrossi91 commented 1 year ago

Hi, Andrea,

Something doesn't seem right. The input appears to be empty: the log reports "Total input symbols: 0". Can you check whether the gzipped file actually contains anything?
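One way to run that check is a minimal Python sketch like the one below. It is not part of Moni; the function name count_fasta_symbols is hypothetical, and it assumes the file is a gzipped FASTA file (header lines starting with ">", sequence on the remaining lines).

```python
import gzip


def count_fasta_symbols(path):
    """Return (header_lines, sequence_symbols) for a gzipped FASTA file.

    Hypothetical helper for illustration only; it is not part of Moni.
    """
    headers = 0
    symbols = 0
    # "rt" opens the gzip stream in text mode so lines are decoded to str.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                headers += 1  # FASTA header line
            else:
                symbols += len(line)  # sequence characters on this line
    return headers, symbols
```

Calling count_fasta_symbols on the input file and getting 0 sequence symbols would be consistent with the "Total input symbols: 0" line in the log, and would point to the file itself rather than to the index construction.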

brunelloandrea commented 1 year ago

Ok, thanks. With the SARS file from GitHub it works, so the problem is probably in how I am creating my files. As far as I understand, I should always build a text file with a single contiguous string on a single line, then compress it using gzip (?)

Specifically, in my case, I am considering binary strings, made up of many 0s and sparse groups of 1s. The reference string is typically around 1 million characters long, while the other one is typically around 50,000.

maxrossi91 commented 1 year ago

I see. The issue should not be the compression or the single line (as long as you use the -f flag, which means the input is in FASTA format). If you use binary strings, you may need to tune the -p and -w parameters so that trigger strings can be found. The default values are -w 10 and -p 100. In your case I would probably try -w 5 and -p 30, or something along those lines.

brunelloandrea commented 1 year ago

Alright, thank you. How should I interpret those parameters, intuitively? I have read the original article, but I am not quite sure.

maxrossi91 commented 1 month ago

Hi, @brunelloandrea, the w and p parameters control the prefix-free parsing step. In particular, the w parameter is the length of a trigger string (i.e., a string that delimits the phrases of the parse), and the p parameter can be interpreted as the average distance between trigger strings.

In practice, trigger strings are identified as all w-mers of the input text whose Karp-Rabin hash is congruent to 0 modulo p.
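To make the rule above concrete, here is a minimal Python sketch of selecting trigger strings with a rolling Karp-Rabin hash. It is only an illustration under stated assumptions: the function name trigger_positions is hypothetical, and the hash base (256) and prime modulus (2^61 - 1) are arbitrary choices, not necessarily the ones used by Moni, so the positions it reports will not match the real parse.

```python
def trigger_positions(text, w, p, base=256, prime=(1 << 61) - 1):
    """Return start positions of w-mers whose Karp-Rabin hash is 0 modulo p.

    Illustrative sketch only: base and prime are assumed values,
    not taken from the Moni source code.
    """
    data = text.encode("ascii")
    if len(data) < w:
        return []  # no complete window fits in the text

    # base^(w-1) mod prime, used to remove the leftmost symbol when sliding.
    lead = pow(base, w - 1, prime)

    # Hash of the first window.
    h = 0
    for c in data[:w]:
        h = (h * base + c) % prime

    positions = []
    if h % p == 0:
        positions.append(0)

    # Slide the window one symbol at a time, updating the hash in O(1).
    for i in range(w, len(data)):
        h = (h - data[i - w] * lead) % prime  # drop the outgoing symbol
        h = (h * base + data[i]) % prime      # append the incoming symbol
        if h % p == 0:
            positions.append(i - w + 1)
    return positions
```

For example, trigger_positions(s, 5, 30) lists where length-5 trigger strings would start in s under these assumed hash settings. Note that a binary string has at most 2^w distinct w-mers, and a mostly-zero string uses only a few of them, so it can happen that none of the w-mers present hash to 0 modulo p; trying different w and p values is one way to check whether any trigger strings appear.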