jermp / sshash

A compressed, associative, exact, and weighted dictionary for k-mers.

Errors while indexing small test sequence sets with even values of m for small k #33

Closed by hmusta 11 months ago

hmusta commented 11 months ago

Hi Giulio,

Thank you for developing and providing this tool!

I was testing it out on some small data sets and noticed that for even values of m it fails to build an index.

I'm testing with the following sequence inputs:

>1
CATGTACTAGCTGATCGTAGCTAGCTAGC
>2
AAAAAAAAAAAA

and the following build command:

./sshash build -i dump.fa -k 12 -m 6 -o dump_sshash

The above command produces this output:

k = 12, m = 6, seed = 1, l = 6, c = 3, canonical_parsing = false, weighted = false
reading file 'dump.fa'...
m_buffer_size 29411764
sorting buffer...
saving to file './sshash.tmp.run_1701436753917127000.minimizers.0.bin'...
read 2 sequences, 41 bases, 19 kmers
num_kmers 19
num_super_kmers 5
num_pieces 3 (+3.47368 [bits/kmer])
=== step 1: 'parse_file' 0.000847 [sec] (44578.9 [ns/kmer])
terminate called after throwing an instance of 'std::runtime_error'
  what():  mmap failed
Abort trap: 6
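
As a sanity check on the parsing step, the counts in the log above (2 sequences, 41 bases, 19 kmers) can be reproduced by hand: a sequence of length n contributes n - k + 1 k-mers. The sketch below is only an illustration of that arithmetic, not SSHash code; `count_kmers` is a hypothetical helper, and it counts k-mer occurrences rather than distinct k-mers.

```python
def count_kmers(sequences, k):
    """Return (num_sequences, num_bases, num_kmers) for a list of sequences.

    Hypothetical helper for illustration; counts k-mer occurrences,
    not distinct k-mers.
    """
    num_bases = sum(len(s) for s in sequences)
    # A sequence of length n contains n - k + 1 k-mers (none if n < k).
    num_kmers = sum(max(len(s) - k + 1, 0) for s in sequences)
    return len(sequences), num_bases, num_kmers


# The two sequences from dump.fa above.
seqs = ["CATGTACTAGCTGATCGTAGCTAGCTAGC", "AAAAAAAAAAAA"]
print(count_kmers(seqs, 12))  # (2, 41, 19)
```

So the input is parsed as expected (29 - 12 + 1 = 18 k-mers from the first sequence, 1 from the second), and the failure happens after this step.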

I don't see any errors when I run it using valgrind, but the backtrace shows that it's aborting after this line:

mm::file_source<unsigned long>::open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int) (mm_file.hpp:133)

It does work, however, if I set m = 1, 3, 5, or 7; for m >= 9 it segfaults or produces other errors. Strangely, I get different behaviour in different environments: it fails for all values of m on my Mac with g++12, but works with the values mentioned above on a Linux system with g++9. Also, while exploring this, I replaced CATGTACTAGCTGATCGTAGCTAGCTAGC with CATGTAGCTGATCGTAGCTAGCTAGC, and then it works for even but not odd values of m on my Mac.

Please let me know if I can provide any more information to help debug this.

jermp commented 11 months ago

Hi Harun, and thank you for your interest in SSHash!

It should be fixed now, as of https://github.com/jermp/sshash/commit/d3ea2c49eb7aad4ed544a525af3e16baab849f53. The problem was caused by PTHash trying to access an empty file on disk.
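
For readers unfamiliar with this class of failure: memory-mapping a zero-length file is generally rejected, so code that maps a temporary file has to handle the empty case separately. The Python sketch below is only an analogy, not the actual SSHash or PTHash code (which is C++), and it does not reproduce the fix in the commit above. `try_mmap` and `load_bytes` are hypothetical helper names.

```python
import mmap
import os
import tempfile


def try_mmap(path):
    """Map `path` read-only and return the mapped size, or None if mapping fails."""
    with open(path, "rb") as f:
        try:
            # Length 0 means "map the whole file"; CPython raises
            # ValueError when the file is empty.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                return len(m)
        except ValueError:
            return None


def load_bytes(path):
    """Read a file through mmap, treating an empty file as a special case."""
    if os.path.getsize(path) == 0:
        return b""  # nothing to map, so skip mmap entirely
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return bytes(m)


with tempfile.TemporaryDirectory() as tmp:
    empty = os.path.join(tmp, "empty.bin")
    open(empty, "wb").close()  # create a zero-length file
    print(try_mmap(empty))     # None: the mapping is rejected
    print(load_bytes(empty))   # b'': the guarded version succeeds
```

Checking the file size before mapping is one simple way to guard against this; whether a tiny input produces such an empty intermediate file depends on the build parameters.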

I tried ./sshash build -i dump.fa -k 12 -m M --check --verbose for all values M = 1..12 and it works correctly.

Let me know if that is ok for you too.

-Giulio

PS. As a side note, I should also say that SSHash is not meant to index such tiny examples. For me, the minimum interesting size is a single whole bacterial genome, such as Salmonella or E. coli.

hmusta commented 11 months ago

Thank you for fixing this! In the end, these corner cases help us ensure that we don't run into surprises later when working on larger data sets.

jermp commented 11 months ago

Yeah, true. You're welcome!