oma219 closed this 8 months ago
Just to follow up, I found a separate bug in `ms_rle_string.hpp` that has to do with comparing unsigned chars to signed chars. This particular bug would convert any unsigned char > 127 to 1 when the BWT is loaded from file into an r-index object. I suspect this issue has not been encountered much, since most text files do not contain byte values > 127.
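To illustrate the kind of signed/unsigned mix-up described above, here is a minimal sketch. It is not the code from `ms_rle_string.hpp`: the names `TERMINATOR`, `clamp_signed`, and `clamp_unsigned` are hypothetical, and the "clamp small values to 1" check is only an assumed shape for the comparison. It shows how a byte above 127 held in a signed char compares as negative and ends up mapped to 1.

```cpp
#include <cassert>

// Hypothetical terminator value, used only for this illustration.
constexpr unsigned char TERMINATOR = 1;

// Buggy variant (assumed shape): the byte is held in a signed char, so the
// byte values 0x80..0xFF are negative and satisfy "c <= TERMINATOR".
unsigned char clamp_signed(signed char c) {
    if (c <= static_cast<signed char>(TERMINATOR)) {
        return TERMINATOR;  // bytes > 127 wrongly land here
    }
    return static_cast<unsigned char>(c);
}

// Fixed variant: compare as unsigned char, so only 0 and 1 are clamped.
unsigned char clamp_unsigned(unsigned char c) {
    if (c <= TERMINATOR) {
        return TERMINATOR;
    }
    return c;
}
```

Calling `clamp_signed(static_cast<signed char>(200))` returns 1 on the usual two's-complement platforms, while `clamp_unsigned(200)` leaves the byte unchanged. Reading and comparing the data as `unsigned char` throughout avoids depending on whether plain `char` is signed on a given platform.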
Hi Max,
There are two separate issues that occur when a reference text begins with a trigger string, where `mod p == 0`. Here is a brief description of the issues.

1. When using multiple threads, `moni` will use `newscan.x` to create the `*.parse` and `*.dict` files, but `newscan.x` has an issue where it does not report the first trigger string as a phrase, and this leads to `pfp_thresholds` creating an incorrect BWT. `newscanNT.x` does not have the same issue. It appears that `newscan.x` might require some more testing to ensure that there are no differences from the single-threaded version.

2. When the first w-mer is a trigger string, `compress_dictionary` downstream will write the compressed length of each phrase after stripping the trigger string, but in this case that length will be 0 for the first phrase. This causes an issue later on: when `procdic` combines the grammars of the parse and the dictionary, it tries to create a rule for the first phrase, but there is no text in the `*.dicz` to create a grammar for, so all the rules are pushed off by one. Long story short, it creates an incorrect SLP, so the matching statistics are incorrect because the random access to the text in the second pass is not correct.

... `*.dicz` and `*.dicz.len` and decrement all phrase IDs in the `*.parse` file, to ensure that the `*.dicz` and `*.parse` files have the same number of phrases and that they correspond to each other.

Here is an example text that causes this issue for testing: