maxrossi91 / moni

MONI: A Pangenomic Index for Finding MEMs
MIT License
37 stars 9 forks source link

Bugfix related to newscan.x and creation of SLP #5

Closed oma219 closed 8 months ago

oma219 commented 9 months ago

Hi Max,

There are two separate issues that occur when a reference text begins with a trigger string, where mod p == 0. Here is a brief description of the issues.

  1. When using multiple threads, moni will use newscan.x to create the *.parse and *.dict file but newscan.x has a issue where it will not report the first trigger string as a phrase and this leads to pfp_thresholds creating an incorrect BWT.

    • Solution/Workaround: Not use any multi-threading since newscanNT.x does not have that same issue. It appears like newscan.x might require some more testing to ensure that there are no differences with the single-threaded version.
  2. When the first w-mer is a trigger string, downstream, compress_dictionary will write the compressed length of each phrase after stripping the trigger string but that will be 0. This causes an issue later on because when procdic combines the grammar of the parse and dictionary, it will try to create a rule for the first phrase but there is no text in the *.dicz to create a grammar for so it pushes all the rules off by one. Long story short, it creates an incorrect SLP so matching statistics are incorrect because the random-access to the text in the second pass it not correct.

    • Solution/Workaround: Remove that first phrase from *.dicz and *.dicz.len and decrement all phrase ID in the *.parse file to ensure the *.dicz and *.parse file have the same number of phrases and they correspond to each other.

Here is an example text that causes this issue for testing:

>example_reference
AAATATATATAAATATATATATAAGTAAATATAAATATTATATAGATATATAAATATATAAATATATATATAATATATAA
TAAATATATATATAAATAAATATAAATATATATAAATATATATAAATAAATATATATAAATATATATAAATATATAAATA
TATATATATATATATATATAGTACTGCCTGCTGGAGAGTCCATTCTAGATAACCAACAAACCCTCAAACTTAGTTCACTA
TGTTTACTCCCCAAATCTGCTTCCCTTCTCATGTGATGCCAGAAAAGATATTAAACATTCATCCCCTTGCCAAGGCTGGG
oma219 commented 9 months ago

Just to followup, I found a separate bug in the ms_rle_string.hpp, it has to do with comparing unsigned chars to signed chars. This particular bug would convert any unsigned chars > 127 to 1 when BWT is loaded from file into an r-index object. I suspect this issue has not been encountered much at all since most text files would not use an unsigned char > 127.