maickrau / GraphAligner

MIT License
261 stars 32 forks source link

What's the difference between "iterateMinimizersSimple" and "iterateMinimizersReal" in /src/MinimizerSeeder.cpp #102

Closed YellowJason closed 5 months ago

YellowJason commented 5 months ago

Hi, I have some questions about /src/MinimizerSeeder.cpp

I have read your paper and tried to reimplement myself, but the results of minimizer seeding I obtained differ from those produced by your program.

I studied /src/MinimizerSeeder.cpp in your program. There are two minimizer functions, "iterateMinimizersSimple" and "iterateMinimizersReal". I run two functions on the same sequence and parameter with callback [](size_t pos, size_t kmer) { cout << "pos_S: " << pos << " kmer: " << kmer << endl; }, which is just print the output minimizer. Then I found that the outputs of the two functions are different, my work is the same as the output of "iterateMinimizersSimple".

I made a deeper study on "iterateMinimizersReal" and found something weird in the deque "window". The code saves (position, k-mer, hash value) into deque, but when I took elements out through std::get<2>(window.back()) I found the hash value is changed. Maybe there is something wrong between data type size_t in the deque and uint64_t in the hash function.

Change the declaration of window into std::deque<std::tuple<size_t, size_t, uint64_t>> can fix this problem and get the same results from two minimizer functions.

I want to ask which function (iterateMinimizersSimple and iterateMinimizersReal) is used in the current program. If you have time, could you help me check if the output of iterateMinimizersSimple and iterateMinimizersReal are really different or there is something wrong on my computer environment.

Thanks, Jason

maickrau commented 5 months ago

The release version uses only iterateMinimizersReal. iterateMinimizersSimple is there only for debug purposes, it's the naive O(n*k*w) implementation of minimizer winnowing while iterateMinimizersReal is the O(n) algorithm. Both functions are active if the compilation flag EXTRACORRECTNESSASSERTIONS is set, in which case it runs both algorithms and checks that the results are identical. Turns out that at some point a bug got introduced into the O(n) algorithm which caused some minimizers to be discarded, so that's why the outputs were different. The most recent commit in branch develop now has the bug fixed.