Open ksahlin opened 3 years ago
with the smallest number of bits set after a Bitwise OR operation
That should be XOR, right?
If you use method3, note that there’s a POPCNT instruction that recent CPUs support. It’s quite possible that bitset.count
already uses it, but you should make sure. If your CPU has support and you use GCC, you can enable POPCNT with -mpopcnt
. If bitset.count doesn’t use POPCNT, you can also use the GCC builtin __builtin_popcount
(with -mpopcnt
, this will compile to the POPCNT instruction. Without it, it uses a function.)
Yes, I meant XOR, will fix it!
Thanks for this info on C++. Still so much to learn..
As pointed out by @ekimb, the hash function matters. In my above experiments, I used simple bit encoding A=00, C=01, G=10, T=11 without any jumbling of the bits (through a hash function). Below I show the "randomness" of
method2
together with applying robin_hash C++ hash function to the k-mers.method2
together with applying hash64 hash function from minimap2 codemethod3
together with applying hash64 hash function from minimap2 codeThe randomness looks better under Heng Li's hash64 function than under robin_hash
1.
4 36
10 60 **
16 82 *
19 60 **
23 60 **
27 60 **
32 82 *
36 102
41 96 **
46 86
50 82 *
53 124 *
60 96 **
66 96 **
72 141 **
78 141 **
82 124 *
86 141 **
93 167
96 141 **
2.
2 74 *
14 63 *
22 63 *
26 84 **
29 74 *
34 84 **
37 113
46 117
54 84 **
60 105
63 127 *
69 121 **
74 109
80 121 **
84 121 **
92 165 *
97 127 *
105 156
109 142
113 148
117 165 *
121 174
127 192 *
131 186
139 200
142 192 *
3.
2 63 *
14 46
22 84 *
26 63 *
29 97
34 80
37 113 *
46 84 *
54 127 *
60 131 **
63 113 *
69 131 **
74 131 **
80 131 **
84 156 *
92 121
97 127 *
105 156 *
109 152
113 142
117 174 *
121 186
127 174 *
131 192
139 177 *
142 177 *
Method 1 and 2 would not work well. Method 3 is better. I would do
k_2 = argmin_{k' \in W} h(k_1∘k')
k_1∘k_2
denotes the concatenation of k-mer k_1
and k_2
and h() can be minimap2's invertible hash function if the total k-mer length is below 32.
PS: just noticed that Daniel Liu suggested the nearly same thing on twitter except that he added a mask at the end. If you use an invertible hash function, you don't need to apply a mask again.
Also on notation, { k' | (h(k_1) + h(k')) % p }
is a set of k-mers. Applying argmin to this set doesn't have a clear meaning. Daniel's and my notation should be the correct one.
Thanks for those points @lh3.
For documentation, I will implement a small bake-off as soon as I have time, which will include the following
Let the hash function be denoted by h
.
( h(k_1) + h(k') ) % p
(method1 above)( (h(k_1) + h(k') ) & p
(method2 above)bitcount( h(k_1) ^ h(k') )
(method3 above)h(k_1) ^ h(k')
(global XOR), (also a variant: h(k_1 ^ k')
)h( k_1 || k' )
(concatenation)Viable combinations of (hashing, linking) seem to be (1,1)-(1,4), (2,1)-(2,4), (3,1)-(3,5) as the total length can be larger than 32.
@ekimb's idea is potentially wild - it must be investigated - but it is conceptually different and will not be considered in this bake-off.
p2
is selected for k_2
(repetition distribution)p2
.A little update on what I proposed:
it is h(k_1) ^ h(k')
, i.e., the XOR between the two hash codes, not just the hash of the XOR between the 2-bit encoding of the kmers (which is h(k_1 ^ k')
) The reason is that the XOR of two hash codes is another random quantity, the XOR between two bit-packed kmers will not be that random even if k' changes.
h
should be a hash function (like the ones you mentioned or murmur2
).
Ah, I see, sorry. Will try both of them.
Background
Randstrobes is a method for linking two or more k-mers together into a seed that can be used for sequence mapping. I intend the below post to be a self-contained rough description of randstrobes (but just in case, more formal details on strobemers are found here).
Creating a randstrobe
We obtain a randstrobe
r
consisting of two linked k-mersk_1
andk_2
by fixing the position ofk_1
and selectingk_2
in a window[w_1, w_2]
bases downstream fromk_1
through some function. The aim was to selectk_2
in a way that would yield a random but deterministic sampling ofk_2
in the window. Strobemers was introduced as a concept, i.e., I didn't care to study the optimal method for selectingk_2
. Originally, selectingk_2
was described ask_2 = argmin_{k' \in W}
( h(k_1) + h(k') ) % p
where
W
denotes the window to pick from. I will denote this asmethod 1
. It was realized by @shenwei356 in #1 (this post specifically) that MOD operation is relatively expensive, and that we could instead use bit arithmetic to dok_2 = argmin_{k' \in W}
( h(k_1) + h(k') ) & q
where
q
is a bitmask and&
is the bitwise AND operator. I will denote this asmethod 2
Problem
While we cannot achieve full randomness (nothing really is), it seems to be room for significant improvement on the randomness of sampling
k_2
while keeping speed. Below I pasted the example output from method 2 where we are pickingk_1
and samplingk_2
over a thinned space of syncmers as described in this preprint.In the below output
p1
andp2
denotes the positions ofk_1
andk_2
on a metagenomic contig sequence, respectively. Repetitions occurring two times are marked with a*
and repetitions occurring more than two times with**
. Also, this is not a cherry-picked example, it actually seems to get worse later on in the sequence.We can alleviate the repetitions by instead using a method (
method3
) I just came up with, which is:k_2 = argmin_{k' \in W}
bitcount( h(k_1) ^ h(k') )
which picks the
k'
with the smallest number of bits set after an XOR operation. See below the reduced repetitiveness and lower correlation. For further motivation,method3
increases the accuracy of strobealign by about 0.1% (preliminary small experiments), which is a big deal as strobealign was previously about 0.1-0.2% less accurate than BWA and minimap2. So this matters for read mapping!While some repetitions are expected even under randomness, it is good to reduce degenerate cases such as the
**
case seen above.Open questions
k_2
?method3
?)The three methods are all implemented here in a strobealign commit