ksahlin / strobemers

A repository for generating strobemers and evalaution
71 stars 11 forks source link

How to get the most random randstrobes in compiled languages? #8

Open ksahlin opened 2 years ago

ksahlin commented 2 years ago

Background

Randstrobes is a method for linking two or more k-mers together into a seed that can be used for sequence mapping. I intend the below post to be a self-contained rough description of randstrobes (but just in case, more formal details on strobemers are found here).

Creating a randstrobe

We obtain a randstrobe r consisting of two linked k-mers k_1 and k_2 by fixing the position of k_1 and selecting k_2 in a window [w_1, w_2] bases downstream from k_1 through some function. The aim was to select k_2 in a way that would yield a random but deterministic sampling of k_2 in the window. Strobemers was introduced as a concept, i.e., I didn't care to study the optimal method for selecting k_2. Originally, selecting k_2 was described as

k_2 = argmin_{k' \in W} ( h(k_1) + h(k') ) % p

where W denotes the window to pick from. I will denote this as method 1. It was realized by @shenwei356 in #1 (this post specifically) that MOD operation is relatively expensive, and that we could instead use bit arithmetic to do

k_2 = argmin_{k' \in W} ( h(k_1) + h(k') ) & q

where q is a bitmask and & is the bitwise AND operator. I will denote this as method 2

Problem

While we cannot achieve full randomness (nothing really is), it seems to be room for significant improvement on the randomness of sampling k_2 while keeping speed. Below I pasted the example output from method 2 where we are picking k_1 and sampling k_2 over a thinned space of syncmers as described in this preprint.

In the below output p1 and p2 denotes the positions of k_1 and k_2 on a metagenomic contig sequence, respectively. Repetitions occurring two times are marked with a * and repetitions occurring more than two times with **. Also, this is not a cherry-picked example, it actually seems to get worse later on in the sequence.

p1 p2 
-----
0 78
4 60
8 37
13 92
17 85 *
24 85 *
28 98
32 113
37 92
41 88
48 95 *
60 95 *
68 133
74 106 *
78 106 *
85 143
88 121 *
92 121 *
95 149 **
98 149 **
106 149 **
113 149 **

We can alleviate the repetitions by instead using a method (method3) I just came up with, which is:

k_2 = argmin_{k' \in W} bitcount( h(k_1) ^ h(k') )

which picks the k' with the smallest number of bits set after an XOR operation. See below the reduced repetitiveness and lower correlation. For further motivation, method3 increases the accuracy of strobealign by about 0.1% (preliminary small experiments), which is a big deal as strobealign was previously about 0.1-0.2% less accurate than BWA and minimap2. So this matters for read mapping!

p1 p2 
0 28
4 41
8 85 *
13 48 *
17 48 *
24 92
28 88 *
32 85 *
37 78
41 121
48 88 *
60 95
68 126 *
74 113
78 116
85 126 *
88 121
92 158
95 130
98 133
106 149 *
113 149 *

While some repetitions are expected even under randomness, it is good to reduce degenerate cases such as the ** case seen above.

Open questions

  1. How to measure "randomness" in picking k_2?
  2. Which operation gives the best randomness? (is there a better one than method3?)
  3. Which operation is the fastest?

The three methods are all implemented here in a strobealign commit

marcelm commented 2 years ago

with the smallest number of bits set after a Bitwise OR operation

That should be XOR, right?

If you use method3, note that there’s a POPCNT instruction that recent CPUs support. It’s quite possible that bitset.count already uses it, but you should make sure. If your CPU has support and you use GCC, you can enable POPCNT with -mpopcnt. If bitset.count doesn’t use POPCNT, you can also use the GCC builtin __builtin_popcount (with -mpopcnt, this will compile to the POPCNT instruction. Without it, it uses a function.)

ksahlin commented 2 years ago

Yes, I meant XOR, will fix it!

Thanks for this info on C++. Still so much to learn..

ksahlin commented 2 years ago

As pointed out by @ekimb, the hash function matters. In my above experiments, I used simple bit encoding A=00, C=01, G=10, T=11 without any jumbling of the bits (through a hash function). Below I show the "randomness" of

  1. method2 together with applying robin_hash C++ hash function to the k-mers.
  2. method2 together with applying hash64 hash function from minimap2 code
  3. method3 together with applying hash64 hash function from minimap2 code

The randomness looks better under Heng Li's hash64 function than under robin_hash

1.

4 36
10 60 **
16 82 *
19 60 **
23 60 **
27 60 **
32 82 *
36 102
41 96 **
46 86
50 82 *
53 124 *
60 96 **
66 96 **
72 141 **
78 141 **
82 124 *
86 141 **
93 167
96 141 **

2.

2 74 *
14 63 *
22 63 *
26 84 **
29 74 *
34 84 **
37 113
46 117
54 84 **
60 105
63 127 *
69 121 **
74 109
80 121 **
84 121 **
92 165 *
97 127 *
105 156
109 142
113 148
117 165 *
121 174
127 192 *
131 186
139 200
142 192 *

3.

2 63 *
14 46
22 84 *
26 63 *
29 97
34 80
37 113 *
46 84 *
54 127 *
60 131 **
63 113 *
69 131 **
74 131 **
80 131 **
84 156 *
92 121
97 127 *
105 156 *
109 152
113 142
117 174 *
121 186
127 174 *
131 192
139 177 *
142 177 *
lh3 commented 2 years ago

Method 1 and 2 would not work well. Method 3 is better. I would do

k_2 = argmin_{k' \in W} h(k_1∘k')

k_1∘k_2 denotes the concatenation of k-mer k_1 and k_2 and h() can be minimap2's invertible hash function if the total k-mer length is below 32.

PS: just noticed that Daniel Liu suggested the nearly same thing on twitter except that he added a mask at the end. If you use an invertible hash function, you don't need to apply a mask again.

Also on notation, { k' | (h(k_1) + h(k')) % p } is a set of k-mers. Applying argmin to this set doesn't have a clear meaning. Daniel's and my notation should be the correct one.

ksahlin commented 2 years ago

Thanks for those points @lh3.

For documentation, I will implement a small bake-off as soon as I have time, which will include the following

Hashing

  1. No hash - simple 2bit encoding of nucleotides
  2. Thomas Wang hash
  3. xxhash3

Let the hash function be denoted by h.

Linking

  1. Sahlin ( h(k_1) + h(k') ) % p (method1 above)
  2. Shen ( (h(k_1) + h(k') ) & p (method2 above)
  3. Sahlin bitcount( h(k_1) ^ h(k') ) (method3 above)
  4. Pibri h(k_1) ^ h(k') (global XOR), (also a variant: h(k_1 ^ k'))
  5. Liu-Patro-Li h( k_1 || k' ) (concatenation)

Viable combinations of (hashing, linking) seem to be (1,1)-(1,4), (2,1)-(2,4), (3,1)-(3,5) as the total length can be larger than 32.

@ekimb's idea is potentially wild - it must be investigated - but it is conceptually different and will not be considered in this bake-off.

Metrics

jermp commented 2 years ago

A little update on what I proposed: it is h(k_1) ^ h(k') , i.e., the XOR between the two hash codes, not just the hash of the XOR between the 2-bit encoding of the kmers (which is h(k_1 ^ k')) The reason is that the XOR of two hash codes is another random quantity, the XOR between two bit-packed kmers will not be that random even if k' changes. h should be a hash function (like the ones you mentioned or murmur2).

ksahlin commented 2 years ago

Ah, I see, sorry. Will try both of them.