ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License

Memory reduction #173

Closed. marcelm closed this issue 1 year ago.

marcelm commented 1 year ago

I’ve been trying to understand the index data structures. I’ll summarize here for reference, but will also write down a suggestion for how to perhaps reduce memory usage.

  1. A vector of randstrobes is created ("flat vector"). The elements are structs ("MersIndexEntry") with the following members:

    • hash (8 bytes)
    • position (4 bytes)
    • reference index
    • second strobe offset (packed together with the reference index into a 4-byte field)

      The vector is sorted by hash.

  2. The randstrobes vector is split up into two vectors:

    • One vector (h_vector) contains only the hashes
    • One vector (mers_index) contains the other fields (position and packed reference index/second strobe offset)

    Both vectors need the same amount of memory (8 bytes per element).

  3. The flat vector is deallocated.
  4. A hash table (kmer_lookup) is created from the h_vector that maps a hash value to a (mers_index position, count) pair. The key needs 8 bytes and the value needs 4+4=8 bytes as well. The hash table has low overhead, so memory usage per randstrobe is only a little bit higher than 16 bytes.

    To find the occurrences of a given randstrobe (seed), its hash is looked up in kmer_lookup, which yields an index and a count; the corresponding entries are then found at mers_index[index:index+count] (see the sketch after this list).

  5. The h_vector is deallocated.
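Roughly, in C++ the layout looks like this. This is a sketch based on the description above, not the exact code; the packed field name and the exact value types are placeholders:

```c++
#include <cstdint>
#include <utility>
#include <vector>
#include "robin_hood.h"

struct MersIndexEntry {        // element of the "flat vector" (step 1), 16 bytes
    uint64_t hash;             // randstrobe hash
    uint32_t position;         // position on the reference
    uint32_t packed;           // reference index + second strobe offset
};

// Step 2: the flat vector is split into two parallel vectors of 8 bytes/element.
using HVector = std::vector<uint64_t>;                       // hashes only
struct IndexEntry { uint32_t position; uint32_t packed; };
using MersIndex = std::vector<IndexEntry>;

// Step 4: hash -> (offset into mers_index, count); 8-byte key, 8-byte value.
using KmerLookup =
    robin_hood::unordered_flat_map<uint64_t, std::pair<uint32_t, uint32_t>>;
// All occurrences of a hash are then mers_index[offset .. offset + count).
```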

An issue with this procedure is that the hash values exist twice at the end of step 4, increasing peak memory usage. One way to solve this would be to write the hash values to disk when splitting up the flat vector, and then read them back in when creating the hash table. This may not even be slower if enough RAM is available because the file would be in the filesystem cache.
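Something as simple as the following should do for the spill. This is a sketch only, with error handling and the choice of temporary file omitted (std::tmpfile() would be one option):

```c++
#include <cstdint>
#include <cstdio>
#include <vector>

// Write the hash values to an already opened temporary file ...
void write_hashes(const std::vector<uint64_t>& hashes, std::FILE* f) {
    std::fwrite(hashes.data(), sizeof(uint64_t), hashes.size(), f);
}

// ... and read them back when the hash table is about to be built.
std::vector<uint64_t> read_hashes(std::FILE* f, size_t n) {
    std::vector<uint64_t> hashes(n);
    std::rewind(f);
    std::fread(hashes.data(), sizeof(uint64_t), n, f);  // error handling omitted
    return hashes;
}
```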

The duplicate h_vector does not exist when using a pre-generated index, which contains only the kmer_lookup and mers_index tables.

Here is another suggestion that should have greater impact. I noticed that the count field takes up 4 bytes but contains the value 1 most of the time (most strobemers are unique), which seems a bit of a waste. Reducing the count to a single byte was my first idea, but that is difficult to achieve because the struct size would be padded to 8 bytes anyway (to ensure aligned memory access).
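To illustrate the padding point (assuming a typical 64-bit ABI with default struct layout):

```c++
#include <cstdint>

struct ValueNow      { uint32_t position; uint32_t count; };   // 4 + 4 bytes
struct ValueSmallCnt { uint32_t position; uint8_t  count; };   // 4 + 1 bytes

// With default layout, the smaller count buys nothing: the struct is padded
// back up so that position stays 4-byte aligned in an array of entries.
static_assert(sizeof(ValueNow) == 8, "8 bytes of data");
static_assert(sizeof(ValueSmallCnt) == 8, "5 bytes of data, padded to 8");
```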

Instead, I think we can get rid of count entirely if we use one bit in the mers_index entry to indicate whether there is another entry after it that has the same hash. Instead of iterating over mers_index[index:index+count], one would start at mers_index[index] and iterate until that bit was unset.

There are some places in the code where the count attribute is used directly. This requires constant time now, but would need to change to an iteration. If that turns out to be a problem, we could add a second hash table that maps hash values to counts, but only if the count is greater than 1 (or some constant). That table should be quite small.
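A rough sketch of how the iteration could look. I put the flag bit into the packed reference-index/offset field here purely for illustration; where it should actually live is open:

```c++
#include <cstddef>
#include <cstdint>
#include <vector>

struct IndexEntry {
    uint32_t position;
    uint32_t packed;   // top bit: "another entry with the same hash follows"
};

constexpr uint32_t SAME_HASH_CONTINUES = 1u << 31;

// Visit every entry for one hash, starting at the offset found in kmer_lookup.
template <typename Visit>
void for_each_occurrence(const std::vector<IndexEntry>& mers_index,
                         size_t offset, Visit visit) {
    for (size_t i = offset; ; ++i) {
        const IndexEntry& entry = mers_index[i];
        visit(entry);
        if (!(entry.packed & SAME_HASH_CONTINUES)) {
            break;   // last entry for this hash
        }
    }
}
```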

ksahlin commented 1 year ago

These are great ideas! Also, your description of the "process" is correct. The only thing I am a bit hesitant about is "The hash table has low overhead"; from what I read, it has high overhead. However, this does not invalidate your suggestions. It just makes them more significant, as the size of the hash table will decrease quite a bit with your second idea.

With your second idea, do you mean to, e.g., set the most significant bit of the 4-byte position entry to 1 if the following entry corresponds to a new hash value? Then iterate over the vector starting from mers_index[index] and continue as long as that bit is 0. This would still leave capacity for chromosomes of up to 2^31 = 2.1 Gbp, which works for the large majority of organisms.

Any prediction about cache misses and runtime? When iterating over mers_index[index:index+count], the whole block will probably be pulled into some high-priority cache level? I am not sure we would get the same efficiency if we don't know the length beforehand. (Just guessing here.)

marcelm commented 1 year ago

The only thing I am a bit hesitant about is "The hash table has low overhead"; from what I read, it has high overhead.

I got that from the hash table benchmark. The author says:

"100 million int-int pairs take at least 1526 MB", and he then measured the robin_hood::unordered_flat_map (which we use) to need 1717 MB.

So it would appear that the overhead in that case is only ~12%. (I am a bit confused by the numbers because I get 1526 MB only if I assume that an int uses 8 bytes, but it typically has 4 bytes.)

Looking at this a bit closer, I think the ~10% figure is highly optimistic because it happens to be close to the lowest possible overhead: the table can only have sizes that are powers of two and the maximum load factor is 0.8, so a table with 1E8 elements has a capacity of 2^27 = 134217728 and a load factor of about 0.75 (1E8/2^27). With a few million more elements, the table would resize and the load factor would drop to roughly 0.37, with an overhead of much more than just 10%.
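Spelling out the arithmetic (this is just the power-of-two capacity rule with a 0.8 maximum load factor, nothing library-specific):

```c++
#include <cstdint>
#include <cstdio>

int main() {
    const double max_load = 0.8;
    const uint64_t n = 100000000;   // 1E8 elements

    // Smallest power-of-two capacity that keeps the load factor <= 0.8.
    uint64_t capacity = 1;
    while (n > max_load * capacity) {
        capacity *= 2;
    }
    std::printf("capacity = %llu, load factor = %.2f\n",
                (unsigned long long)capacity, (double)n / (double)capacity);
    // Prints: capacity = 134217728, load factor = 0.75
    // One resize later, the capacity doubles and the load factor halves to ~0.37.
}
```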

I’ll need to reply to your other comments later.

ksahlin commented 1 year ago

Your reasoning is more thorough than mine; my comment was based on "anecdata" :)

If I recall correctly, in v0.7.1:

  1. the flat_vector was using three slots: hash (8 bytes), position (4 bytes), reference index + second strobe offset (4 bytes)
  2. the hash table was using 8 bytes as keys and 4+4 bytes as values for position and count.
  3. The reference strings took some space.

Still, I remember seeing around 60-70% of memory being occupied by the hash table. This led me to believe that the representation of the hash table used a lot of 'extra memory' (not only because of extra slots, but perhaps due to other things with the layout).

I think robin_hood has two different hash table implementations with a tradeoff between speed and memory (see here). If the implementation is not specified, it tries to infer the best kind based on the data. In strobealign, we don't specify the implementation, I think.

ksahlin commented 1 year ago

I now realize that the second point you propose (discarding count) may not be as simple as my suggestion. One thing that is convenient about having the explicit count is that it:

  1. makes it easy to ignore repetitive seeds with a quick check of whether the seed occurrence is higher than filter_cutoff (here) and skip forward.
  2. lets us use the counts to sort the seeds in order of repetitiveness when we need to enter rescue mode (find_nams_rescue).

A solution for 1 might be to use an additional bit that indicates the seed has higher abundance than filter_cutoff.

I am still not sure how to deal with 2. A possible solution may be to use "some more bits" for counts at even intervals between filter_cutoff and the hard cutoff 1000, and sort based on that. For example, 3 more bits could signal counts above 250, 500 and 750. At this point the algorithm is no longer guaranteed to perform the same as the baseline. We would then have 5-6 bits representing:

Bit 1: is identical to the next hash value (same seed)
Bit 2: is higher than filter_cutoff
Bit 3: is hard masked, i.e., count > 1000 (or, if this is filtered out already at indexing time, we may save this bit*)
Bits 4-5: count higher than e.g. 250, 500

The problem is that using anything more than 3 bits is going to be too restrictive for organisms with larger-than-human-sized chromosomes (2^(32-3) = 537M positions). The two bits for "is identical to the next hash value" and "is higher than filter_cutoff" seem viable though. The rescue function is only implemented because it was the best solution I could come up with to speed up finding merged matches (NAMs in the code - but not exactly the same definition as NAM in the first strobemers paper) for very repetitive seeds in strobealign.

(* Since I am pretty sure we never use seeds with an occurrence of 1000 and above (see e.g. https://github.com/ksahlin/strobealign/blob/main/src/aln.cpp#L130), it may be possible to discard them at the indexing step. The hardcoded 1000 should probably be a hidden parameter though.)
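To make the bit budget concrete, a purely hypothetical packing of these flags into the 32-bit position field could look like this (illustrative only; the names are made up):

```c++
#include <cstdint>

// Hypothetical flag layout inside the 32-bit position field. Every flag bit
// taken from the position halves the maximum coordinate; with 3 flag bits,
// 2^29 = 537M positions remain. Bits 4-5 (count thresholds such as 250/500)
// would cost further coordinate range.
constexpr uint32_t SAME_HASH_CONTINUES = 1u << 31;  // bit 1: next entry has the same hash
constexpr uint32_t ABOVE_FILTER_CUTOFF = 1u << 30;  // bit 2: count > filter_cutoff
constexpr uint32_t HARD_MASKED         = 1u << 29;  // bit 3: count > 1000 (could be dropped if
                                                    // such seeds are removed at indexing time)
constexpr uint32_t POSITION_MASK       = (1u << 29) - 1;

inline uint32_t position_of(uint32_t field) { return field & POSITION_MASK; }
```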

marcelm commented 1 year ago

If I recall correctly, in v0.7.1:

  1. the flat_vector was using three slots: hash (8 bytes), position (4 bytes), reference index + second strobe offset (4 bytes)

  2. the hash table was using 8 bytes as keys and 4+4 bytes as values for position and count.

  3. The reference strings took some space.

That is still correct. Here are some rough numbers for fruit fly (25 million strobemers):

So 50% for the hash table is correct. If I load the index from disk, the h_vector is never created and as expected, peak memory usage goes down to ~950 MB.

I think robin_hood has two different hash table implementations with a tradeoff between speed and memory (see here). If the implementation is not specified, it tries to infer the best kind based on the data. In strobealign, we don't specify the implementation, I think.

Yes, I tested that and in our case, the unordered_flat_map is chosen automatically. Maybe I should force using the other variant just to see what happens.

Since I am pretty sure we never use seeds with an occurrence of 1000 and above (see e.g. https://github.com/ksahlin/strobealign/blob/main/src/aln.cpp#L130), it may be possible to discard them at the indexing step. The hardcoded 1000 should probably be a hidden parameter though.

Good idea. It seems though, that this won’t help that much. I looked at the statistics for the fruit fly, and only 10 seeds (out of 25 million) have a count >= 1000 (the highest has 5953).

Regarding your thoughts about omitting the count: I agree that using a single bit in this way isn’t that helpful if getting the counts in constant time is required. And spending more and more bits on this won’t work either, as you observed.

I spent some time today implementing a variant where counts greater than 1 are kept in a separate hash table. If the seed exists in the main hash table, but not in the counts table, its count must be 1. For the fruit fly, only 3% of the seeds have a count greater than one, so the memory usage is very low compared to the main table. There is some slowdown because there are now two hash table lookups, but this can probably be mitigated.
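Roughly, the lookup in that variant looks like this. This is a sketch of the idea, not the actual patch; the type aliases are placeholders:

```c++
#include <cstdint>
#include "robin_hood.h"

// Main table keeps only the offset; counts > 1 live in a side table.
using KmerLookup  = robin_hood::unordered_flat_map<uint64_t, uint32_t>;  // hash -> offset
using CountLookup = robin_hood::unordered_flat_map<uint64_t, uint32_t>;  // hash -> count, only if > 1

// Returns the count of a seed, or 0 if the seed is not in the index.
inline uint32_t seed_count(const KmerLookup& kmer_lookup,
                           const CountLookup& counts, uint64_t hash) {
    if (kmer_lookup.find(hash) == kmer_lookup.end()) {
        return 0;                    // seed not present in the reference
    }
    auto it = counts.find(hash);     // second lookup, needed only for the count
    return it == counts.end() ? 1 : it->second;
}
```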

As it turns out unfortunately, this doesn't save any memory at all, very likely because the 8-byte key forces the hash table entries to be aligned at multiples of 8 bytes. I was afraid of this before, but didn't quite know how the entries are stored internally, so I thought it was worth an attempt. I'm still thinking about other ways to achieve some memory savings, so I'm not quite ready to give up. It seems such a waste - there are so many zeros in these tables ...

Next, I’ll look into flushing the h_vector to disk. That should hopefully work.

ksahlin commented 1 year ago

Yes, I tested that and in our case, the unordered_flat_map is chosen automatically. Maybe I should force using the other variant just to see what happens.

Yes, sounds like a good experiment.

Good idea. It seems though, that this won’t help that much. I looked at the statistics for the fruit fly, and only 10 seeds (out of 25 million) have a count >= 1000 (the highest has 5953).

Okay, I guess it won't help as much as I thought. But I think we should use hg38 (or even CHM13) as a standard for what to expect in real data. CHM13 should definitely have many more of those seeds (from telomere regions). I suspect plant genomes like maize and rye are even worse. But overall, maybe this won't lead to a groundbreaking improvement.

I spent some time today implementing a variant where counts greater than 1 are kept in a separate hash table. If the seed exists in the main hash table, but not in the counts table, its count must be 1. For the fruit fly, only 3% of the seeds have a count greater than one, so the memory usage is very low compared to the main table. There is some slowdown because there are now two hash table lookups, but this can probably be mitigated.

This sounds like a good idea. If the keys in the two tables are disjoint, we would need two lookups only if the seed occurs more than once or is not present in the reference. Otherwise it is implied that the key is present only once.

As it turns out unfortunately, this doesn’t save any memory at all, very likely because the 8-byte key forces the hash table entries to be aligned at multiples of 8 bytes.

This got me thinking of using 32 bit keys instead as proposed in https://github.com/ksahlin/strobealign/issues/27.

Another idea: how about a 2-byte integer for counts? If there were also a way to use 4+2 bytes as hash keys, then the alignment of keys and values would match. But maybe it still works in even 8-byte chunks, I don't know.
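For what it's worth, with default (non-packed) struct layout the sizes would come out like this (typical 64-bit ABI assumed; field names are made up):

```c++
#include <cstdint>
#include <utility>

struct Key46   { uint32_t hi;       uint16_t lo;    };  // 4 + 2 bytes of data
struct Value42 { uint32_t position; uint16_t count; };  // 4 + 2 bytes of data

// Without explicit packing, both are padded from 6 to 8 bytes so that the
// 4-byte members stay aligned, and a key/value pair still occupies 16 bytes.
static_assert(sizeof(Key46) == 8, "");
static_assert(sizeof(Value42) == 8, "");
static_assert(sizeof(std::pair<Key46, Value42>) == 16, "");
```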

Next, I’ll look into flushing the h_vector to disk. That should hopefully work.

Sounds good

marcelm commented 1 year ago

Closing this issue because we achieved memory reduction in #191. (We still need to discuss 32-bit hashes, but can do that in #27.)