ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
151 stars 17 forks source link

Use of data prefetching? #203

Open jkbonfield opened 1 year ago

jkbonfield commented 1 year ago

This is just an idea, which may go nowhere at all. :)

Profiling strobealign I see the most CPU hungry function is find_nams, and the most CPU intensive bit of that I think is

https://github.com/ksahlin/strobealign/blob/main/src/aln.cpp#L282

 for (auto &q : query_mers) {
        auto ref_hit = index.find(q.hash);
        if (ref_hit != index.end()) {

Perf record / report shows (I've no idea how to get this working with debug info and cmake):

  0.00 │       shr    $0x21,%rdx                                                    ▒
  0.06 │       xor    %rdx,%rax                                                     ▒
  0.13 │       mov    %rax,%rdx                                                     ▒
  0.04 │       shr    $0x5,%rax                                                     ▒
  0.01 │       and    %r11,%rax                                                     ▒
  0.01 │       and    $0x1f,%edx                                                    ▒
  0.06 │       shr    %cl,%rdx                                                      ▒
  0.07 │       lea    (%r8,%rax,1),%rcx                                             ▒
  0.06 │       shl    $0x4,%rax                                                     ▒
  0.08 │       add    %esi,%edx                                                     ▒
  0.01 │       add    %r10,%rax                                                     ▒
  0.18 │110:   movzbl (%rcx),%r13d                                                  ▒
 39.06 │       mov    %rcx,%r9                                                      ▒
  0.03 │       cmp    %edx,%r13d                                                    ▒
  0.00 │       je     4416d0 <find_nams(std::vector<nam, std::allocator<nam> >&, std▒
  0.98 │11c:   movzbl 0x1(%rcx),%r13d                                               ▒
  0.37 │       add    %esi,%edx                                                     ▒
  0.00 │       cmp    %edx,%r13d                                                    ▒
       │       je     441a2f <find_nams(std::vector<nam, std::allocator<nam> >&, std▒
  0.32 │12c:   movzbl 0x2(%r9),%r9d                                                 ▒
  0.16 │       add    %esi,%edx                                                     ▒
       │       add    $0x2,%rcx                                                     ▒
       │       add    $0x20,%rax                                                    ▒

Ie 39% of all CPU for this function is spent waiting on one of those memory moves (it's sometimes the instruction before or two than the one reported, due to pipelining). This is to be expected in any application which is using a large hash table and randomly jump around main memory. I expect perf stat would tell me it's frontend or backend idle times, but sadly this machine is virtual and isn't exposing individual hardware CPU counters.

I've had experience elsewhere on speeding up memory fetches by computing the address that's going to be used in a couple loop iterations time and manually doing a hardware prefetch. It's then in cache by the time we get around to using it.

Here we may even be able to do this by doing something like:

auto ref_hit = index.find(q.hash);
if (ref_hit_last != index.end()) {
...
}
ref_hit_last = ref_hit;

So while it's fetching the next hit it's processing the previous one.

I'm a low level C coder though and unfortunately know nothing about C++, so how to get things like next value out of an implicit iterator is beyond me.

Has anyone looked at the possibility of improving instruction pipelining by prefetching memory addresses? I can't say how much difference it'll make without trying it, but as I say doing that in C++ is beyond my knowledge.

marcelm commented 1 year ago

Hi James,

good idea! I was actually doing some profiling yesterday (on a physical machine) and looked at that exact line in the perf output, but didn’t know what to do about it. And thanks for confirming this is due to memory access; I had a suspicion but wasn’t quite sure about this. Which genome did you use for the profiling? I tested with D. melanogaster and didn’t get quite such a large percentage for that line. Also, the find_nams function was only responsible for ~10% of the total runtime. But that only means that mapping against Drosphila won’t benefit as much if we can optimize this.

Perf record / report shows (I've no idea how to get this working with debug info and cmake):

You just need to run CMake with -DCMAKE_BUILD_TYPE=RelWithDebInfo and then it’ll work. (At the moment, this also enables assertions, but I may change that, and it doesn’t slow down things anyway IIRC.)

[...]

auto ref_hit = index.find(q.hash);
if (ref_hit_last != index.end()) {
...
}
ref_hit_last = ref_hit;

So while it's fetching the next hit it's processing the previous one.

I’ll try that.

marcelm commented 1 year ago

I’m not sure entirely, but I think most of the time for my test dataset is not spent not in the hash table lookups, but in the code after it (the for (auto &it : hits_per_ref) { loop). I need to look a bit closer, though, because perf report output of optimized code is hard to interpret even with debug symbols. And I should test with a larger genome.

marcelm commented 1 year ago

I made a mistake in my first assesment. The loop with the hash table lookups is responsible for a large chunk of the runtime of find_nams after all.

Now the problem is that I need to figure out how to actually prefetch the next hash table entry. I tried your suggestion above, but there’s no difference. I think we would need to issue an actual prefetch instruction that makes the memory access in the background while the CPU continues to do its work.

I’ll need to come back to this later.