Closed rhpvorderman closed 1 year ago
This didn't help performance. It also made the code slightly more complex: the __next__ call got easier, but that was offset by added complexity elsewhere. So this experiment can be considered failed.
Thanks for doing that experiment and documenting it here.
In the meantime I did another one: what if I only count the newlines but do not index them.
The result is much simpler code in the __next__ call, with only a conceptually simple count_newlines call (people can infer its functionality from the function name, so it does not add to cognitive complexity). By counting newlines we can also ensure a FASTQ record is complete before parsing it. This makes the while loop and all the NULL checks obsolete. The disadvantage is that, in essence, you are doing the same work twice (the newlines are also counted by memchr), and this adds more than 10% overhead. That can be reduced using a customized SIMD routine, but that sort of defeats the purpose of getting simpler code.
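The counting-first idea can be sketched in plain C roughly like this (count_newlines here is my own minimal version for illustration, not the dnaio implementation):

```c
#include <stddef.h>
#include <string.h>

/* Sketch: count newlines in buf with repeated memchr calls. Counting
   first means a complete FASTQ record (4 lines) is known to be present
   before any parsing happens, so the parser needs no NULL checks. */
static size_t count_newlines(const char *buf, size_t size)
{
    size_t count = 0;
    const char *pos = buf;
    const char *end = buf + size;
    while (pos < end) {
        const char *nl = memchr(pos, '\n', (size_t)(end - pos));
        if (nl == NULL)
            break;
        count += 1;
        pos = nl + 1;
    }
    return count;
}
```

A buffer holding one full FASTQ record yields a count of at least 4, which is the "record is complete" signal.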
The resulting __next__ code is here: https://github.com/rhpvorderman/dnaio/blob/78174ac701be61f5691e6a0cf2b3c4b90d83f3d9/src/dnaio/_core.pyx#LL502C1-L599C23
It also resulted in a very fast SIMD-enhanced newline counter. I am posting it here so it may be reused for some other purpose later. I initially thought of using the popcount and movemask instructions, but popcount is not part of SSE2. I think the following solution is much faster anyway: https://github.com/rhpvorderman/dnaio/blob/78174ac701be61f5691e6a0cf2b3c4b90d83f3d9/src/dnaio/count_newlines_sse2.h
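For reference, the general shape of such an SSE2 counter (a hedged sketch of the technique, not the linked dnaio code) could look like this: each '\n' match produces 0xFF (i.e. -1) in its byte lane, so subtracting the comparison result from a per-byte accumulator adds 1 per match, and the accumulator is flushed before any byte lane can overflow.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

static size_t count_newlines_sse2(const char *buf, size_t size)
{
    const __m128i newline = _mm_set1_epi8('\n');
    size_t count = 0;
    const char *pos = buf;
    const char *vec_end = buf + (size / 16) * 16;

    while (pos < vec_end) {
        __m128i accumulator = _mm_setzero_si128();
        /* At most 255 iterations so no byte lane can wrap around. */
        int iterations = 0;
        while (pos < vec_end && iterations < 255) {
            __m128i chunk = _mm_loadu_si128((const __m128i *)pos);
            __m128i matches = _mm_cmpeq_epi8(chunk, newline); /* 0xFF per '\n' */
            accumulator = _mm_sub_epi8(accumulator, matches); /* -(-1) == +1 */
            pos += 16;
            iterations += 1;
        }
        /* Flush: spill the 16 per-byte counts and sum them. */
        uint8_t accumulated_counts[16];
        _mm_storeu_si128((__m128i *)accumulated_counts, accumulator);
        for (int i = 0; i < 16; i++)
            count += accumulated_counts[i];
    }
    /* Scalar tail for the last size % 16 bytes. */
    for (const char *p = pos; p < buf + size; p++)
        if (*p == '\n')
            count += 1;
    return count;
}
```

The flush only runs once per (at most) 255 × 16 bytes, which is why its cost barely matters.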
The reason I have time to do this is that I am wrapping up a lot of work today before going on holiday (to move house). So in between meetings and e-mails I decided to do some educational hacking.
Nice educational hacking you did there :smile:! I’m just getting started again after my holidays.
I looked at the SIMD newline-counting function just to learn a bit about SSE intrinsics. I am not quite sure, but is it possible that the count += accumulated_counts[i]; loop is quite inefficient the way GCC compiles it? See https://godbolt.org/z/EGax1G4fG, lines 56-105 in the assembly output. Stack Overflow says that _mm_sad_epu8 may help.
I am not quite sure, but is it possible that the count += accumulated_counts[i]; loop is quite inefficient the way GCC compiles it?
It unrolls the loop. That is actually quite efficient as it eliminates branches from the code. It only happens every 2000-something bytes or so, so efficiency is not that big of a deal.
Thanks! That is great! That looks so much more convenient. I will have a look at it.
It unrolls the loop. That is actually quite efficient as it eliminates branches from the code.
Sorry I didn’t make this clear: I know, my point was that it works on each byte individually instead of in a vectorized fashion.
It only happens every 2000-something bytes or so, so efficiency is not that big of a deal.
Thanks, that’s the point I didn’t get, so then it shouldn’t be an issue.
Thanks, that’s the point I didn’t get, so then it shouldn’t be an issue
I blame that on unclear commenting on my part.
The following updated version adds some more comments, clear vector variable names, and uses _mm_sad_epu8 as per your excellent find. I knew there were instructions with hadd (horizontal add) in the name, and that these were not in SSE2, so I figured it was not possible. But I was wrong.
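A minimal sketch of the _mm_sad_epu8 horizontal sum (function name is mine): the instruction computes the sum of absolute differences against zero per 8-byte half, leaving two partial sums in the low 16 bits of each 64-bit lane, which replaces the 16-iteration scalar loop.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Horizontally sum 16 unsigned byte counters with one _mm_sad_epu8. */
static uint32_t sum_bytes_sse2(__m128i byte_counts)
{
    __m128i sums = _mm_sad_epu8(byte_counts, _mm_setzero_si128());
    uint32_t low  = (uint32_t)_mm_cvtsi128_si32(sums);
    uint32_t high = (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(sums, 8));
    return low + high;
}
```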
This looks much better in the compiler explorer. Unfortunately I can't benchmark it properly on this laptop and I haven't assembled my home PC yet (lots of it is still in boxes).
I do wonder, given that modern processors have branch prediction etc., whether it would be faster to include _mm_sad_epu8 in the inner loop and then add directly to a 64-bit integer. That would make the outer loop redundant and the code much simpler. On paper it should be slower, but with modern CPUs I don't know if counting latencies is good enough when the code is simpler and the branches can be predicted a lot better.
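That simpler variant might look roughly like this (my own sketch, not benchmarked): negate the 0xFF match bytes to get 1 per match, horizontally sum them with _mm_sad_epu8 every iteration, and add straight to a 64-bit counter, so no per-byte accumulator and no outer flush loop are needed.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#include <stddef.h>

static uint64_t count_newlines_sad(const char *buf, size_t size)
{
    const __m128i newline = _mm_set1_epi8('\n');
    const __m128i zero = _mm_setzero_si128();
    uint64_t count = 0;
    const char *pos = buf;
    const char *vec_end = buf + (size / 16) * 16;

    while (pos < vec_end) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)pos);
        /* 0xFF per '\n'; negating turns each match into a 1. */
        __m128i ones = _mm_sub_epi8(zero, _mm_cmpeq_epi8(chunk, newline));
        /* Each 64-bit lane now holds the sum of its 8 bytes. */
        __m128i sums = _mm_sad_epu8(ones, zero);
        count += (uint64_t)_mm_cvtsi128_si32(sums);
        count += (uint64_t)_mm_cvtsi128_si32(_mm_srli_si128(sums, 8));
        pos += 16;
    }
    /* Scalar tail for the last size % 16 bytes. */
    for (const char *p = pos; p < buf + size; p++)
        if (*p == '\n')
            count += 1;
    return count;
}
```

The extra _mm_sad_epu8 per 16 bytes is the cost that makes it slower on paper.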
Thanks, and cool that it works with _mm_sad_epu8.
I am back and can now benchmark on my own machine again. Unfortunately the _mm_sad_epu8 solution is slower. I did a count and found that if the compiler were very naive and assigned a new vector register to each differently named variable, it would need 9 vector registers (out of the 8 available). The _mm_storeu option is faster and easier to understand: https://github.com/rhpvorderman/dnaio/blob/b688c50cdeef807e61a694af9bbc9a45ac0c313f/src/dnaio/count_newlines_sse2.h
Welcome back and thanks for testing! As you explained to me, since that part of the code isn’t called that often, any speedup wouldn’t have been that large anyway, so I guess it doesn’t matter. But good you added the comment.
Currently the __next__ function is quite messy. Memchr is called several times, and each time a NULL check needs to be performed. Any of these conditions might exit the while loop that is put in place.

Alternatively, the update_buffer call can be used to create a newline index. The __next__ call can then simply get the next four lines from the index and yield a record.

Advantages: a much simpler __next__ call, which now uses a while loop to allow for repeated buffer resizings. Instead, the newline indexing step can indicate whether there is an entire record in the buffer, which means update_buffer can guarantee that at least one FASTQ record is present. That makes the while loop redundant. Also a lot fewer NULL checks; basically only FASTQ integrity checks.

Disadvantages: the record data needs to be visited again for the integrity (@ and +) and \r checks. This means these cache lines need to be fetched from memory/cache again. In contrast, they were probably already populated with the correct memory in our current solution.

In case there is no notable speedup, a reduction in code complexity is also nice (taking the line diff as a measure).
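A hypothetical sketch of what "get the next four lines from the index" could look like (all names here are mine, not dnaio's): with a precomputed array of newline offsets, slicing out one record needs no memchr calls and no NULL checks, only the integrity checks on the record contents.

```c
#include <stddef.h>

struct line { const char *start; size_t length; };

/* record_number: 0-based index of the FASTQ record in the buffer.
   newline_index: offsets of every '\n' in buf, in ascending order,
   guaranteed by the indexing step to cover at least this record. */
static void get_fastq_record(const char *buf, const size_t *newline_index,
                             size_t record_number, struct line lines[4])
{
    size_t first_line = record_number * 4;
    size_t line_start =
        first_line == 0 ? 0 : newline_index[first_line - 1] + 1;
    for (size_t i = 0; i < 4; i++) {
        size_t line_end = newline_index[first_line + i];
        lines[i].start = buf + line_start;
        lines[i].length = line_end - line_start;  /* excludes the '\n' */
        line_start = line_end + 1;
    }
}
```

Integrity checks (lines[0] starts with '@', lines[2] with '+', no trailing '\r') would then be the only conditionals left in the record loop.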