Closed rhpvorderman closed 1 year ago
This didn't help performance. It also made the code slightly more complex: the __next__ call got easier, but that was offset by added complexity elsewhere. So this experiment can be considered failed.
Thanks for doing that experiment and documenting it here.
In the meantime I did another one: what if I only count the newlines but do not index them.
The result is much simpler code in the __next__ call, with only a conceptually simple count_newlines call (people can infer its functionality from the function name, so it does not add to cognitive complexity). By counting newlines we can also ensure a FASTQ record is complete before parsing it. This makes the while loop and all the NULL checks obsolete. The disadvantage is that, in essence, you are doing the same work twice (the newlines are also counted by memchr), and this adds more than 10% overhead. That can be reduced using a customized SIMD routine, but that sort of defeats the purpose of getting simpler code.
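The counting-first idea can be sketched in plain C roughly like this (count_newlines here is my own minimal version for illustration, not the dnaio implementation):

```c
#include <stddef.h>
#include <string.h>

/* Sketch: count newlines in buf with repeated memchr calls. Counting
   first means a complete FASTQ record (4 lines) is known to be present
   before any parsing happens, so the parser needs no NULL checks. */
static size_t count_newlines(const char *buf, size_t size)
{
    size_t count = 0;
    const char *pos = buf;
    const char *end = buf + size;
    while (pos < end) {
        const char *nl = memchr(pos, '\n', (size_t)(end - pos));
        if (nl == NULL)
            break;
        count += 1;
        pos = nl + 1;
    }
    return count;
}
```

A buffer holding one full FASTQ record yields a count of at least 4, which is the "record is complete" signal.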
The resulting __next__ code is here: https://github.com/rhpvorderman/dnaio/blob/78174ac701be61f5691e6a0cf2b3c4b90d83f3d9/src/dnaio/_core.pyx#LL502C1-L599C23
It also resulted in a very fast SIMD-enhanced newline counter. I am posting it here so it may be reused for some other purpose later. I initially thought of using the popcount and movemask instructions, but popcount is not part of SSE2. I think the following solution is much faster anyway: https://github.com/rhpvorderman/dnaio/blob/78174ac701be61f5691e6a0cf2b3c4b90d83f3d9/src/dnaio/count_newlines_sse2.h
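For reference, the general shape of such an SSE2 counter (a hedged sketch of the technique, not the linked dnaio code) could look like this: each '\n' match produces 0xFF (i.e. -1) in its byte lane, so subtracting the comparison result from a per-byte accumulator adds 1 per match, and the accumulator is flushed before any byte lane can overflow.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

static size_t count_newlines_sse2(const char *buf, size_t size)
{
    const __m128i newline = _mm_set1_epi8('\n');
    size_t count = 0;
    const char *pos = buf;
    const char *vec_end = buf + (size / 16) * 16;

    while (pos < vec_end) {
        __m128i accumulator = _mm_setzero_si128();
        /* At most 255 iterations so no byte lane can wrap around. */
        int iterations = 0;
        while (pos < vec_end && iterations < 255) {
            __m128i chunk = _mm_loadu_si128((const __m128i *)pos);
            __m128i matches = _mm_cmpeq_epi8(chunk, newline); /* 0xFF per '\n' */
            accumulator = _mm_sub_epi8(accumulator, matches); /* -(-1) == +1 */
            pos += 16;
            iterations += 1;
        }
        /* Flush: spill the 16 per-byte counts and sum them. */
        uint8_t accumulated_counts[16];
        _mm_storeu_si128((__m128i *)accumulated_counts, accumulator);
        for (int i = 0; i < 16; i++)
            count += accumulated_counts[i];
    }
    /* Scalar tail for the last size % 16 bytes. */
    for (const char *p = pos; p < buf + size; p++)
        if (*p == '\n')
            count += 1;
    return count;
}
```

The flush only runs once per (at most) 255 × 16 bytes, which is why its cost barely matters.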
The reason I have time to do this is that I am wrapping up a lot of work today before going on holiday (to move house). So in between meetings and e-mails I decided to do some educational hacking.
Nice educational hacking you did there :smile:! I’m just getting started again after my holidays.
I looked at the SIMD newline-counting function just to learn a bit about SSE intrinsics. I am not quite sure, but is it possible that the count += accumulated_counts[i]; loop is quite inefficient the way GCC compiles it? See https://godbolt.org/z/EGax1G4fG, lines 56-105 in the assembly output. Stack Overflow says that _mm_sad_epu8 may help.
I am not quite sure, but is it possible that the count += accumulated_counts[i]; loop is quite inefficient the way GCC compiles it?
It unrolls the loop. That is actually quite efficient as it eliminates branches from the code. It only happens every 2000-something bytes or so, so efficiency is not that big of a deal.
Thanks! That is great! That looks so much more convenient. I will have a look at it.
It unrolls the loop. That is actually quite efficient as it eliminates branches from the code.
Sorry I didn’t make this clear: I know, my point was that it works on each byte individually instead of in a vectorized fashion.
It only happens every 2000-something bytes or so, so efficiency is not that big of a deal.
Thanks, that’s the point I didn’t get, so then it shouldn’t be an issue.
Thanks, that’s the point I didn’t get, so then it shouldn’t be an issue
I blame that on unclear commenting on my part.
The following updated version adds some more comments, clear vector variable names, and uses _mm_sad_epu8 as per your excellent find. I knew there were instructions with hadd (horizontal add) in the name, and that these were not in SSE2, so I figured it was not possible. But I was wrong.
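A minimal sketch of the _mm_sad_epu8 horizontal sum (function name is mine): the instruction computes the sum of absolute differences against zero per 8-byte half, leaving two partial sums in the low 16 bits of each 64-bit lane, which replaces the 16-iteration scalar loop.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Horizontally sum 16 unsigned byte counters with one _mm_sad_epu8. */
static uint32_t sum_bytes_sse2(__m128i byte_counts)
{
    __m128i sums = _mm_sad_epu8(byte_counts, _mm_setzero_si128());
    uint32_t low  = (uint32_t)_mm_cvtsi128_si32(sums);
    uint32_t high = (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(sums, 8));
    return low + high;
}
```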
This looks much better in the compiler explorer. Unfortunately I can't benchmark it properly on this laptop and I haven't assembled my home PC yet (lots of it is still in boxes).
I do wonder, given that modern processors have branch prediction etc., whether it would be faster to include _mm_sad_epu8 in the inner loop and then add directly to a 64-bit integer. That would make the outer loop redundant and the code much simpler. On paper it should be slower, but with modern CPUs I don't know if counting latencies is good enough when the code is simpler and the branches can be predicted a lot better.
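That simpler variant might look roughly like this (my own sketch, not benchmarked): negate the 0xFF match bytes to get 1 per match, horizontally sum them with _mm_sad_epu8 every iteration, and add straight to a 64-bit counter, so no per-byte accumulator and no outer flush loop are needed.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#include <stddef.h>

static uint64_t count_newlines_sad(const char *buf, size_t size)
{
    const __m128i newline = _mm_set1_epi8('\n');
    const __m128i zero = _mm_setzero_si128();
    uint64_t count = 0;
    const char *pos = buf;
    const char *vec_end = buf + (size / 16) * 16;

    while (pos < vec_end) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)pos);
        /* 0xFF per '\n'; negating turns each match into a 1. */
        __m128i ones = _mm_sub_epi8(zero, _mm_cmpeq_epi8(chunk, newline));
        /* Each 64-bit lane now holds the sum of its 8 bytes. */
        __m128i sums = _mm_sad_epu8(ones, zero);
        count += (uint64_t)_mm_cvtsi128_si32(sums);
        count += (uint64_t)_mm_cvtsi128_si32(_mm_srli_si128(sums, 8));
        pos += 16;
    }
    /* Scalar tail for the last size % 16 bytes. */
    for (const char *p = pos; p < buf + size; p++)
        if (*p == '\n')
            count += 1;
    return count;
}
```

The extra _mm_sad_epu8 per 16 bytes is the cost that makes it slower on paper.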
Thanks, and cool that it works with _mm_sad_epu8.
I am back and can now benchmark on my own machine again. Unfortunately the _mm_sad_epu8 solution is slower. I did a count and found that if the compiler were very naive and assigned a new vector register to each differently named variable, it would need 9 vector registers (out of the 8 available). The _mm_storeu option is faster and easier to understand: https://github.com/rhpvorderman/dnaio/blob/b688c50cdeef807e61a694af9bbc9a45ac0c313f/src/dnaio/count_newlines_sse2.h
Welcome back and thanks for testing! As you explained to me, since that part of the code isn’t called that often, any speedup wouldn’t have been that large anyway, so I guess it doesn’t matter. But good you added the comment.
Currently the __next__ function is quite messy. Memchr is called several times, and each time a NULL check needs to be performed. Any of these conditions might exit the while loop that is put in place.

Alternatively, the update_buffer call can be used to create a newline index. The __next__ call can then simply get the next four lines from the index and yield a record.

Advantages: a much simpler __next__ call, which now uses a while loop to allow for repeated buffer resizings. Instead, the newline indexing step can indicate whether there is an entire record in the buffer, which means update_buffer can guarantee that at least one FASTQ record is present. That makes the while loop redundant. Also a lot fewer NULL checks; basically only FASTQ integrity checks.

Disadvantages: the record data needs to be visited again for the integrity (@ and +) and \r checks. This means these cache lines need to be fetched from memory/cache again. In contrast, they were probably already populated with the correct memory in our current solution.

In case there is no notable speedup, a reduction in code complexity is also nice (taking the line diff as a measure).
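A hypothetical sketch of what "get the next four lines from the index" could look like (all names here are mine, not dnaio's): with a precomputed array of newline offsets, slicing out one record needs no memchr calls and no NULL checks, only the integrity checks on the record contents.

```c
#include <stddef.h>

struct line { const char *start; size_t length; };

/* record_number: 0-based index of the FASTQ record in the buffer.
   newline_index: offsets of every '\n' in buf, in ascending order,
   guaranteed by the indexing step to cover at least this record. */
static void get_fastq_record(const char *buf, const size_t *newline_index,
                             size_t record_number, struct line lines[4])
{
    size_t first_line = record_number * 4;
    size_t line_start =
        first_line == 0 ? 0 : newline_index[first_line - 1] + 1;
    for (size_t i = 0; i < 4; i++) {
        size_t line_end = newline_index[first_line + i];
        lines[i].start = buf + line_start;
        lines[i].length = line_end - line_start;  /* excludes the '\n' */
        line_start = line_end + 1;
    }
}
```

Integrity checks (lines[0] starts with '@', lines[2] with '+', no trailing '\r') would then be the only conditionals left in the record loop.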