samtools / htslib

C library for high-throughput sequencing data formats

Make nibble2base faster using x86-64 pshufb instruction (SSSE3) and using dynamic dispatch. #1764

Closed rhpvorderman closed 1 month ago

rhpvorderman commented 3 months ago

See https://github.com/samtools/htslib/pull/1677 for prior discussion.

This PR makes nibble2base dynamically dispatched on x86-64 CPUs that support SSSE3 instructions.

No build options need to change, and nibble2base gets a nice speed-up on the majority of the install base.
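In outline, the approach looks like this (a simplified sketch with illustrative names, not the exact PR code; the real version differs in structure and edge-case handling):

```c
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 */
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (pshufb) */

/* BAM 4-bit code -> base character (htslib's seq_nt16_str table). */
static const char code2base[16] = "=ACMGRSVTWYHKDBN";

/* Portable fallback: one table lookup per base. */
static void nibble2base_default(const uint8_t *nib, char *seq, int n) {
    for (int i = 0; i < n; i++)
        seq[i] = code2base[(nib[i >> 1] >> ((~i & 1) << 2)) & 0xf];
}

/* SSSE3: pshufb performs 16 parallel lookups in a 16-entry table. */
__attribute__((target("ssse3")))
static void nibble2base_ssse3(const uint8_t *nib, char *seq, int n) {
    const __m128i lut  = _mm_loadu_si128((const __m128i *)code2base);
    const __m128i mask = _mm_set1_epi8(0x0f);
    int i = 0;
    for (; i + 32 <= n; i += 32) {   /* 16 packed bytes = 32 bases */
        __m128i packed = _mm_loadu_si128((const __m128i *)(nib + i / 2));
        __m128i hi = _mm_and_si128(_mm_srli_epi16(packed, 4), mask);
        __m128i lo = _mm_and_si128(packed, mask);
        __m128i hi_bases = _mm_shuffle_epi8(lut, hi);
        __m128i lo_bases = _mm_shuffle_epi8(lut, lo);
        /* The high nibble is the first base of each pair, so interleave. */
        _mm_storeu_si128((__m128i *)(seq + i),
                         _mm_unpacklo_epi8(hi_bases, lo_bases));
        _mm_storeu_si128((__m128i *)(seq + i + 16),
                         _mm_unpackhi_epi8(hi_bases, lo_bases));
    }
    for (; i < n; i++)   /* scalar tail */
        seq[i] = code2base[(nib[i >> 1] >> ((~i & 1) << 2)) & 0xf];
}

/* First call resolves the best implementation, then replaces itself. */
static void nibble2base_dispatch(const uint8_t *nib, char *seq, int n);
static void (*nibble2base)(const uint8_t *, char *, int) = nibble2base_dispatch;

static void nibble2base_dispatch(const uint8_t *nib, char *seq, int n) {
    nibble2base = __builtin_cpu_supports("ssse3") ? nibble2base_ssse3
                                                  : nibble2base_default;
    nibble2base(nib, seq, n);
}
```

The function pointer initially points at the dispatcher; the first call inspects the CPU once and swaps the pointer, so every later call goes straight to the chosen implementation.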

jkbonfield commented 2 months ago

I can confirm the code works and this function is considerably faster. Curiously, despite using SSSE3 intrinsics, the code uses different instructions if I enable -mavx2. I can also make it a bit faster on older platforms (less so on modern ones) by interleaving more loads (two 128-bit lines) to remove instruction dependency latency. I'm guessing newer CPUs take fewer cycles for some of these operations, or have quicker memory loads, so the latency vanishes. I didn't test AMD either. It's largely irrelevant though: another 10-20% speed-up of this function doesn't matter once it's no longer a significant bottleneck in the overall performance.

However it's clumsy with portability.

__builtin_cpu_supports("ssse3") does indeed need gcc 4.8 and above, but clang always pretends to be GCC 4.2 for these sorts of checks, so it wouldn't use the optimised function. We normally check two ways, e.g.:

#if HTS_COMPILER_HAS(aligned) || HTS_GCC_AT_LEAST(4,3)

I don't know what the appropriate HTS_COMPILER_HAS check for the SSSE3 CPU support would be, though. Maybe it just needs an equivalent clang version check (the builtin appears to need clang >= 3.9). Heaven help anyone attempting to use older pre-clang ICCs, but it's not a huge issue as the code just falls back to the old implementation anyway.
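Something along these lines would do it, perhaps (illustrative sketch; HTS_GCC_AT_LEAST is our existing macro, while HTS_CLANG_AT_LEAST and HAVE_BUILTIN_CPU_SUPPORTS are hypothetical names):

```c
/* clang defines __clang_major__/__clang_minor__; elsewhere the macro is 0. */
#ifdef __clang__
#define HTS_CLANG_AT_LEAST(major, minor) \
    (__clang_major__ > (major) || \
     (__clang_major__ == (major) && __clang_minor__ >= (minor)))
#else
#define HTS_CLANG_AT_LEAST(major, minor) 0
#endif

/* clang reports itself as GCC 4.2, so check the real clang version too:
 * __builtin_cpu_supports needs gcc >= 4.8 or clang >= 3.9. */
#if (!defined(__clang__) && HTS_GCC_AT_LEAST(4,8)) || HTS_CLANG_AT_LEAST(3,9)
#define HAVE_BUILTIN_CPU_SUPPORTS 1
#endif
```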

We'll probably be putting a new release together soon, but we feel this is still a bit bleeding-edge to incorporate. Thank you for the improvements. It will get merged, but please don't be put off if it doesn't make the next release.

rhpvorderman commented 2 months ago

@jkbonfield Thank you for the feedback. As you mention, this only speeds up some use cases significantly (uBAM + long reads), and only by low double-digit percentages, so it is not critical to get this out ASAP. In fact, I have already implemented it in my own code that only parses uBAM. I figured it would be nice to contribute it back to htslib so that all users may eventually enjoy it.

> I didn't test AMD either

I did. It works. This should be faster on any SSSE3-enabled platform, simply because a memory lookup is almost always slower than a register operation. The pshufb instruction has 1 cycle of latency, while L1 cache latency is typically 3 cycles, so a memory-based table can never beat it.

Thank you for all your feedback on the previous PR as well. It prompted me to learn dynamic dispatch, which is great: it enabled an AVX2 routine in one of my projects.

rhpvorderman commented 2 months ago

> I can confirm the code works and this function is considerably faster. Curiously despite using ssse3 intrinsics, that code uses different instructions if I enable -mavx2.

This is expected. It will probably use ymm rather than xmm registers, since for this purpose they are practically the same. Technically, AVX2 machines can use _mm256_shuffle_epi8, and that might compile into something even more compact.

> However it's clumsy with portability.

I have now added the appropriate clang checks as well, following https://clang.llvm.org/docs/LanguageExtensions.html#builtin-cpu-supports

The target attribute plus __builtin_cpu_supports is very powerful. I am glad that clang also has it.

rhpvorderman commented 2 months ago

I had a look at the code. Using a broadcast instruction, the 16 bytes can be loaded into both 128-bit lanes of a 256-bit vector, so only one sequence of operations is needed instead of two.
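Roughly like this (an untested sketch with illustrative names; code2base stands in for the 16-entry table from before):

```c
#include <stdint.h>
#include <immintrin.h>

/* AVX2 variant: broadcast the 16 packed bytes into both 128-bit lanes,
 * then run one 256-bit sequence instead of two 128-bit sequences. */
__attribute__((target("avx2")))
static void nibble2base_avx2(const uint8_t *nib, char *seq, int n) {
    static const char code2base[16] = "=ACMGRSVTWYHKDBN";
    const __m256i lut  = _mm256_broadcastsi128_si256(
                             _mm_loadu_si128((const __m128i *)code2base));
    const __m256i mask = _mm256_set1_epi8(0x0f);
    /* In-lane shuffle keeping bytes 0-7 in lane 0 and moving bytes 8-15
     * into lane 1 (after the broadcast, each lane holds all 16 bytes,
     * so pshufb can reach them; the duplicated upper halves are unused). */
    const __m256i spread = _mm256_setr_epi8(
        0, 1, 2, 3, 4, 5, 6, 7,  0, 1, 2, 3, 4, 5, 6, 7,
        8, 9, 10, 11, 12, 13, 14, 15,  8, 9, 10, 11, 12, 13, 14, 15);
    int i = 0;
    for (; i + 32 <= n; i += 32) {   /* 16 packed bytes = 32 bases */
        __m256i packed = _mm256_broadcastsi128_si256(
                             _mm_loadu_si128((const __m128i *)(nib + i / 2)));
        packed = _mm256_shuffle_epi8(packed, spread);
        __m256i hi = _mm256_and_si256(_mm256_srli_epi16(packed, 4), mask);
        __m256i lo = _mm256_and_si256(packed, mask);
        /* One 256-bit lookup replaces two 128-bit ones. */
        __m256i hi_bases = _mm256_shuffle_epi8(lut, hi);
        __m256i lo_bases = _mm256_shuffle_epi8(lut, lo);
        _mm256_storeu_si256((__m256i *)(seq + i),
                            _mm256_unpacklo_epi8(hi_bases, lo_bases));
    }
    for (; i < n; i++)   /* scalar tail */
        seq[i] = code2base[(nib[i >> 1] >> ((~i & 1) << 2)) & 0xf];
}
```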

The code would look cleaner, at the cost of supporting fewer machines. In theory it should be faster, but I think the gain in application performance over what is there now would be negligible. For now I think it is better to merge the code as is. I may try it out on some other projects I work on before updating the htslib code in a new PR, if it turns out to be worth it.

jkbonfield commented 2 months ago

Thanks. I'll take another look at this soon.

rhpvorderman commented 2 months ago

Sorry for the spam, but I want to close off the AVX2 possibilities. AVX2 is definitely not worth it. Tested on roughly 750MB of bases (in BAM format), the speed-up is 20ms on the Skylake Intel CPU I am currently working on. It does save 6 lines of code, but since SSSE3 is supported much more broadly and gets more than 90% of the speed gain, it is better to stick with that.

jkbonfield commented 2 months ago

I'm happy with fixing the low-hanging fruit only. Once it's "fast enough", polishing the remainder just adds complexity for little gain.

Thanks for the updates. I'll check them this week and hopefully get it merged.

rhpvorderman commented 2 months ago

It is never fast enough. As long as bioinformatics pipelines take longer than my coffee breaks, the performance is still subpar as far as I am concerned. Having said that, I don't mind taking a 20ms longer coffee break. ;-)

jkbonfield commented 1 month ago

Thanks. I've had a look at the algorithm, which is pretty simple and well documented, but very effective. The CPU detection is a bit of a change from how we've done it before, but I like the compiler auto-detection methods and automatic dispatch. As you say, they're very powerful, and maybe we can use them for other bits of code in the future where custom logic will help.

Thanks.

rhpvorderman commented 1 month ago

Thank you for not merging the initial PR, which encouraged me to look into dynamic dispatching. I have used it a few times now in other projects. It is especially useful for enabling the pshufb instruction, which is essentially a 16-entry lookup table. A lot of code contains 256-entry or 128-entry (for ASCII text) lookup tables, and these can often be rewritten with shuffle instructions. For example, DNA contains only A,C,G,T. So rather than doing a full lookup, the input can be validated quickly with vector compare instructions against A,C,G,T, merging the resulting masks with bitwise OR into an ACGT-only mask. If everything is ACGT, which we just checked with only a few instructions, we can do a bitwise AND with 0b1111, yielding 4-bit indices in the range 0-15. Since A,C,G,T are distinct in their last four bits, and we verified the vector contains only A,C,G,T, we can then apply the shuffle.

A bit verbose, but in terms of CPU instructions it saves a lot of work. I used that for converting DNA to twobit representation in Sequali. See the code here: https://github.com/rhpvorderman/sequali/blob/a39ceee8a5b25668e068c0812719429386673dc3/src/sequali/function_dispatch.h#L259
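A condensed sketch of the idea (not the Sequali code verbatim; this emits one 2-bit code per output byte, and packing four codes per byte is a further step):

```c
#include <stdbool.h>
#include <stdint.h>
#include <emmintrin.h>
#include <tmmintrin.h>

/* Translate 16 ASCII bases to 2-bit codes A=0, C=1, G=2, T=3 with one
 * pshufb, after verifying the input is ACGT-only with vector compares. */
__attribute__((target("ssse3")))
static bool acgt_to_twobit16(const uint8_t *bases, uint8_t *codes) {
    __m128i v = _mm_loadu_si128((const __m128i *)bases);
    /* Compare against each allowed base and OR the masks together. */
    __m128i is_acgt = _mm_or_si128(
        _mm_or_si128(_mm_cmpeq_epi8(v, _mm_set1_epi8('A')),
                     _mm_cmpeq_epi8(v, _mm_set1_epi8('C'))),
        _mm_or_si128(_mm_cmpeq_epi8(v, _mm_set1_epi8('G')),
                     _mm_cmpeq_epi8(v, _mm_set1_epi8('T'))));
    if (_mm_movemask_epi8(is_acgt) != 0xffff)
        return false;               /* something other than A,C,G,T */
    /* The low four bits are distinct: 'A'&15==1, 'C'&15==3, 'G'&15==7,
     * 'T'&15==4, so they can index a 16-entry pshufb table. */
    const __m128i lut = _mm_setr_epi8(
        0, 0 /*A*/, 0, 1 /*C*/, 3 /*T*/, 0, 0, 2 /*G*/,
        0, 0, 0, 0, 0, 0, 0, 0);
    __m128i idx = _mm_and_si128(v, _mm_set1_epi8(0x0f));
    _mm_storeu_si128((__m128i *)codes, _mm_shuffle_epi8(lut, idx));
    return true;
}
```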

jmarshall commented 1 month ago

Interesting discussion in the last couple of comments… 🤔

It is probably fine and the assignment to nibble2base is probably “atomic enough” on the platforms for which the code is activated, but is there a multithreading issue here?

rhpvorderman commented 1 month ago

@jmarshall That seems to be an accurate observation; I never considered that. Some locking structure in the dispatch would not hurt performance. I doubt it will cause problems though: it would only be an issue if the pointer is not written in one go, leaving a partial address at that location. I think all the bits should be updated at once, as the pointer is a native-width integer, but I am not entirely sure.

jkbonfield commented 1 month ago

A very good point, and generally something I've thought of too (hence the plethora of pthread_once calls around the codebase). Given the way this is implemented, nibble2base_dispatch can do locking and doesn't need to be super efficient, as it's a once-only affair by design.
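E.g. something like this (illustrative sketch, reusing the hypothetical names from the earlier sketch):

```c
#include <pthread.h>
#include <stdint.h>

/* pthread_once guarantees the init function has fully run, with memory
 * synchronization, before any caller proceeds past it. */
static void (*nibble2base_ptr)(const uint8_t *, char *, int);
static pthread_once_t nibble2base_once = PTHREAD_ONCE_INIT;

static void nibble2base_init(void) {
    nibble2base_ptr = __builtin_cpu_supports("ssse3") ? nibble2base_ssse3
                                                      : nibble2base_default;
}

static void nibble2base(const uint8_t *nib, char *seq, int n) {
    pthread_once(&nibble2base_once, nibble2base_init);
    nibble2base_ptr(nib, seq, n);
}
```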

Yes, it ideally does need locking, although as you say it's probably "atomic enough" to never cause an issue in practice. On x86-64, that is, which in this case is what matters due to the ifdefs. In general though, it is not acceptable to assume aligned 64-bit writes on a 64-bit system will be atomic. E.g.:

https://godbolt.org/z/vso1K7b64

(Edit: although that's still one store instruction, so maybe it is fine... It also depends on the memory back end and whether the CPU instructions themselves execute asynchronously.)

rhpvorderman commented 1 month ago

Ah, interesting. I'd like to add that while x86-64 compilers do compile a pointer write to one store instruction, that does not guarantee the memory is written in one go by the CPU. If it writes byte by byte and reads byte by byte on the back end, an issue might still occur. If I understood correctly, the smallest unit of memory that is read and written is the cache line, which is 64 bytes, so there might not be a problem at all.

Adding mutexes won't hurt, however.
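For example, with C11 atomics the hand-off is well-defined on every platform without a mutex (a sketch, again reusing the illustrative names from the first sketch):

```c
#include <stdatomic.h>
#include <stdint.h>

static void nibble2base_dispatch(const uint8_t *nib, char *seq, int n);

/* An atomic pointer can never be observed half-written, on any platform. */
static void (*_Atomic nibble2base_fn)(const uint8_t *, char *, int) =
    nibble2base_dispatch;

static void nibble2base_dispatch(const uint8_t *nib, char *seq, int n) {
    void (*chosen)(const uint8_t *, char *, int) = nibble2base_default;
#if defined(__x86_64__)
    if (__builtin_cpu_supports("ssse3"))
        chosen = nibble2base_ssse3;
#endif
    /* Relaxed ordering suffices: the worst case is that two threads both
     * run the dispatch once and store the same value. */
    atomic_store_explicit(&nibble2base_fn, chosen, memory_order_relaxed);
    chosen(nib, seq, n);
}

static inline void nibble2base(const uint8_t *nib, char *seq, int n) {
    atomic_load_explicit(&nibble2base_fn, memory_order_relaxed)(nib, seq, n);
}
```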